You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@sqoop.apache.org by Peter Hall <Pe...@quest.com> on 2011/09/02 01:52:02 UTC

RE: [sqoop-user] Support for partitioning during export into HDFS

Hi Ken,

We did initially consider an approach similar to what you suggest, but decided not to go with it due to complexities when the number of mappers is different to the number of partitions. Instead we are breaking up the blocks in the table and spreading them across all the mappers and doing ROWID range scans. So all mappers could be reading from all partitions - but they would only be reading part of each. Splitting by PARTITION may provide slightly better performance, but we don't believe it would be a huge difference.

Regards,
Peter Hall
Quest Software

Re: [sqoop-user] Support for partitioning during export into HDFS

Posted by Ken Krugler <kk...@transpac.com>.

Hi David,

On Sep 4, 2011, at 9:01pm, David Robson wrote:

> Hi Ken,
>  
> We thought you wanted each mapper to run against an individual partition – if you want to just pull out a specific partition say the JAN2010 partition - you can achieve that in OraOop currently using a WHERE clause that would only pick that partition.

> OraOop would then split the blocks of the entire table across all the mappers – the problem of course a lot of those blocks would not match the WHERE clause and hence not be selecting any rows. From my testing – Oracle seems to be smart enough that if you request rowid ranges and also specify a WHERE clause on the partition key such that no rows could match – it realises this and does not actually read the blocks. So OraOop would get maximum performance in this case (aside from the slight overhead to work out there were no blocks in that range – which shouldn’t be much in the scheme of a big import job).
>  
> The main problem here is because OraOop is splitting across all the partitions – it could potentially be unbalanced and some mappers would process more rows than others. Are you experiencing problems with mappers being unbalanced? You can try changing the oraoop.block.allocation property to RANDOM (by default it is ROUNDROBIN).

No, I don't believe so - at least not when OraOop is actually getting used.

> Also Oracle should not spawn any parallel queries for OraOop queries – we have specifically disabled it as it would cause overhead on the database.

OK, great - that's good to know.

> So in summary – to get the best possible performance you should run one OraOop job that fetches all the rows you want – and OraOop should balance the data across all the mappers and try to minimise the load on the DB by only reading each block once. The ROWIDs we scan also have the partition ID in them so we only ever scan one partition at a time – then UNION ALL the results together – as you have probably seen if you trace a session.
>  
> If there is a particular import job you are trying to tune you could send us some details about it – some screenshots from OEM and log files etc and we could have a look at it with you.

Thanks for that offer.

We're improving performance (up to 50M rows/min best case) by tuning other parameters, so feeling better about the status.

Regards,

-- Ken


> From: Ken Krugler [mailto:kkrugler_lists@transpac.com] 
> Sent: Sunday, 4 September 2011 6:49 AM
> To: sqoop-user@incubator.apache.org
> Subject: Re: [sqoop-user] Support for partitioning during export into HDFS
>  
> Hi Peter,
>  
> On Sep 1, 2011, at 4:52pm, Peter Hall wrote:
> 
> 
> Hi Ken,
> 
> We did initially consider an approach similar to what you suggest, but decided not to go with it due to complexities when the number of mappers is different to the number of partitions.
>  
> OK - but note that what I'm asking for is the ability to restrict a given Sqoop import request to one partition.
>  
> We can run multiple of these in parallel, if that would improve our throughput.
>  
> Given that we've got billions of rows coming from a single DB, we're looking to maximize performance here, thus the interest in this topic.
> 
> 
> Instead we are breaking up the blocks in the table and spreading them across all the mappers and doing ROWID range scans. So all mappers could be reading from all partitions - but they would only be reading part of each.
>  
> Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to spawn 16 parallel queries (one per partition)?
>  
> Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we have num mappers == num partitions?
>  
> So the queries in all the other partitions would match nothing.
>  
> Just wanting to make sure I understand what's happening under the hood, here.
>  
> Thanks,
>  
> -- Ken
> 
> 
> Splitting by PARTITION may provide slightly better performance, but we don't believe it would be a huge difference.
> 
> Regards,
> Peter Hall
> Quest Software
>  
> Hi there,
> 
> For maximum performance when pulling data, it seems like we'd want to run multiple Sqoops in parallel against the available partitions in a table.
> 
> That would require adding 'PARTITION <partition_name> to the select statement, something like:
> 
> select * from <table_name> PARTITION <partition_name> where <condition>;
> 
> 1. Does this make sense, both for general Sqoop and specifically OraOop?
> 
> 2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?
> 
> Thanks,
> 
> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
> 
> 
>  
> --------------------------------------------
> http://about.me/kkrugler
> +1 530-210-6378
>  
>  
> 
>  

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

RE: [sqoop-user] Support for partitioning during export into HDFS

Posted by David Robson <Da...@quest.com>.

Hi Ken,

We thought you wanted each mapper to run against an individual partition - if you want to just pull out a specific partition say the JAN2010 partition - you can achieve that in OraOop currently using a WHERE clause that would only pick that partition.

OraOop would then split the blocks of the entire table across all the mappers - the problem of course a lot of those blocks would not match the WHERE clause and hence not be selecting any rows. From my testing - Oracle seems to be smart enough that if you request rowid ranges and also specify a WHERE clause on the partition key such that no rows could match - it realises this and does not actually read the blocks. So OraOop would get maximum performance in this case (aside from the slight overhead to work out there were no blocks in that range - which shouldn't be much in the scheme of a big import job).

The main problem here is because OraOop is splitting across all the partitions - it could potentially be unbalanced and some mappers would process more rows than others. Are you experiencing problems with mappers being unbalanced? You can try changing the oraoop.block.allocation property to RANDOM (by default it is ROUNDROBIN).

Also Oracle should not spawn any parallel queries for OraOop queries - we have specifically disabled it as it would cause overhead on the database.

So in summary - to get the best possible performance you should run one OraOop job that fetches all the rows you want - and OraOop should balance the data across all the mappers and try to minimise the load on the DB by only reading each block once. The ROWIDs we scan also have the partition ID in them so we only ever scan one partition at a time - then UNION ALL the results together - as you have probably seen if you trace a session.

If there is a particular import job you are trying to tune you could send us some details about it - some screenshots from OEM and log files etc and we could have a look at it with you.

David

From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
Sent: Sunday, 4 September 2011 6:49 AM
To: sqoop-user@incubator.apache.org
Subject: Re: [sqoop-user] Support for partitioning during export into HDFS

Hi Peter,

On Sep 1, 2011, at 4:52pm, Peter Hall wrote:


Hi Ken,

We did initially consider an approach similar to what you suggest, but decided not to go with it due to complexities when the number of mappers is different to the number of partitions.

OK - but note that what I'm asking for is the ability to restrict a given Sqoop import request to one partition.

We can run multiple of these in parallel, if that would improve our throughput.

Given that we've got billions of rows coming from a single DB, we're looking to maximize performance here, thus the interest in this topic.


Instead we are breaking up the blocks in the table and spreading them across all the mappers and doing ROWID range scans. So all mappers could be reading from all partitions - but they would only be reading part of each.

Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to spawn 16 parallel queries (one per partition)?

Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we have num mappers == num partitions?

So the queries in all the other partitions would match nothing.

Just wanting to make sure I understand what's happening under the hood, here.

Thanks,

-- Ken


Splitting by PARTITION may provide slightly better performance, but we don't believe it would be a huge difference.

Regards,
Peter Hall
Quest Software

Hi there,

For maximum performance when pulling data, it seems like we'd want to run multiple Sqoops in parallel against the available partitions in a table.

That would require adding 'PARTITION <partition_name> to the select statement, something like:

select * from <table_name> PARTITION <partition_name> where <condition>;

1. Does this make sense, both for general Sqoop and specifically OraOop?

2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?

Thanks,

-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378

Re: [sqoop-user] Support for partitioning during export into HDFS

Posted by Ken Krugler <kk...@transpac.com>.

Hi Peter,

On Sep 1, 2011, at 4:52pm, Peter Hall wrote:

> Hi Ken,
> 
> We did initially consider an approach similar to what you suggest, but decided not to go with it due to complexities when the number of mappers is different to the number of partitions.

OK - but note that what I'm asking for is the ability to restrict a given Sqoop import request to one partition.

We can run multiple of these in parallel, if that would improve our throughput.

Given that we've got billions of rows coming from a single DB, we're looking to maximize performance here, thus the interest in this topic.

> Instead we are breaking up the blocks in the table and spreading them across all the mappers and doing ROWID range scans. So all mappers could be reading from all partitions - but they would only be reading part of each. 

Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to spawn 16 parallel queries (one per partition)?

Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we have num mappers == num partitions?

So the queries in all the other partitions would match nothing.

Just wanting to make sure I understand what's happening under the hood, here.

Thanks,

-- Ken

> Splitting by PARTITION may provide slightly better performance, but we don't believe it would be a huge difference.
> 
> Regards,
> Peter Hall
> Quest Software

>> Hi there,
>> 
>> For maximum performance when pulling data, it seems like we'd want to run multiple Sqoops in parallel against the available partitions in a table.
>> 
>> That would require adding 'PARTITION <partition_name> to the select statement, something like:
>> 
>> select * from <table_name> PARTITION <partition_name> where <condition>;
>> 
>> 1. Does this make sense, both for general Sqoop and specifically OraOop?
>> 
>> 2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?
>> 
>> Thanks,
>> 
>> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
> 
> 

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378