You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Aaron Kimball (JIRA)" <ji...@apache.org> on 2009/06/04 03:04:07 UTC

[jira] Created: (HADOOP-5967) Sqoop should only use a single map task

Sqoop should only use a single map task
---------------------------------------

                 Key: HADOOP-5967
                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
             Project: Hadoop Core
          Issue Type: Improvement
            Reporter: Aaron Kimball
            Assignee: Aaron Kimball
            Priority: Minor
         Attachments: single-mapper.patch

The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.

This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5967) Sqoop should only use a single map task

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717330#action_12717330 ] 

Aaron Kimball commented on HADOOP-5967:
---------------------------------------

Hudson's test failures are unrelated...


> Sqoop should only use a single map task
> ---------------------------------------
>
>                 Key: HADOOP-5967
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
>             Project: Hadoop Core
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>            Priority: Minor
>         Attachments: single-mapper.patch
>
>
> The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
> This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5967) Sqoop should only use a single map task

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716864#action_12716864 ] 

Hadoop QA commented on HADOOP-5967:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12409836/single-mapper.patch
  against trunk revision 782083.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/console

This message is automatically generated.

> Sqoop should only use a single map task
> ---------------------------------------
>
>                 Key: HADOOP-5967
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
>             Project: Hadoop Core
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>            Priority: Minor
>         Attachments: single-mapper.patch
>
>
> The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
> This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5967) Sqoop should only use a single map task

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716131#action_12716131 ] 

Scott Carey commented on HADOOP-5967:
-------------------------------------

Some databases optimize multiple queries doing sequential scans on the same table at the same time by having them 'tag along' with the same sequential scan (Postgres, at least) which avoids the O( N^2 ) issue.  But LIMIT ... OFFSET is not guaranteed to return distinct, consistent partitions unless it has an ORDER BY clause and is in the same transaction anyway.

> Sqoop should only use a single map task
> ---------------------------------------
>
>                 Key: HADOOP-5967
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
>             Project: Hadoop Core
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>            Priority: Minor
>         Attachments: single-mapper.patch
>
>
> The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
> This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5967) Sqoop should only use a single map task

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-5967:
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

+1

I've just committed this. Thanks Aaron!

> Sqoop should only use a single map task
> ---------------------------------------
>
>                 Key: HADOOP-5967
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: single-mapper.patch
>
>
> The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
> This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5967) Sqoop should only use a single map task

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716186#action_12716186 ] 

Aaron Kimball commented on HADOOP-5967:
---------------------------------------

An ORDER BY clause is included in DBInputFormat's SQL statements that it sends over JDBC. But each mapper (necessarily) runs in a separate transaction, as it's on a separate node.

> Sqoop should only use a single map task
> ---------------------------------------
>
>                 Key: HADOOP-5967
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
>             Project: Hadoop Core
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>            Priority: Minor
>         Attachments: single-mapper.patch
>
>
> The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
> This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5967) Sqoop should only use a single map task

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-5967:
----------------------------------

    Attachment: single-mapper.patch

This patch implements this as a one-liner. No new tests because it's trivial. I've verified that it passes existing unit tests, and also that it does indeed use a single mapper on a cluster.

> Sqoop should only use a single map task
> ---------------------------------------
>
>                 Key: HADOOP-5967
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
>             Project: Hadoop Core
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>            Priority: Minor
>         Attachments: single-mapper.patch
>
>
> The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
> This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5967) Sqoop should only use a single map task

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-5967:
----------------------------------

    Status: Patch Available  (was: Open)

> Sqoop should only use a single map task
> ---------------------------------------
>
>                 Key: HADOOP-5967
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5967
>             Project: Hadoop Core
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>            Priority: Minor
>         Attachments: single-mapper.patch
>
>
> The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
> This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.