You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Spencer Ho (JIRA)" <ji...@apache.org> on 2009/11/20 01:32:39 UTC

[jira] Created: (MAPREDUCE-1224) Calling "SELECT t.* from AS t" to get meta information is too expensive for big tables
Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
----------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1224
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/sqoop
    Affects Versions: 0.20.1
         Environment: all platforms, generic jdbc driver
            Reporter: Spencer Ho


The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1224:
-------------------------------------

    Hadoop Flags: [Reviewed]

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Spencer Ho (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Spencer Ho updated MAPREDUCE-1224:
----------------------------------

    Attachment: SqlManager.java

The original code from line 66 to 68 of SqlManager was

protected String getColNamesQuery(String tableName) {
    return "SELECT t.* FROM " + tableName + " AS t";
}

As this method was invoked three times in the code to generated column name and type information, it queries the database three times.  For a large table, it makes the whole loading work to query the whole table four time.

The change made is to add an always-false where clause that forces db to return zero-size result set yet with meta data. (from line 66 to 69)

  protected String getColNamesQuery(String tableName) {
    // adding where clause to prevent loading a big table
    return "SELECT t.* FROM " + tableName + " AS t WHERE 1=0";
  }

The execution time for retrieving one of the large tables we have reduced from 40 minutes to 11 minutes.

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated MAPREDUCE-1224:
-----------------------------------

    Status: Patch Available  (was: Open)

Marking patch available to trigger Hudson

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-1224:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.22.0
         Assignee: Spencer Ho
           Status: Resolved  (was: Patch Available)

I've just committed this. Thanks Spencer!

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>            Assignee: Spencer Ho
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Spencer Ho (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783776#action_12783776 ] 

Spencer Ho commented on MAPREDUCE-1224:
---------------------------------------

@Aaron,
This particular case that triggered the patch submission is for Microsoft SQL Server.  For MySQL, I am using direct mode which works for most of the cases.

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796710#action_12796710 ] 

Hudson commented on MAPREDUCE-1224:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #196 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/196/])
    

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>            Assignee: Spencer Ho
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783810#action_12783810 ] 

Aaron Kimball commented on MAPREDUCE-1224:
------------------------------------------

Good to know that this works with SQL Server as well. Thanks for the patch.

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780822#action_12780822 ] 

Todd Lipcon commented on MAPREDUCE-1224:
----------------------------------------

Spencer: would you mind uploading this as a unified diff (patch) file?

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780906#action_12780906 ] 

Hadoop QA commented on MAPREDUCE-1224:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425689/MAPREDUCE-1224.patch
  against trunk revision 882790.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/console

This message is automatically generated.

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780349#action_12780349 ] 

Todd Lipcon commented on MAPREDUCE-1224:
----------------------------------------

Perhaps this could be changed to add "WHERE 1 = 0". Any SQL optimizer should evaluate this very quickly and return an empty result set, allowing metadata to be grabbed without actually doing work. Aaron?

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780867#action_12780867 ] 

Jeff Hammerbacher commented on MAPREDUCE-1224:
----------------------------------------------

Should we try using actual JDBC metadata calls first? See, e.g., http://blog.codebeach.com/2008/12/database-metadata-with-jdbc.html

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Spencer Ho (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Spencer Ho updated MAPREDUCE-1224:
----------------------------------

    Attachment: MAPREDUCE-1224.patch

Here is the patch file.  Please review.
Thanks for the reminder.

> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from

AS t" to get meta information is too expensive for big tablesPosted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783192#action_12783192 ] 

Aaron Kimball commented on MAPREDUCE-1224:
------------------------------------------

@Jeff Sqoop is already using the ResultSetMetaData associated with the query, rather than trying to read the DatabaseMetaData directly. Especially when we eventually support arbitrary user-supplied queries, this will be necessary. It can also be tricky to set all the parameters for a DatabaseMetaData correctly in a generic way. But to get at ResultSetMetaData (which definitely includes the proper typing information), a query must be submitted.

@Spenser This is a good catch and improvement! What database are you testing against? This patch passes unit tests against HSQLDB, PostgreSQL, and Oracle, so +1 from me. 

For PostgreSQL and MySQL, Sqoop uses {{connection.setFetchSize()}} to specify a row-buffered (rather than table-buffered) result, so it returns fast. But unfortunately, {{setFetchSize()}} is, like everything else in JDBC, poorly specified, so there isn't a good way to do this generically. This is a good way to ensure that the query returns quickly even if the database does not respect a row-buffered connection.


> Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>    Affects Versions: 0.20.1
>         Environment: all platforms, generic jdbc driver
>            Reporter: Spencer Ho
>         Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types.  For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.