You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by "Tony Brusseau (JIRA)" <ji...@apache.org> on 2013/01/18 19:56:12 UTC
[jira] [Commented] (DERBY-6045) Can't bulk load terms by primary key efficiently on tables >256 rows

    [ https://issues.apache.org/jira/browse/DERBY-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557475#comment-13557475 ] 

Tony Brusseau commented on DERBY-6045:
--------------------------------------

I did some more testing and it looks like if I create a table with 128 rows, then b-tree searches are performed in every case. However, if I increase the table to 256 rows, then I start getting heap searches when querying for multiple rows via primary key. Here is some code that will generate a table that demonstrates the problem:

public static void main(String[] args) {
    final Logger logger = Logger.getLogger(DerbyTester.class.toString());
    try {
      StringBuilder buf = new StringBuilder(1024*1024);
      buf.append("DROP TABLE VARIABLE_TERM;\n");
      buf.append("\n");
      buf.append("CREATE TABLE VARIABLE_TERM\n");
      buf.append("(\n");
      buf.append("        term_id                  BIGINT NOT NULL,\n");
      buf.append("        var_name              VARCHAR(1024) NOT NULL,\n");
      buf.append("        var_type                SMALLINT NOT NULL,     \n");
      buf.append("        kb_status              INTEGER NOT NULL        \n");
      buf.append(");\n");
      buf.append("\n");
      buf.append("ALTER TABLE variable_term ADD CONSTRAINT kb_variable_term_term_id_pk PRIMARY KEY (term_id);\n");
      buf.append("ALTER TABLE variable_term ADD CONSTRAINT variable_term_id_check CHECK ((TERM_ID >= 1688849860263936) AND (TERM_ID <= 1970324836974591));\n");
      buf.append("ALTER TABLE variable_term ADD CONSTRAINT kb_variable_term_variable_name_unique UNIQUE (var_name, var_type);\n");
      buf.append("ALTER TABLE variable_term ADD CONSTRAINT kb_variable_term_var_type_enum_check CHECK ((var_type >= 0) AND (var_type <= 4));\n");
      buf.append("\n");
      int count = 256; // @note count should be at least 40
      long idStart = 1688849860263936L;
      for (int i = 0; i < count; i++) {
        buf.append("INSERT INTO VARIABLE_TERM VALUES (").append(idStart++).append(", \'?var").append(i).append("\', ").append(((i %2) == 0) ? 1 : 4).append(", 1);\n");
      }
      buf.append("\n");
      buf.append("SELECT * FROM VARIABLE_TERM WHERE TERM_ID = 1688849860263966;\n");
      buf.append("SELECT * FROM VARIABLE_TERM WHERE (TERM_ID = 1688849860263937) OR (TERM_ID =1688849860263975) OR (TERM_ID = 1688849860263960);\n");
      buf.append("SELECT * FROM VARIABLE_TERM WHERE (TERM_ID  IN (1688849860263937, 1688849860263975, 1688849860263960));\n");
       buf.append("\n");
       System.out.println(buf.toString());
       System.out.flush();
    } catch (Exception e) {
      logger.log(Level.SEVERE, e.getMessage(), e);
    } finally {
      System.exit(0);
    }
  }
                
> Can't bulk load terms by primary key efficiently on tables >256 rows
> --------------------------------------------------------------------
>
>                 Key: DERBY-6045
>                 URL: https://issues.apache.org/jira/browse/DERBY-6045
>             Project: Derby
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 10.9.1.0
>         Environment: Linux Debian 6.0.5
>            Reporter: Tony Brusseau
>            Priority: Critical
>
> I have a table with a long integer primary key field and 11 million rows. I seem to be unable to load large chunks of rows via id in a reasonably efficient manner.
>   1. If I do individual lookups via the primary key, then a fast indexed lookup occurs. However, if I do large numbers of such queries, then the time is overwhelmed by round-trip overhead which makes everything incredibly slow.
>   2. If I use a single query with a disjunction of the primary keys of interest,  then a table scan is performed (even if the clause only contains 1-3 items), which walks over 11 million rows...incredibly inefficient.
>   3. If I use an IN clause, then a table scan is performed (even if the clause only contains 1-3 items), which walks over 11 million rows...incredibly inefficient.
> I'm guessing that this might have something to do with the fact that I'm using large integers and really big numbers that don't start anywhere at or about 1 for my keys. Could this possibly be confusing the optimizer?
> Here are the unlimited query plans for the 3 cases that I enumerated:
> *********************************************************************************************
> [EL Fine]: 2013-01-17 11:09:53.384--ServerSession(582235416)--Connection(1430986883)--Thread(Thread["Initial Lisp Listener",5,SubL Thread Group])--SELECT TERM_ID, ARG0, ARG1, ARG2, ARG3, FORMULA_HASH, FORMULA_LENGTH, FORMULA_TYPE, KB_STATUS FROM KB.FORMULA_TERM WHERE (TERM_ID = ?)
> 	bind => [2251799814033500]
> Thu Jan 17 11:09:53 CST 2013 Thread["Initial Lisp Listener",5,SubL Thread Group] (XID = 4711079), (SESSIONID = 3), SELECT TERM_ID, ARG0, ARG1, ARG2, ARG3, FORMULA_HASH, FORMULA_LENGTH, FORMULA_TYPE, KB_STATUS FROM KB.FORMULA_TERM WHERE (TERM_ID = ?) ******* Project-Restrict ResultSet (3):
> Number of opens = 1
> Rows seen = 1
> Rows filtered = 0
> restriction = false
> projection = true
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count: 1.00
> 	optimizer estimated cost: 6.59
> Source result set:
> 	Index Row to Base Row ResultSet for FORMULA_TERM:
> 	Number of opens = 1
> 	Rows seen = 1
> 	Columns accessed from heap = {1, 2, 3, 4, 5, 6, 7, 8}
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		optimizer estimated row count: 1.00
> 		optimizer estimated cost: 6.59
> 		Index Scan ResultSet for FORMULA_TERM using constraint KB_FORMULA_TERM_TERM_ID_PK at read committed isolation level using share row locking chosen by the optimizer
> 		Number of opens = 1
> 		Rows seen = 1
> 		Rows filtered = 0
> 		Fetch Size = 1
> 			constructor time (milliseconds) = 0
> 			open time (milliseconds) = 0
> 			next time (milliseconds) = 0
> 			close time (milliseconds) = 0
> 			next time in milliseconds/row = 0
> 		scan information:
> 			Bit set of columns fetched=All
> 			Number of columns fetched=2
> 			Number of deleted rows visited=0
> 			Number of pages visited=3
> 			Number of rows qualified=1
> 			Number of rows visited=1
> 			Scan type=btree
> 			Tree height=-1
> 			start position:
> 				>= on first 1 column(s).
> 				Ordered null semantics on the following columns: 
> 			stop position:
> 				> on first 1 column(s).
> 				Ordered null semantics on the following columns: 
> 			qualifiers:
> 				None
> 			optimizer estimated row count: 1.00
> 			optimizer estimated cost: 6.59
> [EL Fine]: 2013-01-17 11:01:00.732--ServerSession(1237006689)--Connection(927179828)--Thread(Thread["Initial Lisp Listener",5,SubL Thread Group])--SELECT TERM_ID, ARG0, ARG1, ARG2, ARG3, FORMULA_HASH, FORMULA_LENGTH, FORMULA_TYPE, KB_STATUS FROM KB.FORMULA_TERM WHERE (((TERM_ID = ?) OR (TERM_ID = ?)) OR (TERM_ID = ?))
> 	bind => [2251799814033500, 2251799814033501, 2251799814033499]
> Thu Jan 17 11:01:10 CST 2013 Thread["Initial Lisp Listener",5,SubL Thread Group] (XID = 4711078), (SESSIONID = 3), SELECT TERM_ID, ARG0, ARG1, ARG2, ARG3, FORMULA_HASH, FORMULA_LENGTH, FORMULA_TYPE, KB_STATUS FROM KB.FORMULA_TERM WHERE (((TERM_ID = ?) OR (TERM_ID = ?)) OR (TERM_ID = ?)) ******* Project-Restrict ResultSet (3):
> Number of opens = 1
> Rows seen = 3
> Rows filtered = 0
> restriction = false
> projection = true
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count: 1176730.30
> 	optimizer estimated cost: 5931065.54
> Source result set:
> 	Project-Restrict ResultSet (2):
> 	Number of opens = 1
> 	Rows seen = 11767298
> 	Rows filtered = 11767295
> 	restriction = true
> 	projection = false
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		restriction time (milliseconds) = 0
> 		projection time (milliseconds) = 0
> 		optimizer estimated row count: 1176730.30
> 		optimizer estimated cost: 5931065.54
> 	Source result set:
> 		Table Scan ResultSet for FORMULA_TERM at read committed isolation level using instantaneous share row locking chosen by the optimizer
> 		Number of opens = 1
> 		Rows seen = 11767298
> 		Rows filtered = 0
> 		Fetch Size = 16
> 			constructor time (milliseconds) = 0
> 			open time (milliseconds) = 0
> 			next time (milliseconds) = 0
> 			close time (milliseconds) = 0
> 			next time in milliseconds/row = 0
> 		scan information:
> 			Bit set of columns fetched=All
> 			Number of columns fetched=9
> 			Number of pages visited=34358
> 			Number of rows qualified=11767298
> 			Number of rows visited=11767298
> 			Scan type=heap
> 			start position:
> 				null
> 			stop position:
> 				null
> 			qualifiers:
> 				None
> 			optimizer estimated row count: 1176730.30
> 			optimizer estimated cost: 5931065.54
> [EL Fine]: 2013-01-17 11:27:00.627--ServerSession(1237006689)--Connection(1688096771)--Thread(Thread["Initial Lisp Listener",5,SubL Thread Group])--SELECT TERM_ID, ARG0, ARG1, ARG2, ARG3, FORMULA_HASH, FORMULA_LENGTH, FORMULA_TYPE, KB_STATUS FROM KB.FORMULA_TERM WHERE (TERM_ID IN (?,?,?))
> 	bind => [2251799814033500, 2251799814033501, 2251799814033499]
> Thu Jan 17 11:47:26 CST 2013 Thread["Initial Lisp Listener",5,SubL Thread Group] (XID = 4711080), (SESSIONID = 3), SELECT TERM_ID, ARG0, ARG1, ARG2, ARG3, FORMULA_HASH, FORMULA_LENGTH, FORMULA_TYPE, KB_STATUS FROM KB.FORMULA_TERM WHERE (TERM_ID IN (?,?,?)) ******* Project-Restrict ResultSet (3):
> Number of opens = 1
> Rows seen = 3
> Rows filtered = 0
> restriction = false
> projection = true
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count: 1176730.30
> 	optimizer estimated cost: 5931065.54
> Source result set:
> 	Project-Restrict ResultSet (2):
> 	Number of opens = 1
> 	Rows seen = 11767298
> 	Rows filtered = 11767295
> 	restriction = true
> 	projection = false
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		restriction time (milliseconds) = 0
> 		projection time (milliseconds) = 0
> 		optimizer estimated row count: 1176730.30
> 		optimizer estimated cost: 5931065.54
> 	Source result set:
> 		Table Scan ResultSet for FORMULA_TERM at read committed isolation level using instantaneous share row locking chosen by the optimizer
> 		Number of opens = 1
> 		Rows seen = 11767298
> 		Rows filtered = 0
> 		Fetch Size = 16
> 			constructor time (milliseconds) = 0
> 			open time (milliseconds) = 0
> 			next time (milliseconds) = 0
> 			close time (milliseconds) = 0
> 			next time in milliseconds/row = 0
> 		scan information:
> 			Bit set of columns fetched=All
> 			Number of columns fetched=9
> 			Number of pages visited=34358
> 			Number of rows qualified=11767298
> 			Number of rows visited=11767298
> 			Scan type=heap
> 			start position:
> 				null
> 			stop position:
> 				null
> 			qualifiers:
> 				None
> 			optimizer estimated row count: 1176730.30
> 			optimizer estimated cost: 5931065.54

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira