Posted to dev@accumulo.apache.org by "Keith Turner (Created) (JIRA)" <ji...@apache.org> on 2012/01/13 23:38:39 UTC

[jira] [Created] (ACCUMULO-314) Lookup in shard random walk test runs slowly sometimes

Lookup in shard random walk test runs slowly sometimes
------------------------------------------------------

                 Key: ACCUMULO-314
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-314
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
         Environment: 1.4.0-SNAPSHOT on 10 node cluster
            Reporter: Keith Turner
            Assignee: Keith Turner
             Fix For: 1.4.0


While running the random walk test, I noticed, thanks to ACCUMULO-273, that the shard test sometimes ran slowly.

{noformat}
13 19:56:47,284 [shard.Merge] DEBUG: merging ST_index_6389_1326478898465
13 19:56:52,543 [shard.Insert] DEBUG: Inserted document ac64000000000000
13 19:56:54,016 [shard.Commit] DEBUG: Committed inserts
13 19:56:54,019 [shard.Insert] DEBUG: Inserted document bc64000000000000
13 19:56:54,020 [shard.Insert] DEBUG: Inserted document cc64000000000000
13 19:56:54,021 [shard.Insert] DEBUG: Inserted document dc64000000000000
13 19:56:54,022 [shard.Insert] DEBUG: Inserted document ec64000000000000
13 19:56:54,023 [shard.Insert] DEBUG: Inserted document fc64000000000000
13 19:56:54,025 [shard.Insert] DEBUG: Inserted document 0d64000000000000
13 19:56:54,026 [shard.Insert] DEBUG: Inserted document 1d64000000000000
13 19:56:54,055 [shard.Commit] DEBUG: Committed inserts
13 19:56:54,068 [shard.Search] DEBUG: Looking up terms [154l, 1kzi] expect to find 9ee0000000000000
13 20:01:54,102 [randomwalk.Module] WARN : Node org.apache.accumulo.server.test.randomwalk.shard.Search has been running for 300.0 seconds. You may want to look into it.
13 20:05:52,530 [randomwalk.Module] WARN : Node org.apache.accumulo.server.test.randomwalk.shard.Search, which was running long, has now completed after 538.475 seconds
{noformat}

I noticed that a merge usually preceded the slow lookups. I looked at the master logs, saw that the merge finished OK, and saw which tablet server the merged tablet was assigned to. Below are some snippets from the master log showing the table id and tablet server.

{noformat}
13 18:36:43,236 [tableOps.RenameTable] DEBUG: Renamed table 1bk ST_index_6389_1326478898465_tmp ST_index_6389_1326478898465

13 19:56:47,293 [tableOps.Utils] INFO : table 1bk (3b08cf01ba49883) locked for write operation: MERGE

13 19:56:52,496 [tableOps.Utils] INFO : table 1bk (3b08cf01ba49883) unlocked for write
13 19:56:52,504 [master.Master] DEBUG: Normal Tablets assigning tablet 1bk<<=xxx.xxx.xxx.xxx:9997[134d7425fc503db]
{noformat}

Some snippets from the tablet server logs, shown below, illustrate the problem.

{noformat}
13 19:56:52,522 [tabletserver.Tablet] TABLET_HIST: 1bk<< opened

13 19:56:54,065 [tabletserver.Tablet] WARN : Tablet 1bk<< has too many files, batch lookup can not run

13 19:57:10,383 [tabletserver.Compactor] DEBUG: Compaction 1bk<< 6,954 read | 6,954 written | 108,656 entries/sec |  0.064 secs
13 19:57:10,402 [tabletserver.Tablet] TABLET_HIST: 1bk<< MajC [/t-0000qzs/C0000sj3.rf, /t-0000qzt/F0000rtf.rf, /t-0000qzt/F0000s0r.rf, /t-0000qzz/F0000sc0.rf, /t-0000r00/F0000s0v.rf, /t-0000r0f/C0000rpu.rf, /t-0000r0l/C0000qqz.rf, /t-0000rqt/C0000s3m.rf, /t-0000rra/C0000sbx.rf, /t-0000rrh/F0000sgj.rf] --> /c-00000054/C0000soe.rf

13 19:57:40,534 [tabletserver.Compactor] DEBUG: Compaction 1bk<< 21,036 read | 21,036 written | 104,656 entries/sec |  0.201 secs
13 19:57:40,564 [tabletserver.Tablet] TABLET_HIST: 1bk<< MajC [/t-0000qzm/C0000rfa.rf, /t-0000r0l/F0000sc6.rf, /t-0000rr1/C0000rpr.rf, /t-0000rr4/F0000sc2.rf, /t-0000rr5/F0000rq0.rf, /t-0000rr9/F0000s0y.rf, /t-0000rrb/F0000sc5.rf, /t-0000rrs/F0000sc7.rf, /t-0000rs1/F0000ssf.rf, /t-0000rs2/F0000ssg.rf] --> /c-00000054/C0000son.rf
{noformat}

The problem is that the merged tablet has too many files to open, so the batch scan for the shard test cannot run. However, it takes the tablet server forever to work this issue out: every 30 seconds it compacts 10 of the tablet's files down to one. Each compaction takes only a few hundred milliseconds, so the backlog could be worked off much faster if the compactions ran back to back.
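
As a back-of-the-envelope check (a sketch only; the starting file count and per-compaction limit below are assumptions, not values from the logs), the 30 second wait between compactions dominates the total time:

{code:java}
// Hypothetical numbers: a merged tablet left with ~180 files, each major
// compaction folding 10 files into 1 and taking ~0.2 seconds (roughly what
// the Compactor log lines above show), with a 30 second wait in between.
public class CompactionBackoffSketch {
  public static void main(String[] args) {
    int files = 180;           // assumed file count after the merge
    int perCompaction = 10;    // assumed files combined per compaction
    double compactSecs = 0.2;  // roughly what the logs above show
    double waitSecs = 30.0;    // observed wait between compactions

    int rounds = 0;
    while (files > perCompaction) {
      files -= perCompaction - 1; // 10 files become 1, a net reduction of 9
      rounds++;
    }
    // 19 rounds: about 4 seconds back to back, versus about 570 seconds
    // with the 30 second wait -- the same order as the 538 second Search.
    System.out.printf("rounds=%d back-to-back=%.1fs with-wait=%.1fs%n",
        rounds, rounds * compactSecs, rounds * waitSecs);
  }
}
{code}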

In 1.3, compactions were changed from depth first to breadth first (i.e. if a tablet server has 100 tablets that each have 100 files, then instead of compacting each tablet all the way down to one file, it goes across the tablets compacting 10 files at a time until every tablet has one file). That change introduced this bug. There is no need to wait 30 seconds between compactions in this case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ACCUMULO-314) Re-queue tablets immediately after major compaction if there is more work

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-314:
----------------------------------

    Summary: Re-queue tablets immediately after major compaction if there is more work   (was: Lookup in shard random walk test runs slowly sometimes)

Re-queuing a tablet for compaction immediately if there is more work to do will still preserve the breadth-first nature, because tablets in the compaction queue are sorted by the number of files.
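
A minimal sketch of the idea (illustrative Java only; the class and member names are hypothetical, not the actual tserver code): keep the queue ordered by file count, and when a compaction finishes, immediately re-offer the tablet if it still has more files than one compaction can consume.

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative sketch only; names are hypothetical, not Accumulo's API.
class CompactionQueueSketch {
  static final int MAX_FILES_PER_COMPACTION = 10; // assumed limit

  static class Tablet {
    final String extent;
    int fileCount;
    Tablet(String extent, int fileCount) {
      this.extent = extent;
      this.fileCount = fileCount;
    }
  }

  // Breadth-first order is preserved because the queue is sorted by file
  // count: the tablet with the most files is always compacted next.
  private final PriorityQueue<Tablet> queue = new PriorityQueue<>(
      Comparator.comparingInt((Tablet t) -> t.fileCount).reversed());

  synchronized void offer(Tablet t) {
    queue.add(t);
  }

  synchronized void compactNext() {
    Tablet t = queue.poll();
    if (t == null) {
      return;
    }
    // Compact up to 10 files down to 1 (a net reduction of 9).
    int compacted = Math.min(t.fileCount, MAX_FILES_PER_COMPACTION);
    t.fileCount = t.fileCount - compacted + 1;
    // The fix: if the tablet still has too many files, re-queue it
    // immediately instead of waiting for the next 30 second sweep.
    if (t.fileCount > MAX_FILES_PER_COMPACTION) {
      offer(t);
    }
  }
}
{code}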
                

[jira] [Resolved] (ACCUMULO-314) Re-queue tablets immediately after major compaction if there is more work

Posted by "Keith Turner (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-314.
-----------------------------------

    Resolution: Fixed
    

[jira] [Updated] (ACCUMULO-314) Re-queue tablets immediately after major compaction if there is more work

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-314:
----------------------------------

    Affects Version/s: 1.3.5
    