Posted to dev@manifoldcf.apache.org by "Karl Wright (Created) (JIRA)" <ji...@apache.org> on 2011/10/24 11:35:32 UTC

[jira] [Created] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

RSS connector takes nearly a second to fetch a document even with no throttling
-------------------------------------------------------------------------------

                 Key: CONNECTORS-281
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-281
             Project: ManifoldCF
          Issue Type: Bug
          Components: RSS connector
    Affects Versions: ManifoldCF 0.4
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 0.4


The RSS connector load test shows that the RSS connector is over-throttling even though no throttling should be in effect.  For example:

10-24-2011 05:30:50.423 	fetch 	http://localhost:8189/rss/gen.php?doc=4&feed=782&type=doc 	200 	46 	843

... where 843 ms is taken to fetch a 46-byte document.  This is with the following connection parameters:

Parameters: 	Robots usage=none
Max fetches per minute=1000000
Email address=somebody@somewhere.com
KB per second=1000000
Max server connections=100
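
As a rough sanity check on the numbers above, the configured limits can be converted into the smallest delay a throttle could legitimately impose.  The sketch below is illustrative arithmetic only: the class and method names are hypothetical, it is not ManifoldCF's throttling code, and it assumes "KB" means 1024 bytes.

```java
// Hypothetical helper, not part of ManifoldCF: back-of-the-envelope throttle arithmetic.
final class ThrottleMath {
    /** Minimum spacing between fetches, in ms, implied by a fetches-per-minute cap. */
    static double minIntervalMillis(long maxFetchesPerMinute) {
        return 60_000.0 / maxFetchesPerMinute;
    }

    /** Time in ms to transfer a document at a KB-per-second cap (assumes 1 KB = 1024 bytes). */
    static double transferMillis(long bytes, long kbPerSecond) {
        return bytes * 1000.0 / (kbPerSecond * 1024.0);
    }

    public static void main(String[] args) {
        double interval = minIntervalMillis(1_000_000);   // "Max fetches per minute=1000000"
        double transfer = transferMillis(46, 1_000_000);  // 46-byte document, "KB per second=1000000"
        System.out.printf("fetch-rate floor: %.4f ms%n", interval);
        System.out.printf("bandwidth floor:  %.6f ms%n", transfer);
        System.out.printf("observed:         %d ms%n", 843);
    }
}
```

Both floors come out well under a millisecond, so under these assumptions the configured limits by themselves cannot account for an 843 ms fetch.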


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-281.
------------------------------------

    Resolution: Fixed
    

[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133957#comment-13133957 ] 

Karl Wright commented on CONNECTORS-281:
----------------------------------------

A second capture shows a much more expected mix of thread activities.  Most of the threads are waiting for database activities, mostly inserting documents and managing carrydown information.  The non-database work seems to be fetching documents and parsing URLs, which is as it should be.

                

[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133932#comment-13133932 ] 

Karl Wright commented on CONNECTORS-281:
----------------------------------------

During this time CPU is pretty much maxed out between the postgres activity and the agents daemon.  The agents daemon is the lion's share of that (75%).

It is possible that the cause is simply insufficient RAM, since I've only given the agents daemon 128MB for the test.  However, the times per fetch seem too high and too consistent to be due to garbage collection.  Either way, something is clearly busy-waiting.

A thread dump would be helpful.



                

[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137765#comment-13137765 ] 

Karl Wright commented on CONNECTORS-281:
----------------------------------------

With the fixes as checked in so far, there's no appreciable difference between early parts of the crawl and later parts.  So I'm going to resolve this issue.

                

[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133951#comment-13133951 ] 

Karl Wright commented on CONNECTORS-281:
----------------------------------------

A thread dump shows all worker threads waiting on database functionality, with one interesting detail: a full 11 of the 30 threads are waiting to RETURN connections to the pool:

"Worker thread '5'" daemon prio=6 tid=0x055b5c00 nid=0x1c14 waiting for monitor entry [0x05def000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.bitmechanic.sql.ConnectionPool.returnConnection(ConnectionPool.java:474)
	- waiting to lock <0x292ad3a0> (a com.bitmechanic.sql.ConnectionPool)
	at com.bitmechanic.sql.PooledConnection.close(PooledConnection.java:202)
	at org.apache.manifoldcf.core.database.ConnectionFactory.releaseConnection(ConnectionFactory.java:113)
	at org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:330)
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1112)
	at org.apache.manifoldcf.core.database.BaseTable.endTransaction(BaseTable.java:274)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.noteDocumentIngest(IncrementalIngester.java:1373)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:503)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentRecord(IncrementalIngester.java:325)
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.recordDocument(WorkerThread.java:1556)
	at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.processDocuments(RSSConnector.java:1281)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:561)

It's not clear at all why this should be.  The only possible hint is that there's one thread waiting on GETTING a connection from the pool:

"Worker thread '19'" daemon prio=6 tid=0x045e1400 nid=0x1b94 waiting for monitor entry [0x0624f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.bitmechanic.sql.ConnectionPool.getConnection(ConnectionPool.java:375)
	- waiting to lock <0x292ad3a0> (a com.bitmechanic.sql.ConnectionPool)
	at com.bitmechanic.sql.ConnectionPoolManager.connect(ConnectionPoolManager.java:442)
	at java.sql.DriverManager.getConnection(DriverManager.java:582)
	at java.sql.DriverManager.getConnection(DriverManager.java:207)
	at org.apache.manifoldcf.core.database.ConnectionFactory.getConnectionWithRetries(ConnectionFactory.java:144)
	at org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:90)
	at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:502)
	at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1152)
	at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
	at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:168)
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:860)
	at org.apache.manifoldcf.crawler.jobs.Carrydown.getDataValuesAsFiles(Carrydown.java:603)
	at org.apache.manifoldcf.crawler.jobs.JobManager.retrieveParentDataAsFiles(JobManager.java:4263)
	at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.retrieveParentDataAsFiles(WorkerThread.java:1211)
	at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:818)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)

This has me wondering if we're seeing a bug in the pool driver.  I'll try to confirm with further stack traces, since this does not explain the high CPU usage.
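
Counting how many threads are blocked on each monitor across several dumps can be done mechanically.  The following is a minimal sketch, assuming HotSpot-style dump text with "- waiting to lock <address> (a class)" lines like those quoted above; the class BlockedLockTally is hypothetical and not part of ManifoldCF.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF: counts blocked threads per contended monitor.
final class BlockedLockTally {
    // Matches dump lines such as:
    //   - waiting to lock <0x292ad3a0> (a com.bitmechanic.sql.ConnectionPool)
    private static final Pattern WAITING =
        Pattern.compile("-\\s+waiting to lock <(0x[0-9a-fA-F]+)> \\(a ([^)]+)\\)");

    /** Returns "address class" -> number of threads waiting to lock that monitor. */
    static Map<String, Integer> tally(String threadDump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = WAITING.matcher(threadDump);
        while (m.find()) {
            counts.merge(m.group(1) + " " + m.group(2), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Abbreviated sample built from the two traces quoted in this comment.
        String sample = String.join("\n",
            "\"Worker thread '5'\" daemon prio=6 waiting for monitor entry",
            "\t- waiting to lock <0x292ad3a0> (a com.bitmechanic.sql.ConnectionPool)",
            "\"Worker thread '19'\" daemon prio=6 waiting for monitor entry",
            "\t- waiting to lock <0x292ad3a0> (a com.bitmechanic.sql.ConnectionPool)");
        tally(sample).forEach((lock, n) -> System.out.println(n + " blocked on " + lock));
    }
}
```

Running the same tally over successive dumps would show whether one monitor stays contended or whether the contention moves around, which is the distinction being drawn here.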


                

[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134660#comment-13134660 ] 

Karl Wright commented on CONNECTORS-281:
----------------------------------------

Committed r1188465 to remove the synchronization that single-threads file deletion.

                

[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134662#comment-13134662 ] 

Karl Wright commented on CONNECTORS-281:
----------------------------------------

Use of temporary files, plus carrydown data, seems to be what makes the RSS connector significantly slower than a file crawl.  I'm still trying to assess whether the carrydown data is the issue later in the crawl.

                

[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133960#comment-13133960 ] 

Karl Wright commented on CONNECTORS-281:
----------------------------------------

A third dump shows the temporary file tracker to be a potential bottleneck.  Many threads are waiting on the same monitor:

"Worker thread '7'" daemon prio=6 tid=0x055b6800 nid=0x165c waiting for monitor entry [0x05e8f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.manifoldcf.core.system.ManifoldCF$FileTrack.addFile(ManifoldCF.java:1177)
	- waiting to lock <0x292a2728> (a org.apache.manifoldcf.core.system.ManifoldCF$FileTrack)
	at org.apache.manifoldcf.core.system.ManifoldCF.addFile(ManifoldCF.java:701)
	at org.apache.manifoldcf.crawler.connectors.rss.DataCache.addData(DataCache.java:67)
	at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:1065)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)

... in both the add and delete paths.
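
One general way to relieve this kind of contention is to track files in a concurrent set and do the slow file deletion outside any shared lock.  The sketch below only illustrates that idea; ConcurrentFileTrack is a hypothetical class, not the actual ManifoldCF$FileTrack code and not necessarily what any checked-in fix does.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the actual ManifoldCF FileTrack class or the committed fix.
final class ConcurrentFileTrack {
    // A concurrent set lets add and remove proceed without one shared monitor.
    private final Set<File> tracked = ConcurrentHashMap.newKeySet();

    void addFile(File f) {
        tracked.add(f);
    }

    /** Stops tracking the file, then deletes it; the delete holds no shared lock. */
    void deleteFile(File f) {
        if (tracked.remove(f)) {
            f.delete();
        }
    }

    int size() {
        return tracked.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentFileTrack track = new ConcurrentFileTrack();
        Thread[] workers = new Thread[30];  // same worker count as in the dumps above
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 20; j++) {
                    try {
                        File f = File.createTempFile("mcf-sketch", ".tmp");
                        track.addFile(f);
                        track.deleteFile(f);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("still tracked: " + track.size());
    }
}
```

The trade-off is that the set no longer gives a single consistent snapshot while threads are adding and removing, which matters only if some caller needs to enumerate every tracked file atomically.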

                