Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2012/08/07 18:54:11 UTC

[jira] [Created] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Karl Wright created CONNECTORS-501:
--------------------------------------

             Summary: Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents
                 Key: CONNECTORS-501
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-501
             Project: ManifoldCF
          Issue Type: Bug
          Components: Framework agents process, Web connector
    Affects Versions: ManifoldCF 0.6
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 0.7


The new web crawler PostgreSQL load test, which uses hopcount-based filtering, does not discover all 11110 documents it is supposed to.  It discovered only 10603 when I ran it just now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430945#comment-13430945 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Even after adding the new logic, I'm still seeing random differences from the expected number of documents, on the order of 8% or so.  Currently I'm stumped as to a scenario that would account for it; it seems to me I'll need to do a run on a smaller set and attempt some forensics.

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432206#comment-13432206 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

I think I see the scenario where things go wrong.  It goes like this:

(1) Imagine a chain of links (a) -> (b) -> (c)
(2) We take the long route to (b) and the short route from there to (c), but (c) is still beyond the hopcount limit and is deleted
(3) We find a better route to (b), which decreases the hopcount for (c), but (b) is not recrawled because nothing important has changed, and therefore (c) is not requeued

One possible fix for this scenario is to recrawl (b) whenever its hopcount decreases.  This, however, would mean a tremendous amount of recrawling to catch relatively few outlying documents.  A subsequent job run might also at least converge towards the proper number.  I'll have to ponder what kind of solution we can implement and afford for the hopcount feature.
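
The scenario above can be reproduced with a toy crawler. This is only a minimal sketch, not ManifoldCF code: the graph, the document names, the LIFO processing order, and the crawl() function are all assumptions made for illustration, and "deleted" is simplified to "never queued".

```python
def crawl(links, seed, max_hops, recrawl_on_decrease):
    """Toy hopcount-limited crawl; returns the set of documents found."""
    hop = {seed: 0}      # best known hopcount per discovered document
    crawled = set()
    stack = [seed]       # LIFO on purpose, so a long route is explored first
    while stack:
        doc = stack.pop()
        if doc in crawled:
            continue
        crawled.add(doc)
        for child in links.get(doc, []):
            new_hop = hop[doc] + 1
            if new_hop > max_hops:
                continue                     # filtered out by hopcount
            if child not in hop:
                hop[child] = new_hop
                stack.append(child)
            elif new_hop < hop[child]:
                hop[child] = new_hop         # a better route was found
                if recrawl_on_decrease:
                    crawled.discard(child)   # force the child to be refetched
                    stack.append(child)
    return set(hop)


# "b" is 3 hops away via x and y, but only 2 hops away via a; "c" hangs off "b".
LINKS = {"seed": ["a", "x"], "x": ["y"], "y": ["b"], "a": ["b"], "b": ["c"]}

missing_c = crawl(LINKS, "seed", 3, recrawl_on_decrease=False)
with_c = crawl(LINKS, "seed", 3, recrawl_on_decrease=True)
```

In this sketch "c" is truly 3 hops from the seed, yet it is absent from missing_c because "b" was processed while its hopcount was still 3 and was never revisited once the shorter route lowered it to 2. Recrawling on every decrease finds "c", at the cost of fetching "b" twice.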

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432872#comment-13432872 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Committed to trunk.

r1372225

                

        

[jira] [Resolved] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-501.
------------------------------------

    Resolution: Fixed
    

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431434#comment-13431434 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Hmm, I tried a straight reversion of the fix for CONNECTORS-464, and that also did not arrive at the correct doc count.  Meanwhile, I attempted to revert the key changes for CONNECTORS-464 from the CONNECTORS-501 branch, but wound up with code that clearly still deletes too many intrinsiclink table records.  Debugging now...

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432210#comment-13432210 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Another potentially more interesting trick would be to only recrawl those documents that have a hopcount that is on "the edge", and whose hopcounts decline.
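
One possible reading of this idea, as a minimal sketch rather than the actual implementation: when a document's hopcount declines, refetch it only if it previously sat exactly at the limit (so its children had been filtered out); otherwise just push the lower hopcount through the links that were already seen. The graph, the names, and the crawl_edge_only() function are hypothetical, and the links dictionary stands in for whatever link records the framework keeps.

```python
def crawl_edge_only(links, seed, max_hops):
    """Toy crawl that refetches only 'edge' documents whose hopcount declines.

    Returns (documents found, number of fetches per document).
    """
    hop = {seed: 0}
    crawled = set()
    fetches = {}
    stack = [seed]       # LIFO on purpose, so a long route is explored first

    def lower(doc, new_hop):
        pending = [(doc, new_hop)]
        while pending:
            d, h = pending.pop()
            if h > max_hops:
                continue                     # filtered out by hopcount
            old = hop.get(d)
            if old is not None and h >= old:
                continue                     # not an improvement
            hop[d] = h
            if old is None:
                stack.append(d)              # newly discovered document
            elif d in crawled:
                if old == max_hops:
                    # Edge document: its children were filtered, so refetch it.
                    crawled.discard(d)
                    stack.append(d)
                else:
                    # Interior document: its children are already known, so
                    # propagate the lower hopcount without refetching.
                    pending.extend((c, h + 1) for c in links.get(d, []))
            # Otherwise d is queued but not yet fetched; nothing more to do.

    while stack:
        doc = stack.pop()
        if doc in crawled:
            continue
        crawled.add(doc)
        fetches[doc] = fetches.get(doc, 0) + 1
        for child in links.get(doc, []):
            lower(child, hop[doc] + 1)
    return set(hop), fetches


# "m" is first reached in 3 hops (via x, y) and later in 2 hops (via a).
LINKS = {"seed": ["a", "x"], "a": ["m"], "x": ["y"], "y": ["m"],
         "m": ["n"], "n": ["o"]}
found, fetches = crawl_edge_only(LINKS, "seed", 4)
```

With a limit of 4, "n" is first seen at hopcount 4 (on the edge) and "o" is filtered. When the shorter route to "m" turns up, "m" is not refetched; only "n" is, which is enough to discover "o".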

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431248#comment-13431248 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

The fix for CONNECTORS-464 seems to be the source of this bug.  The fix removed intrinsic links on either end of a document in the jobqueue.  The logic in question may well have been in place to address this problem.

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430506#comment-13430506 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Here's a potential race:

(1) There are two paths to get to a document, one longer, and one shorter.
(2) The first worker thread picks up the document after the longer path has queued it up, and decides to delete it
(3) Before the document is deleted, however, the shorter path is evaluated in a different thread and tries to queue it up
(4) The first thread deletes the document anyway

We had a similar race condition with carrydown data, and fixed it by detecting the potential conflict (in that case by noting a change in the carrydown information the document would see, plus the document being in the "active" state).  We need to do something similar for hopcount, I think.
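
The four steps can be replayed deterministically, without real threads, to show how the document is lost. This is an illustrative sketch only; the dictionary-based queue, the limit of 2 hops, and the names are assumptions, not ManifoldCF data structures.

```python
MAX_HOPS = 2  # hypothetical hopcount limit


def replay_race():
    """Replay steps (1)-(4) in a fixed order; returns the final queue."""
    # (1) The document is queued via the longer path (3 hops from the seed).
    queue = {"doc": 3}  # document id -> best known hopcount

    # (2) Worker 1 picks the document up and decides to delete it.
    worker1_will_delete = queue["doc"] > MAX_HOPS

    # (3) Worker 2 evaluates the shorter path (2 hops) and tries to queue the
    #     document; it is already present, so only the hopcount is lowered.
    queue["doc"] = min(queue["doc"], 2)

    # (4) Worker 1 acts on its now-stale decision.
    if worker1_will_delete:
        del queue["doc"]
    return queue
```

The replay ends with an empty queue even though the document is within the limit via the shorter path, because the decision in step (2) is never re-checked against the update in step (3).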

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433109#comment-13433109 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Also, optimizations:

r1372402

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430557#comment-13430557 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

I created a CONNECTORS-501 branch to work on this ticket.  I've checked in code which should put documents that are in "active" into "activerescanneeded" if their hopcount situation changes during processing.  I still need to pick up on this change for deletion, however: the deletion logic now needs to put documents that are in "activerescanneeded" back into "active" rather than actually deleting them.  Because the same jobManager deletion method is used in many places, I may wind up creating a new jobManager method meant to work only in the context of an active document.
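
A minimal sketch of the state handling described above, under stated assumptions: only the state names "active" and "activerescanneeded" come from this ticket; the ToyJobQueue class, its method names, the "pending" state, and the single lock are hypothetical and do not reflect the real jobManager API.

```python
import threading

ACTIVE = "active"
ACTIVE_RESCAN_NEEDED = "activerescanneeded"
PENDING = "pending"  # hypothetical name for "queued, not yet picked up"


class ToyJobQueue:
    def __init__(self):
        self._lock = threading.Lock()
        self.docs = {}  # document id -> {"hop": int, "state": str}

    def add_or_update(self, doc_id, hop):
        """Queue a document, or lower its hopcount if a shorter path is found."""
        with self._lock:
            rec = self.docs.get(doc_id)
            if rec is None:
                self.docs[doc_id] = {"hop": hop, "state": PENDING}
            elif hop < rec["hop"]:
                rec["hop"] = hop
                if rec["state"] == ACTIVE:
                    # The hopcount changed while a worker is processing it.
                    rec["state"] = ACTIVE_RESCAN_NEEDED

    def start_processing(self, doc_id):
        """Mark a document active and return the hopcount the worker sees."""
        with self._lock:
            rec = self.docs[doc_id]
            rec["state"] = ACTIVE
            return rec["hop"]

    def delete_active(self, doc_id):
        """Delete an active document unless it was flagged for rescan."""
        with self._lock:
            rec = self.docs[doc_id]
            if rec["state"] == ACTIVE_RESCAN_NEEDED:
                rec["state"] = ACTIVE  # keep it; it must be evaluated again
                return False
            del self.docs[doc_id]
            return True


queue = ToyJobQueue()
queue.add_or_update("doc", 3)                     # queued via the longer path
too_far = queue.start_processing("doc") > 2       # worker decides to delete
queue.add_or_update("doc", 2)                     # shorter path arrives
deleted = queue.delete_active("doc")              # deletion is refused
```

Here the worker's stale decision to delete no longer wins: the document stays in the queue with hopcount 2 and is back in "active", ready to be re-evaluated.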


                

        

[jira] [Updated] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-501:
-----------------------------------

    Attachment: capture.txt

Output of a run of the test showing deletions without re-adds
                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431222#comment-13431222 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

I have confirmed that most if not all deletions are the result of hopcount delete code being triggered.  Furthermore, I have a scenario that would account for the deletions.

The scenario looks like this:

- Start with two documents, a and b
- There are two paths from a to b, one longer than the other
- There are two paths from the seed to a, one longer than the other
- If we arrive at b via the longer path from seed to a and the shorter path from a to b, then b may be removed along with the (shorter) link from a to b
- The system will not recover because only the longer link from a to b will be discoverable after the shorter link has been removed

Basically this means that we cannot remove intrinsic links even when the corresponding job queue entries have been removed.
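
To illustrate why the link records matter, here is a small sketch that computes hopcounts purely from recorded (parent, child) links. It is a simplification under stated assumptions: there is a single link from a to b rather than two paths, the limit is 3 hops, and the function names and the set-of-tuples "link table" are hypothetical, not the real intrinsiclink schema.

```python
from collections import deque


def hopcounts(link_records, seed):
    """Breadth-first distances from the seed over recorded (parent, child) links."""
    dist = {seed: 0}
    todo = deque([seed])
    while todo:
        doc = todo.popleft()
        for parent, child in link_records:
            if parent == doc and child not in dist:
                dist[child] = dist[doc] + 1
                todo.append(child)
    return dist


def eligible(link_records, seed, max_hops):
    """Documents that the recorded links place within the hopcount limit."""
    dist = hopcounts(link_records, seed)
    return {doc for doc, hops in dist.items() if hops <= max_hops}


def after_hopcount_delete(link_records, seed, max_hops, drop_links):
    """Delete out-of-range documents, optionally dropping links at either end."""
    dist = hopcounts(link_records, seed)
    too_far = {doc for doc, hops in dist.items() if hops > max_hops}
    if not drop_links:
        return set(link_records)
    return {(p, c) for p, c in link_records
            if p not in too_far and c not in too_far}


MAX_HOPS = 3
# a was reached by the longer route (seed -> x -> y -> a), so b sits at 4 hops.
first_pass = {("seed", "x"), ("x", "y"), ("y", "a"), ("a", "b")}
# The shorter route to a (seed -> p -> a) is only recorded later.
short_route = {("seed", "p"), ("p", "a")}

kept = after_hopcount_delete(first_pass, "seed", MAX_HOPS, drop_links=False)
dropped = after_hopcount_delete(first_pass, "seed", MAX_HOPS, drop_links=True)

found_when_kept = eligible(kept | short_route, "seed", MAX_HOPS)
found_when_dropped = eligible(dropped | short_route, "seed", MAX_HOPS)
```

When the (a, b) record survives the deletion of b, the later shorter route to a brings b back within range (3 hops). When the record is dropped along with b, nothing in the recorded links leads to b any more, so it cannot be recovered unless a is fetched again.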
                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432195#comment-13432195 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Intrinsiclink undercount fixed, but did not solve the problem, as expected.

I verified that the counts look completely correct when all hopcount filtering is turned off.  I also captured a trace of the document hashes deleted because of hopcount and re-added later.  It turns out that (as expected) many of the deletions are not re-added.  I've attached the capture.

                

        

[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432866#comment-13432866 ] 

Karl Wright commented on CONNECTORS-501:
----------------------------------------

Found the problem and checked in a fix.  FINALLY I'm getting the right counts!!

                