
Posted to dev@manifoldcf.apache.org by "Maciej Lizewski (JIRA)" <ji...@apache.org> on 2012/11/13 14:08:24 UTC

[jira] [Created] (CONNECTORS-567) Extended seeding interface which provides document versions

Maciej Lizewski created CONNECTORS-567:
------------------------------------------

             Summary: Extended seeding interface which provides document versions
                 Key: CONNECTORS-567
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-567
             Project: ManifoldCF
          Issue Type: Wish
            Reporter: Maciej Lizewski


There are some cases where the seeding function can provide a document's version from data it already has.
The current data flow needs one call to addSeedDocuments, then a call to getDocumentVersions, which essentially must fetch the same data, and after that one more call to processDocuments. The last one probably needs a separate call because it has to fetch the document body; seeding and getting versions, however, in many cases work on the very same data (and probably duplicate requests to the repository).

Reducing the number of requests to the repository, by eliminating the getDocumentVersions call for documents whose version was already returned by addSeedDocuments, could significantly reduce load.

getDocumentVersions would still be called for older documents (not returned by addSeedDocuments) to check whether they were modified or deleted.

This is only a proposal...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497916#comment-13497916 ] 

Karl Wright commented on CONNECTORS-567:
----------------------------------------

There are a number of connectors that need to do version checks across many threads, not just one, which is why I originally designed the connector interface the way I did.

I could imagine supporting both models, however.  The IxxxActivity interfaces were invented to allow the crawling model to be extended without breaking existing connectors.  All you would have to do (in theory) to support something like what you are talking about would be to add a new ISeedingActivity method that would record not only a document's discovery, but also its version information.

However, this is not a trivial change internally, because the flow at the moment involves obtaining the version information in the same worker thread that would process the information if the version indicated that was needed.  So dispatch to the worker thread will have already taken place either way, and the only real difference would be that somehow we'd decide it was unnecessary to call getDocumentVersions() for certain documents.  But you'd still need to support getDocumentVersions() for older documents, as you point out, so I'm having a bit of a hard time figuring out exactly when a document would be "old enough" to call getDocumentVersions().

A much easier model would be to support an all-in-one approach, which might be appropriate for something like JDBC.  In that model the seeding query returns everything, and getDocumentVersions() and processDocuments() do nothing.

It may be worth reading ManifoldCF in Action, especially the parts about crawling models, since that may help inform your thoughts a bit.

                

[jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498003#comment-13498003 ] 

Karl Wright commented on CONNECTORS-567:
----------------------------------------

I think not being able to handle deletions is a significant problem, since this is an incremental crawler.  We'd have to have a solution to that problem before this could be a possibility.  Right now the only way a deletion is detected is by asking for the version string of a document that no longer exists.
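
To make the deletion point concrete, here is a minimal sketch of how a version check can tell deleted, unchanged and changed documents apart. The class and method names are hypothetical and are not part of the ManifoldCF API; the convention that a null version means "document is gone" is taken from the discussion in this thread.

```java
import java.util.Objects;

// Hypothetical helper, for illustration only; not a ManifoldCF class.
class VersionCheckSketch {
    enum Outcome { DELETED, UNCHANGED, CHANGED }

    // storedVersion: version recorded at the previous crawl (null if never indexed).
    // fetchedVersion: version just obtained from the repository; null means the
    // document no longer exists there.
    static Outcome classify(String storedVersion, String fetchedVersion) {
        if (fetchedVersion == null) {
            return Outcome.DELETED;
        }
        return Objects.equals(storedVersion, fetchedVersion)
                ? Outcome.UNCHANGED
                : Outcome.CHANGED;
    }
}
```

Any scheme that skips the version lookup for some documents still has to run a check like this for every previously indexed document that seeding did not report.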

Also, FWIW, I make free copies of ManifoldCF in Action available to all committers, upon request.

                

[jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498828#comment-13498828 ] 

Karl Wright commented on CONNECTORS-567:
----------------------------------------

I mailed the book to your gmail account - please confirm you received it.

                

[jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions

Posted by "Maciej Lizewski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498046#comment-13498046 ] 

Maciej Lizewski commented on CONNECTORS-567:
--------------------------------------------

Would be nice to have the book.

As to your concerns: there will always be getDocumentVersions to handle such cases. I am not talking about removing this function, just about a small extension that can simplify the process in some cases.
                

[jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions

Posted by "Maciej Lizewski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497956#comment-13497956 ] 

Maciej Lizewski commented on CONNECTORS-567:
--------------------------------------------

I would also go with two scenarios to maintain compatibility with the current model.

My point is that there are plenty of cases where listing documents also gives you information about their versions: a directory listing gives you the file modification time; an SQL query can return the document ID and its version; web interfaces (REST, WebService) often offer a getObjectsList-style method that returns document IDs and almost always some document information such as modification time, version, owner, etc., plus a separate method for fetching the whole document.
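
As an illustration of the directory-listing case, here is a minimal sketch that lists the files in a directory and derives a version string from each file's modification time. The class and method names are hypothetical, and using the modification time in milliseconds as the version string is an assumption made for the example. Note that in Java the modification time is read per file, so whether this saves repository round trips depends on the underlying file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only; not a ManifoldCF class.
class DirectoryListingSketch {
    // Returns file name -> version string for every regular file directly in dir.
    // The version string is the last-modified time in milliseconds (an assumption).
    static Map<String, String> listWithVersions(Path dir) {
        Map<String, String> versions = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    long modified = Files.getLastModifiedTime(entry).toMillis();
                    versions.put(entry.getFileName().toString(), Long.toString(modified));
                }
            }
        } catch (IOException e) {
            // Wrapped so callers are not forced to handle a checked exception.
            throw new UncheckedIOException(e);
        }
        return versions;
    }
}
```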

Your proposed all-in-one model is not as good because, as I said earlier, common interfaces have separate methods for fetching lists and single documents, so you would have to fetch the list first and then fetch the content of every document. Another reason is that in the real world documents do not change very often, and fetching their content every time is unnecessary overhead.

And last but not least, what I mean by "old enough": when you call addSeedDocuments there are several scenarios, but in most cases this method can provide new documents, updated documents, and often all other documents that still exist. There are still some documents that were deleted, and addSeedDocuments mostly will not return them. They are injected into the reindexing process from the database of previously indexed documents, and when getDocumentVersions returns null they are removed. That is clear, and this is what I mainly meant: getDocumentVersions could be used to fetch versions for documents that are already in our database but that addSeedDocuments did not return (either because they were deleted, or because they were not modified and addSeedDocuments returns only new and modified documents).

So I was thinking of a (re)indexing process like this:
1. mark all already indexed documents for re-indexing
2. call addSeedDocuments, which may or may not provide versions for documents
3. call getDocumentVersions for all documents that were not added by addSeedDocuments with a version (this means it should also be called for documents added by addSeedDocuments without a version; this is the backward compatibility)
4. call processDocuments as usual.

Now, if addSeedDocuments does not provide versions at all, this process is much the same as it works now. If addSeedDocuments provides versions for some (or all) documents, those are excluded from the calls to getDocumentVersions.
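
Steps 1 to 3 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the set of already indexed IDs and the seeding results are passed in as plain collections, and a null value in the seeding map stands for "seeded without a version". It says nothing about how the framework would do this internally.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical planner, for illustration only; not a ManifoldCF class.
class ReindexPlanSketch {
    // alreadyIndexed: IDs known from previous crawls.
    // seeded: ID -> version reported during seeding; a null value means that
    //         the document was seeded without a version.
    // Returns the IDs that still need a getDocumentVersions-style lookup.
    static Set<String> idsNeedingVersionLookup(Set<String> alreadyIndexed,
                                               Map<String, String> seeded) {
        // Step 1: every already indexed document is a re-index candidate.
        Set<String> candidates = new LinkedHashSet<>(alreadyIndexed);
        // Step 2: add everything the seeding call reported.
        candidates.addAll(seeded.keySet());
        // Step 3: keep only the documents for which seeding gave no version.
        Set<String> needLookup = new LinkedHashSet<>();
        for (String id : candidates) {
            if (seeded.get(id) == null) {
                needLookup.add(id);
            }
        }
        return needLookup;
    }
}
```

If the seeding map contains no versions at all, the result is simply every candidate, which matches the current behaviour.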

From the connector side the difference could be just in calling an overloaded ISeedingActivity::addSeedDocument method with a second argument:
addSeedDocument(idValue) or addSeedDocument(idValue, version)
Of course I understand it means much more hidden work on the other side of this interface :)
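
A minimal sketch of what such an overload could look like from the connector's point of view. The interface below is a hypothetical stand-in written for this example, not the real ISeedingActivity, and the String type for the version argument is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the seeding activity interface; not the real ManifoldCF type.
interface SeedingActivitySketch {
    // Existing style: record only that the document was discovered.
    void addSeedDocument(String idValue);

    // Proposed overload: also record the version the seeding call already knows.
    void addSeedDocument(String idValue, String version);
}

// In-memory implementation that remembers which seeds came with a version.
class RecordingSeedingActivity implements SeedingActivitySketch {
    // Maps document ID to version; a null value means "no version supplied".
    final Map<String, String> seeds = new LinkedHashMap<>();

    @Override
    public void addSeedDocument(String idValue) {
        addSeedDocument(idValue, null);
    }

    @Override
    public void addSeedDocument(String idValue, String version) {
        seeds.put(idValue, version);
    }

    // True when a later version lookup is still required for this document,
    // including documents that were never seeded at all.
    boolean needsVersionLookup(String idValue) {
        return seeds.get(idValue) == null;
    }
}
```

Connectors that keep calling the one-argument method behave exactly as before, which is the backward compatibility mentioned above.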

What do you think about it?
                