Posted to dev@manifoldcf.apache.org by "Karl Wright (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2011/10/14 15:32:11 UTC

[jira] [Issue Comment Edited] (CONNECTORS-256) Connector for crawling Wikis

    [ https://issues.apache.org/jira/browse/CONNECTORS-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127534#comment-13127534 ] 

Karl Wright edited comment on CONNECTORS-256 at 10/14/11 1:30 PM:
------------------------------------------------------------------

bq. Also, information about the last_modified date was missing completely.

I did not include the last_modified date in the indexed metadata; this is why I wanted other folks to try it out.  I've opened a new ticket to capture that enhancement: CONNECTORS-273.

                
      was (Author: kwright@metacarta.com):
    bq. Also, information about the last_modified date was missing completely.

I did not include the last_modified date in the indexed metadata; this is why I wanted other folks to try it out.  I'll open a new ticket to capture that enhancement.

                  
> Connector for crawling Wikis
> ----------------------------
>
>                 Key: CONNECTORS-256
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-256
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Wiki connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.4
>
>
> People have been trying to crawl wikis with ManifoldCF, but using the generic crawler is not a good way to do this.  Instead, it looks like we really could use a wiki connector, which would understand the wiki API and thus crawl wiki content quickly and effectively.
> Some pertinent API references follow:
> I don't know if it is possible to link to a wiki document with just the pageid, but it is possible to get the URL of the page that a pageid refers to via the API:
> http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
> It is possible to get the metadata of a document directly using the page's ID (instead of its title):
> Title -> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
> PageID -> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
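The request URLs above can be assembled programmatically. What follows is only a minimal Python sketch, not the connector's actual code: the helper names (page_info_url, revisions_url, full_url_from_response) are hypothetical, and adding format=json is an assumption made so the response is easy to parse. It also assumes the JSON reply keys pages by page ID and carries the page address in a "fullurl" field.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs in the issue description.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def _query_url(params, endpoint=API_ENDPOINT):
    # Keep "|" unescaped so the URL reads like the hand-written examples.
    return endpoint + "?" + urlencode(params, safe="|")


def page_info_url(page_id, endpoint=API_ENDPOINT):
    """Build the prop=info request that returns a page's URL for a pageid."""
    params = {
        "action": "query",
        "prop": "info",
        "pageids": page_id,
        "inprop": "url",
        "format": "json",  # assumption: JSON output requested for parsing
    }
    return _query_url(params, endpoint)


def revisions_url(page_id=None, title=None, endpoint=API_ENDPOINT):
    """Build the prop=revisions request, addressed by pageid or by title."""
    if (page_id is None) == (title is None):
        raise ValueError("pass exactly one of page_id or title")
    params = {"action": "query", "prop": "revisions"}
    if page_id is not None:
        params["pageids"] = page_id
    else:
        params["titles"] = title
    params["rvprop"] = "timestamp|user|comment|content"
    params["format"] = "json"  # assumption, as above
    return _query_url(params, endpoint)


def full_url_from_response(response, page_id):
    """Pull the page URL out of an already-decoded prop=info JSON response.

    Returns None if the page or the field is missing.
    """
    pages = response.get("query", {}).get("pages", {})
    return pages.get(str(page_id), {}).get("fullurl")
```

Addressing pages by pageid rather than title has the advantage that the identifier stays stable when a page is renamed, which is presumably why the description asks about it.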
> - There needs to be some notion of an overall list of pages:
>        - http://www.mediawiki.org/wiki/API:Allpages
>        - Example: http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
> - Metadata information (author and pub date) also needs to be separated out in some way:
>        - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>        - Example:  http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
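The two requirements in the list above (enumerating all pages, and separating out author and date metadata) could be sketched roughly as follows. This is an illustration under stated assumptions, not the connector's implementation: iter_all_pages and revision_metadata are hypothetical names, the fetch argument is a caller-supplied function that performs the HTTP request and returns the decoded JSON, and the continuation handling assumes the API returns a top-level "continue" object whose entries are merged into the next request. Older MediaWiki releases report continuation differently.

```python
def iter_all_pages(fetch, start="", limit=5):
    """Yield (pageid, title) for every page, following list=allpages batches.

    `fetch` is a hypothetical callable: it takes a dict of query parameters
    and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "allpages",
        "apfrom": start,
        "aplimit": limit,
        "format": "json",
    }
    while True:
        response = fetch(dict(params))  # pass a copy so callers may keep it
        for page in response.get("query", {}).get("allpages", []):
            yield page["pageid"], page["title"]
        continuation = response.get("continue")
        if not continuation:
            return  # no further batches
        # Assumption: continuation values are merged into the next request.
        params.update(continuation)


def revision_metadata(response):
    """Map each page title to author, timestamp and comment of its revision.

    Expects a decoded prop=revisions JSON response requested with
    rvprop=timestamp|user|comment; pages without revisions are skipped.
    """
    result = {}
    for page in response.get("query", {}).get("pages", {}).values():
        revisions = page.get("revisions")
        if not revisions:
            continue
        latest = revisions[0]
        result[page["title"]] = {
            "author": latest.get("user"),
            "last_modified": latest.get("timestamp"),
            "comment": latest.get("comment"),
        }
    return result
```

Passing fetch in as a parameter keeps the paging logic independent of any particular HTTP client, so it can be exercised against canned responses.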
