Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2008/01/25 22:04:36 UTC

[jira] Commented: (LUCENE-1156) Wikipedia Document Generation Changes

    [ https://issues.apache.org/jira/browse/LUCENE-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562674#action_12562674 ] 

Grant Ingersoll commented on LUCENE-1156:
-----------------------------------------

I should amend the Template item.  Templates are not useless, but I am not sure how to incorporate them.  As far as I can tell, they are used for transclusion (Wikipedia's term) into other documents, something like an import statement, so a single template can be reused across many articles.  Technically, to index properly, one would have to resolve all the transclusions first and then index.  That doesn't seem worth the effort from a parsing standpoint.
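To make the "resolve first, then index" idea concrete, here is a minimal sketch of what resolving transclusions could look like. It is not part of EnwikiDocMaker. The class name TransclusionResolver, the in-memory template map, and the pass limit are all hypothetical, and it assumes the simplest MediaWiki form, {{Name}} or {{Name|args}}, with any arguments ignored.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of contrib/benchmark.
final class TransclusionResolver {
    // Matches {{Name}} or {{Name|args}} with no nested braces inside.
    private static final Pattern CALL =
        Pattern.compile("\\{\\{([^{}|]+)(?:\\|[^{}]*)?\\}\\}");

    // Assumed cap on expansion passes, so self-referencing templates terminate.
    private static final int MAX_PASSES = 5;

    private TransclusionResolver() {}

    /**
     * Replaces each known template call with the template's body.
     * Unknown templates are left untouched; arguments are ignored.
     */
    static String resolve(String text, Map<String, String> templates) {
        String current = text;
        for (int pass = 0; pass < MAX_PASSES; pass++) {
            Matcher m = CALL.matcher(current);
            StringBuffer out = new StringBuffer();
            boolean changed = false;
            while (m.find()) {
                String body = templates.get(m.group(1).trim());
                if (body != null) {
                    changed = true;
                }
                String replacement = (body != null) ? body : m.group();
                m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(out);
            current = out.toString();
            if (!changed) {
                break; // nothing left to expand
            }
        }
        return current;
    }
}
```

Even this toy version needs every template body loaded before any article can be processed, which is the extra pass over the dump that makes the approach look expensive.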

> Wikipedia Document Generation Changes
> -------------------------------------
>
>                 Key: LUCENE-1156
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1156
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/benchmark, contrib/wikipedia
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>
> The EnwikiDocMaker currently produces a fair number of documents that are in the download but are of dubious use for both benchmarking and indexing.
> The problem cases are:
> # Redirects (it currently only handles REDIRECT and redirect, but some documents use Redirect)
> # Template files appear to be useless.  These are marked by the term Template: at the beginning of the body.  See for example: http://en.wikipedia.org/wiki/Template:=)
> # Image-only pages, as in http://en.wikipedia.org/wiki/Image:Sciencefieldnewark.jpg.jpg  These are about as useful as the Redirects and Templates.
> # Files pending deletion:  This one is a bit trickier to handle, but they are generally marked by "Wikipedia:Votes for deletion" or some variation of that, depending on how far along the page is in the deletion process.
> I think I can implement this so that it is backward compatible, if there is such a need when it comes to the contrib/benchmark suite.
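The four cases listed in the issue can be sketched as a single predicate. This is only an illustration, not the actual EnwikiDocMaker change: the class WikiPageFilter and the method shouldSkip are hypothetical names, the redirect check assumes the body begins with the MediaWiki marker #REDIRECT in any letter case, and the Template: and Image: prefixes are checked on both the title and the body because the issue does not pin down which field carries them.

```java
import java.util.Locale;

// Hypothetical helper illustrating the filtering rules from the issue.
final class WikiPageFilter {
    private WikiPageFilter() {}

    /** True if text, after trimming, starts with prefix (case-insensitive). */
    static boolean startsWithIgnoreCase(String text, String prefix) {
        String t = text.trim();
        return t.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    /** Returns true for pages that are of little use for indexing. */
    static boolean shouldSkip(String title, String body) {
        String t = (title == null) ? "" : title;
        String b = (body == null) ? "" : body;

        // 1. Redirects: REDIRECT, redirect, Redirect, and so on.
        if (startsWithIgnoreCase(b, "#REDIRECT")) {
            return true;
        }
        // 2. Templates.
        if (startsWithIgnoreCase(t, "Template:") || startsWithIgnoreCase(b, "Template:")) {
            return true;
        }
        // 3. Image-only pages.
        if (startsWithIgnoreCase(t, "Image:") || startsWithIgnoreCase(b, "Image:")) {
            return true;
        }
        // 4. Pages pending deletion; only the one marker named in the issue
        //    is covered here, not its variations.
        return b.toLowerCase(Locale.ROOT).contains("wikipedia:votes for deletion");
    }
}
```

For backward compatibility, a check like this could sit behind a configuration switch that defaults to the old behavior, so existing benchmark runs still see the same document set.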

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org