Posted to issues@opennlp.apache.org by "Olivier Grisel (JIRA)" <ji...@apache.org> on 2011/07/04 14:06:21 UTC

[jira] [Issue Comment Edited] (OPENNLP-211) Add a Wikinews parser to the wikinews-importer

    [ https://issues.apache.org/jira/browse/OPENNLP-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059403#comment-13059403 ] 

Olivier Grisel edited comment on OPENNLP-211 at 7/4/11 12:06 PM:
-----------------------------------------------------------------

Please feel free to reuse and adapt the following classes from the pignlproc project.

The meat is in AnnotatingMarkupParser.java / Annotation.java.

Sample usage (in a Pig context, just as a reference) is in ParsingWikipediaLoader.java.

Some tests for AnnotatingMarkupParser are available in TestWikipediaParsing.java.

The main class is based on the following dependency:

https://code.google.com/p/gwtwiki/ (licensed under EPL 1.0, hence compatible with the ASF rules if distributed only in binary format, e.g. through Maven):

    <dependency>
      <groupId>info.bliki.wiki</groupId>
      <artifactId>bliki-core</artifactId>
      <version>3.0.16</version>
    </dependency>

Note: the gwtwiki project now features a new API with helpers dedicated to MediaWiki markup dump parsing here:
   https://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport

IIRC those helpers were not available when I started the pignlproc tools. It might be useful to investigate them directly too.
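
For reference, here is a minimal sketch of the dump-reading step that such helpers would cover. It does not use the bliki API or the attached classes: it relies only on the JDK's StAX parser and on the standard MediaWiki export layout (page / title / revision / text). The DumpPageReader and Page names are hypothetical, chosen for illustration only.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams a MediaWiki XML dump and collects the raw
// wiki markup of each revision together with its page title.
public class DumpPageReader {

    /** Hypothetical holder for one revision's title and raw markup. */
    public static final class Page {
        public final String title;
        public final String markup;

        Page(String title, String markup) {
            this.title = title;
            this.markup = markup;
        }
    }

    public static List<Page> readPages(Reader dump) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader xml = factory.createXMLStreamReader(dump);
        List<Page> pages = new ArrayList<>();
        String title = null;
        while (xml.hasNext()) {
            if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            // getLocalName() ignores the default namespace used by real dumps.
            String name = xml.getLocalName();
            if ("title".equals(name)) {
                title = xml.getElementText();
            } else if ("text".equals(name)) {
                // One entry per <text> element, i.e. per revision.
                pages.add(new Page(title, xml.getElementText()));
            }
        }
        xml.close();
        return pages;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny inline sample standing in for a real dump file.
        String sample = "<mediawiki><page><title>Example</title>"
                + "<revision><text>'''Bold''' [[Link]]</text></revision>"
                + "</page></mediawiki>";
        for (Page p : readPages(new StringReader(sample))) {
            System.out.println(p.title + " -> " + p.markup);
        }
    }
}
```

The markup string returned for each page is still raw wiki syntax; turning it into plain text plus annotations is the part that a markup parser such as AnnotatingMarkupParser would handle. For a full-size dump, the pages would more sensibly be handed to a callback one at a time instead of being collected in a list.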

> Add a Wikinews parser to the wikinews-importer
> ----------------------------------------------
>
>                 Key: OPENNLP-211
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-211
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Jörn Kottmann
>         Attachments: AnnotatingMarkupParser.java, Annotation.java, ParsingWikipediaLoader.java, TestWikipediaParsing.java
>
>
> The current wikinews-importer can only load existing XMI files. That should be fixed by adding a proper Wikinews parser which can turn the Wikinews dump into UIMA CASes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira