
Posted to dev@any23.apache.org by lewismc <gi...@git.apache.org> on 2017/02/24 01:32:46 UTC

[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

GitHub user lewismc opened a pull request:

    https://github.com/apache/any23/pull/34

    ANY23-304 Add extractor for OpenIE

    Hi Folks,
    This issue is a rework of #33 which takes on board @ansell's comments to add the new extractor as a separate module as opposed to inside core.
    A number of classes are also cleaned up for JDK 1.8 compliance.
    In addition, this new functionality augments the default configuration by introducing a confidence threshold of 0.5 for OpenIE extractions. Extractions with a confidence at or below this value are not converted into triples.
    I ran a test extraction on a reasonably demanding web page from [PO.DAAC](http://podaac.jpl.nasa.gov/aquarius), but right now I am not asserting anything.
    As far as I can see this is working pretty well but some community review would go a long way.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lewismc/any23 ANY23-304

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/34.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #34
    
----
commit 2ecfbff1dddaf57689b725feddba47c7921f726d
Author: Lewis John McGibbney <le...@gmail.com>
Date:   2017-02-24T01:26:03Z

    ANY23-304 Add extractor for OpenIE

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103831681
  
    --- Diff: openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.openie;
    +
    +import java.io.ByteArrayOutputStream;
    +import java.io.IOException;
    +
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +import org.apache.any23.extractor.ExtractionResultImpl;
    +import org.apache.any23.extractor.openie.OpenIEExtractor;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.any23.writer.RDFXMLWriter;
    +import org.apache.any23.writer.TripleHandler;
    +import org.apache.any23.writer.TripleHandlerException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.junit.After;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +/**
    + * @author lewismc
    + *
    + */
    +public class OpenIEExtractorTest {
    +
    +    private static final Logger logger = LoggerFactory.getLogger(OpenIEExtractorTest.class);
    +
    +    private OpenIEExtractor extractor;
    +
    +    @Before
    +    public void setUp() throws Exception {
    +        extractor = new OpenIEExtractor();
    +    }
    +
    +    @After
    +    public void tearDown() throws Exception {
    +        extractor = null;
    +    }
    +
    +    //@Ignore("This typically results in a JVM crash... disabled for the time being.")
    +    @Test
    +    public void testExtractFromHTMLDocument() 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +        final IRI uri = RDFUtils.iri("http://podaac.jpl.nasa.gov/aquarius");
    +        extract(uri, "/org/apache/any23/extractor/openie/example-openie.html");
    +    }
    +    
    +    public void extract(IRI uri, String filePath) 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +      ByteArrayOutputStream baos = new ByteArrayOutputStream();
    --- End diff --
    
    @ansell can you please elaborate? It's my understanding that we require the ```ByteArrayOutputStream``` as it is passed as a parameter to each ```TripleHandler``` implementation, e.g. ```RDFXMLWriter(baos)```.
    I would be happy to stream the extractions in an attempt to mitigate OOM; however, this would happen after the extraction, right? Not before, so I'm not sure how much memory we would be saving.
    
    If you could clarify it would be appreciated. Thanks.




[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Will commit within next day or so if there are no objections.



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Tests failed for me with OOM:
    
    ```
    [INFO] Compiling 1 source file to /home/mint/gitrepos/any23/openie/target/test-classes
    [INFO] 
    [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ apache-any23-openie ---
    
    -------------------------------------------------------
     T E S T S
    -------------------------------------------------------
    Running org.apache.any23.openie.OpenIEExtractorTest
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/mint/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/mint/.m2/repository/org/slf4j/slf4j-log4j12/1.7.21/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
    Loading feature templates.
    Loading models.
    Loading lexica.
    Loading configuration.
    Loading feature templates.
    Loading models.
    Loading feature templates.
    Loading models.
    Loading lexica.
    Loading feature templates.
    Loading models.
    Loading feature templates.
    Loading models.
    Loading lexica.
    Loading feature templates.
    Loading models.
    Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.977 sec <<< FAILURE! - in org.apache.any23.openie.OpenIEExtractorTest
    testExtractFromHTMLDocument(org.apache.any23.openie.OpenIEExtractorTest)  Time elapsed: 20.282 sec  <<< ERROR!
    java.lang.OutOfMemoryError: Java heap space
    	at org.apache.any23.openie.OpenIEExtractorTest.extract(OpenIEExtractorTest.java:75)
    	at org.apache.any23.openie.OpenIEExtractorTest.testExtractFromHTMLDocument(OpenIEExtractorTest.java:65)
    
    ```



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    OK, so implementing ExtractorPlugin is not necessary... none of the other plugins use this logic.
    I'm trying to get it working via the CLI appassembler script, but no joy yet.



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Unfortunately... due to the bugs regarding the ```META-INF/services``` directories being filtered out, the plugins for Any23 2.0 are not as useful as they should be, as they cannot be dynamically discovered when present on the classpath. We should potentially push Any23 2.1 once this patch is merged into master.



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Hi @ansell, in my last commit I've pushed a couple of (hopefully) satisfying additions, namely
     * removal of the openie module from the CLI (meaning that the openie extractor is not executed by default during normal unit test execution)
     * addition of some class loading logic which improves the flexibility of extractor detection based upon the presence of the openie extractor.
    
    The openie tests are now not executed by default... this will dramatically reduce 1) the time taken by the tests, and 2) the memory required to execute the tests.
    
    Thanks for any final review.
    Lewis
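
The class loading logic itself is not shown in this thread. As a rough sketch of one common way to detect an optional extractor, assuming detection means checking whether the extractor class is on the classpath, the following uses `Class.forName` without initializing the class. The `ExtractorDetection` class and `isPresent` method are hypothetical names for illustration; only the OpenIEExtractor class name comes from the diff.

```java
final class ExtractorDetection {

    /** Returns true if the named class can be found on the current classpath. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ExtractorDetection.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the diff; whether it is present depends on the classpath.
        String openie = "org.apache.any23.extractor.openie.OpenIEExtractor";
        System.out.println(openie + " present: " + isPresent(openie));
    }
}
```

A check like this lets callers skip memory-hungry code paths entirely when the optional module has not been added as a dependency.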



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/any23/pull/34



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103832543
  
    --- Diff: openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.openie;
    +
    +import java.io.ByteArrayOutputStream;
    +import java.io.IOException;
    +
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +import org.apache.any23.extractor.ExtractionResultImpl;
    +import org.apache.any23.extractor.openie.OpenIEExtractor;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.any23.writer.RDFXMLWriter;
    +import org.apache.any23.writer.TripleHandler;
    +import org.apache.any23.writer.TripleHandlerException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.junit.After;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +/**
    + * @author lewismc
    + *
    + */
    +public class OpenIEExtractorTest {
    +
    +    private static final Logger logger = LoggerFactory.getLogger(OpenIEExtractorTest.class);
    +
    +    private OpenIEExtractor extractor;
    +
    +    @Before
    +    public void setUp() throws Exception {
    +        extractor = new OpenIEExtractor();
    +    }
    +
    +    @After
    +    public void tearDown() throws Exception {
    +        extractor = null;
    +    }
    +
    +    //@Ignore("This typically results in a JVM crash... disabled for the time being.")
    +    @Test
    +    public void testExtractFromHTMLDocument() 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +        final IRI uri = RDFUtils.iri("http://podaac.jpl.nasa.gov/aquarius");
    +        extract(uri, "/org/apache/any23/extractor/openie/example-openie.html");
    +    }
    +    
    +    public void extract(IRI uri, String filePath) 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +      ByteArrayOutputStream baos = new ByteArrayOutputStream();
    --- End diff --
    
    OK I get you and yes this makes perfect sense.
    
    > Does the model load on every call to the CLI?
    
    ... yes... which I realize is far from ideal. The issue with this as well is that it will load on every document AFAIK so this is a major limitation of the approach as it currently sits.
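
One possible way to avoid reloading per document within a single JVM is to build the heavyweight object once and share it. The sketch below uses the initialization-on-demand holder idiom. `ModelHolder` and `ExpensiveModel` are hypothetical stand-ins, not Any23 or OpenIE classes, and this does not help across separate CLI invocations, each of which starts a new JVM.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ModelHolder {

    /** Counts how many times the (hypothetical) expensive model was built. */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Stand-in for a heavyweight object such as an NLP pipeline. */
    static final class ExpensiveModel {
        ExpensiveModel() {
            LOADS.incrementAndGet(); // real code would load models and lexica here
        }

        String extract(String text) {
            return "extracted:" + text;
        }
    }

    // Holder idiom: the JVM initializes Lazy on first access, exactly once and thread-safely.
    private static final class Lazy {
        static final ExpensiveModel INSTANCE = new ExpensiveModel();
    }

    static ExpensiveModel model() {
        return Lazy.INSTANCE;
    }

    public static void main(String[] args) {
        model().extract("first document");
        model().extract("second document");
        System.out.println("model loads: " + LOADS.get()); // prints "model loads: 1"
    }
}
```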



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/34
  
    My main objections before were about the larger memory requirements for default use and not being able to run the tests without OOM on my mid-range development machine.



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Yep, you're right. Bang on the money. I'll update the PR.



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/34
  
    The cli module may need the new module added as a dependency to pull it onto the classpath. Strangely enough, it appears as though none of the other plugins are cli dependencies.



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103806159
  
    --- Diff: openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.extractor.openie;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import javax.xml.transform.TransformerConfigurationException;
    +import javax.xml.transform.TransformerFactoryConfigurationError;
    +
    +import org.apache.any23.extractor.Extractor;
    +import org.apache.any23.configuration.Configuration;
    +import org.apache.any23.configuration.DefaultConfiguration;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractorDescription;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.tika.Tika;
    +import org.apache.tika.exception.TikaException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.vocabulary.RDF;
    +import org.eclipse.rdf4j.model.vocabulary.RDFS;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.Document;
    +
    +import edu.knowitall.openie.Argument;
    +import edu.knowitall.openie.Instance;
    +import edu.knowitall.openie.OpenIE;
    +import edu.knowitall.tool.parse.ClearParser;
    +import edu.knowitall.tool.postag.ClearPostagger;
    +import edu.knowitall.tool.srl.ClearSrl;
    +import edu.knowitall.tool.tokenize.ClearTokenizer;
    +import scala.collection.JavaConversions;
    +import scala.collection.Seq;
    +
    +/**
    + * An <a href="https://github.com/allenai/openie-standalone">OpenIE</a> 
    + * extractor able to generate <i>RDF</i> statements from 
    + * sentences representing relations in the text.
    + */
    +public class OpenIEExtractor implements Extractor.TagSoupDOMExtractor {
    +
    +    private static final Logger LOG = LoggerFactory.getLogger(OpenIEExtractor.class);
    +
    +    private IRI documentRoot;
    +
    +    /**
    +     * default constructor
    +     */
    +    public OpenIEExtractor() {
    +        // default constructor
    +    }
    +
    +    /**
    +     * @see org.apache.any23.extractor.Extractor#getDescription()
    +     */
    +    @Override
    +    public ExtractorDescription getDescription() {
    +        return OpenIEExtractorFactory.getDescriptionInstance();
    +    }
    +
    +    @Override
    +    public void run(ExtractionParameters extractionParameters,
    +            ExtractionContext context, Document in, ExtractionResult out)
    +                    throws IOException, ExtractionException {
    +
    +        IRI documentIRI = context.getDocumentIRI();
    +        documentRoot = RDFUtils.iri(documentIRI.toString() + "root");
    +        out.writeNamespace(RDF.PREFIX, RDF.NAMESPACE);
    +        out.writeNamespace(RDFS.PREFIX, RDFS.NAMESPACE);
    +        LOG.debug("Processing: {}", documentIRI.toString());
    +
    +        OpenIE openIE = new OpenIE(
    +                new ClearParser(
    +                        new ClearPostagger(
    +                                new ClearTokenizer())), new ClearSrl(), false, false);
    +
    +        Seq<Instance> extractions = null;
    +        Tika tika = new Tika();
    +        try {
    +            extractions = openIE.extract(tika.parseToString(StreamUtils.documentToInputStream(in)));
    +        } catch (TransformerConfigurationException | TransformerFactoryConfigurationError e) {
    +            LOG.error("Encountered error during OpenIE extraction.", e);
    +        } catch (TikaException e) {
    +            LOG.error("Encountered error whilst parsing InputStream with Tika.", e);
    +        }
    +
    +        List<Instance> listExtractions = JavaConversions.seqAsJavaList(extractions);
    +        // for each extraction instance we can obtain a number of extraction elements
    +        // instance.confidence() - a confidence value for the extraction itself
    +        // instance.extr().context() - an optional representation of the context for this extraction
    +        // instance.extr().arg1().text() - subject
    +        // instance.extr().rel().text() - predicate
    +        // instance.extr().arg2s().text() - object
    +        for(Instance instance : listExtractions) {
    +            final Configuration immutableConf = DefaultConfiguration.singleton();
    +            if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
    --- End diff --
    
    The Double.parseDouble here should be done once and stored in a local variable
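
A minimal sketch of that suggestion, using plain `java.util.Properties` in place of the Any23 `Configuration` so that it runs standalone. The property key and the "0.5" default are taken from the diff; the `ThresholdFilter` class and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class ThresholdFilter {

    // Property key and default value as they appear in the diff above.
    static final String KEY = "any23.extraction.openie.confidence.threshold";
    static final String DEFAULT = "0.5";

    /** Returns the confidences strictly above the configured threshold. */
    static List<Double> aboveThreshold(Properties conf, List<Double> confidences) {
        // Parse once, before the loop, instead of once per extraction.
        final double threshold = Double.parseDouble(conf.getProperty(KEY, DEFAULT));
        List<Double> kept = new ArrayList<>();
        for (double confidence : confidences) {
            if (confidence > threshold) {
                kept.add(confidence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> kept = aboveThreshold(new Properties(), Arrays.asList(0.2, 0.5, 0.9));
        System.out.println(kept); // prints [0.9]; 0.5 is not strictly above the threshold
    }
}
```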



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103832125
  
    --- Diff: openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.openie;
    +
    +import java.io.ByteArrayOutputStream;
    +import java.io.IOException;
    +
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +import org.apache.any23.extractor.ExtractionResultImpl;
    +import org.apache.any23.extractor.openie.OpenIEExtractor;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.any23.writer.RDFXMLWriter;
    +import org.apache.any23.writer.TripleHandler;
    +import org.apache.any23.writer.TripleHandlerException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.junit.After;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +/**
    + * @author lewismc
    + *
    + */
    +public class OpenIEExtractorTest {
    +
    +    private static final Logger logger = LoggerFactory.getLogger(OpenIEExtractorTest.class);
    +
    +    private OpenIEExtractor extractor;
    +
    +    @Before
    +    public void setUp() throws Exception {
    +        extractor = new OpenIEExtractor();
    +    }
    +
    +    @After
    +    public void tearDown() throws Exception {
    +        extractor = null;
    +    }
    +
    +    //@Ignore("This typically results in a JVM crash... disabled for the time being.")
    +    @Test
    +    public void testExtractFromHTMLDocument() 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +        final IRI uri = RDFUtils.iri("http://podaac.jpl.nasa.gov/aquarius");
    +        extract(uri, "/org/apache/any23/extractor/openie/example-openie.html");
    +    }
    +    
    +    public void extract(IRI uri, String filePath) 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +      ByteArrayOutputStream baos = new ByteArrayOutputStream();
    --- End diff --
    
    ByteArrayOutputStream will hold all of the results in memory. It may be useful to create a temporary file and reference it as a FileOutputStream, which will have a fixed memory buffer before writing to disk. Just trying to work through the possible avenues where memory requirements can be managed. It may be useful to work through in a debugger to identify the large memory requirement and where that can be lowered, as hopefully the CLI can still be used on small machines after this pull request. Does the model load on every call to the CLI?
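
The temporary-file idea could look roughly like the sketch below, which uses only the JDK. In the test the resulting stream would be handed to the writer in place of `baos`; here placeholder bytes are written instead so the example runs standalone. `TempFileSink`, the file prefix and the record contents are assumptions for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSink {

    /** Writes `lines` copies of a placeholder record to a temp file and returns its path. */
    static Path writeToTempFile(int lines) throws IOException {
        Path tmp = Files.createTempFile("openie-extraction", ".rdf");
        // The buffered stream holds only a small buffer (at most 8 KiB by default) on the heap;
        // everything beyond that is flushed to disk rather than accumulated in memory.
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp))) {
            byte[] record = "placeholder triple\n".getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < lines; i++) {
                out.write(record);
            }
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = writeToTempFile(1000);
        try {
            System.out.println(Files.size(tmp) + " bytes written to " + tmp);
        } finally {
            Files.delete(tmp); // clean up the temporary file
        }
    }
}
```

This bounds the memory used for output only; it does not reduce the memory needed by the extraction models themselves.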



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/34
  
    I haven't looked at it recently. The META-INF/services should be enough on their own without the explicit plugin support but I can't recall whether there are any other differences that could affect usage.



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Hi @ansell, this is now fixed.
    After a good bit of debugging I discovered that some erroneous ```<resources>``` descriptions in the plugin pom.xml files meant that the ```META-INF/services``` directories were being filtered out of the generated .jar artifacts... meaning that the ServiceLoader did not discover them.
    Anyway... if you could pull the code and let me know how you get on it would be appreciated. This is working well for me.
    One final thing to note: you will see that for the appassembler plugin definition in ```cli/pom.xml``` we increase the JVM memory argument to 6000m... this is because OpenIE is pretty memory intensive.
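
For readers unfamiliar with the mechanism: `java.util.ServiceLoader` finds implementations by reading text files under `META-INF/services/`, each named after the fully qualified service interface and listing implementation class names. If those files are missing from the jar, discovery silently returns nothing, as the sketch below shows. `PluginDiscovery` and its nested `Plugin` interface are hypothetical and are not the Any23 plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class PluginDiscovery {

    /** Hypothetical service interface standing in for an extractor factory. */
    interface Plugin {
        String name();
    }

    /** Collects every provider of `service` registered under META-INF/services on the classpath. */
    static <T> List<T> discover(Class<T> service) {
        List<T> found = new ArrayList<>();
        for (T provider : ServiceLoader.load(service)) {
            found.add(provider);
        }
        return found;
    }

    public static void main(String[] args) {
        // No provider-configuration file for Plugin is on the classpath in this sketch,
        // so the loader finds nothing and no error is raised.
        System.out.println("providers found: " + discover(Plugin.class).size());
    }
}
```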



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Hi @ansell, I finally got around to addressing your comments. Just to refresh your memory, use of FileOutputStream (as opposed to ByteArrayOutputStream) within the OpenIEExtractorTest.java logic is more performant, by around 1/4 second or so.
    Do you have any further comments on this patch?



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    @ansell is it necessary to put this new module into ```plugins``` and have the new extractor implement [ExtractorPlugin](http://any23.apache.org/apidocs/index.html?org/apache/any23/plugin/ExtractorPlugin.html)?



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103836818
  
    --- Diff: openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.extractor.openie;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import javax.xml.transform.TransformerConfigurationException;
    +import javax.xml.transform.TransformerFactoryConfigurationError;
    +
    +import org.apache.any23.extractor.Extractor;
    +import org.apache.any23.configuration.Configuration;
    +import org.apache.any23.configuration.DefaultConfiguration;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractorDescription;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.tika.Tika;
    +import org.apache.tika.exception.TikaException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.vocabulary.RDF;
    +import org.eclipse.rdf4j.model.vocabulary.RDFS;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.Document;
    +
    +import edu.knowitall.openie.Argument;
    +import edu.knowitall.openie.Instance;
    +import edu.knowitall.openie.OpenIE;
    +import edu.knowitall.tool.parse.ClearParser;
    +import edu.knowitall.tool.postag.ClearPostagger;
    +import edu.knowitall.tool.srl.ClearSrl;
    +import edu.knowitall.tool.tokenize.ClearTokenizer;
    +import scala.collection.JavaConversions;
    +import scala.collection.Seq;
    +
    +/**
    + * An <a href="https://github.com/allenai/openie-standalone">OpenIE</a> 
    + * extractor able to generate <i>RDF</i> statements from 
    + * sentences representing relations in the text.
    + */
    +public class OpenIEExtractor implements Extractor.TagSoupDOMExtractor {
    +
    +    private static final Logger LOG = LoggerFactory.getLogger(OpenIEExtractor.class);
    +
    +    private IRI documentRoot;
    +
    +    /**
    +     * default constructor
    +     */
    +    public OpenIEExtractor() {
    +        // default constructor
    +    }
    +
    +    /**
    +     * @see org.apache.any23.extractor.Extractor#getDescription()
    +     */
    +    @Override
    +    public ExtractorDescription getDescription() {
    +        return OpenIEExtractorFactory.getDescriptionInstance();
    +    }
    +
    +    @Override
    +    public void run(ExtractionParameters extractionParameters,
    +            ExtractionContext context, Document in, ExtractionResult out)
    +                    throws IOException, ExtractionException {
    +
    +        IRI documentIRI = context.getDocumentIRI();
    +        documentRoot = RDFUtils.iri(documentIRI.toString() + "root");
    +        out.writeNamespace(RDF.PREFIX, RDF.NAMESPACE);
    +        out.writeNamespace(RDFS.PREFIX, RDFS.NAMESPACE);
    +        LOG.debug("Processing: {}", documentIRI.toString());
    +
    +        OpenIE openIE = new OpenIE(
    +                new ClearParser(
    +                        new ClearPostagger(
    +                                new ClearTokenizer())), new ClearSrl(), false, false);
    +
    +        Seq<Instance> extractions = null;
    +        Tika tika = new Tika();
    +        try {
    +            extractions = openIE.extract(tika.parseToString(StreamUtils.documentToInputStream(in)));
    +        } catch (TransformerConfigurationException | TransformerFactoryConfigurationError e) {
    +            LOG.error("Encountered error during OpenIE extraction.", e);
    +        } catch (TikaException e) {
    +            LOG.error("Encountered error whilst parsing InputStream with Tika.", e);
    +        }
    +
    +        List<Instance> listExtractions = JavaConversions.seqAsJavaList(extractions);
    +        // for each extraction instance we can obtain a number of extraction elements
    +        // instance.confidence() - a confidence value for the extraction itself
    +        // instance.extr().context() - an optional representation of the context for this extraction
    +        // instance.extr().arg1().text() - subject
    +        // instance.extr().rel().text() - predicate
    +        // instance.extr().arg2s().text() - object
    +        for(Instance instance : listExtractions) {
    +            final Configuration immutableConf = DefaultConfiguration.singleton();
    +            if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
    --- End diff --
    
    I addressed this @ansell 



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Is it an optional plugin in the current setup, so that users with minimal memory available do not need to load it? I haven't had time to look through it, but I see there is a new openie module.



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103829816
  
    --- Diff: openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.extractor.openie;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import javax.xml.transform.TransformerConfigurationException;
    +import javax.xml.transform.TransformerFactoryConfigurationError;
    +
    +import org.apache.any23.extractor.Extractor;
    +import org.apache.any23.configuration.Configuration;
    +import org.apache.any23.configuration.DefaultConfiguration;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractorDescription;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.tika.Tika;
    +import org.apache.tika.exception.TikaException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.vocabulary.RDF;
    +import org.eclipse.rdf4j.model.vocabulary.RDFS;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.Document;
    +
    +import edu.knowitall.openie.Argument;
    +import edu.knowitall.openie.Instance;
    +import edu.knowitall.openie.OpenIE;
    +import edu.knowitall.tool.parse.ClearParser;
    +import edu.knowitall.tool.postag.ClearPostagger;
    +import edu.knowitall.tool.srl.ClearSrl;
    +import edu.knowitall.tool.tokenize.ClearTokenizer;
    +import scala.collection.JavaConversions;
    +import scala.collection.Seq;
    +
    +/**
    + * An <a href="https://github.com/allenai/openie-standalone">OpenIE</a> 
    + * extractor able to generate <i>RDF</i> statements from 
    + * sentences representing relations in the text.
    + */
    +public class OpenIEExtractor implements Extractor.TagSoupDOMExtractor {
    +
    +    private static final Logger LOG = LoggerFactory.getLogger(OpenIEExtractor.class);
    +
    +    private IRI documentRoot;
    +
    +    /**
    +     * default constructor
    +     */
    +    public OpenIEExtractor() {
    +        // default constructor
    +    }
    +
    +    /**
    +     * @see org.apache.any23.extractor.Extractor#getDescription()
    +     */
    +    @Override
    +    public ExtractorDescription getDescription() {
    +        return OpenIEExtractorFactory.getDescriptionInstance();
    +    }
    +
    +    @Override
    +    public void run(ExtractionParameters extractionParameters,
    +            ExtractionContext context, Document in, ExtractionResult out)
    +                    throws IOException, ExtractionException {
    +
    +        IRI documentIRI = context.getDocumentIRI();
    +        documentRoot = RDFUtils.iri(documentIRI.toString() + "root");
    +        out.writeNamespace(RDF.PREFIX, RDF.NAMESPACE);
    +        out.writeNamespace(RDFS.PREFIX, RDFS.NAMESPACE);
    +        LOG.debug("Processing: {}", documentIRI.toString());
    +
    +        OpenIE openIE = new OpenIE(
    +                new ClearParser(
    +                        new ClearPostagger(
    +                                new ClearTokenizer())), new ClearSrl(), false, false);
    +
    +        Seq<Instance> extractions = null;
    +        Tika tika = new Tika();
    +        try {
    +            extractions = openIE.extract(tika.parseToString(StreamUtils.documentToInputStream(in)));
    +        } catch (TransformerConfigurationException | TransformerFactoryConfigurationError e) {
    +            LOG.error("Encountered error during OpenIE extraction.", e);
    +        } catch (TikaException e) {
    +            LOG.error("Encountered error whilst parsing InputStream with Tika.", e);
    +        }
    +
    +        List<Instance> listExtractions = JavaConversions.seqAsJavaList(extractions);
    +        // for each extraction instance we can obtain a number of extraction elements
    +        // instance.confidence() - a confidence value for the extraction itself
    +        // instance.extr().context() - an optional representation of the context for this extraction
    +        // instance.extr().arg1().text() - subject
    +        // instance.extr().rel().text() - predicate
    +        // instance.extr().arg2s().text() - object
    +        for(Instance instance : listExtractions) {
    +            final Configuration immutableConf = DefaultConfiguration.singleton();
    +            if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
    +                List<Argument> listArg2s = JavaConversions.seqAsJavaList(instance.extr().arg2s());
    --- End diff --
    
    @ansell yes, having debugged the code pretty extensively by now, it reads the entire sequence into memory. You can see the [Scala Docs here](http://www.scala-lang.org/api/2.11.8/#scala.collection.JavaConversions$) for a bit of additional context; they state:
    
    > Implicitly converts a Scala Seq to a Java List. The returned Java List is backed by the provided Scala Seq and any side-effects of using it via the Java interface will be visible via the Scala interface and vice versa. If the Scala Seq was previously obtained from an implicit or explicit call of asSeq(java.util.List) then the original Java List will be returned. 
    
    This being said, I haven't noticed the size of ```instance.extr().arg2s()``` being overly large so far.
    
    My feeling is that the OOMs are stemming from loading the model(s); however, I may be wrong. Over on the [openie README](https://github.com/allenai/openie-standalone#memory-requirements) it states "...openie requires substantial memory.".
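
As a Java-only analogy for the "backed by" wording in the quoted Scala documentation, the sketch below contrasts a list view with a copy. It does not call the Scala API; `viewOf` is a hypothetical helper written for illustration. A view wraps the underlying data and reflects later changes, whereas a copy duplicates every element at the moment it is made.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

public class BackedViewSketch {

    /** Returns a read-only List view over the array; no elements are copied. */
    public static List<String> viewOf(final String[] source) {
        return new AbstractList<String>() {
            @Override
            public String get(int index) {
                return source[index];
            }

            @Override
            public int size() {
                return source.length;
            }
        };
    }

    public static void main(String[] args) {
        String[] source = {"subject", "predicate", "object"};
        List<String> view = viewOf(source);        // wrapper only, shares the array
        List<String> copy = new ArrayList<>(view); // duplicates all elements now

        source[2] = "changed";
        System.out.println("view sees: " + view.get(2));
        System.out.println("copy sees: " + copy.get(2));
    }
}
```

Under this reading, the conversion wrapper itself would add little memory; whether the underlying sequence is already fully materialised is a separate question that this sketch does not answer.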



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103806389
  
    --- Diff: openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.extractor.openie;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import javax.xml.transform.TransformerConfigurationException;
    +import javax.xml.transform.TransformerFactoryConfigurationError;
    +
    +import org.apache.any23.extractor.Extractor;
    +import org.apache.any23.configuration.Configuration;
    +import org.apache.any23.configuration.DefaultConfiguration;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractorDescription;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.tika.Tika;
    +import org.apache.tika.exception.TikaException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.vocabulary.RDF;
    +import org.eclipse.rdf4j.model.vocabulary.RDFS;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.Document;
    +
    +import edu.knowitall.openie.Argument;
    +import edu.knowitall.openie.Instance;
    +import edu.knowitall.openie.OpenIE;
    +import edu.knowitall.tool.parse.ClearParser;
    +import edu.knowitall.tool.postag.ClearPostagger;
    +import edu.knowitall.tool.srl.ClearSrl;
    +import edu.knowitall.tool.tokenize.ClearTokenizer;
    +import scala.collection.JavaConversions;
    +import scala.collection.Seq;
    +
    +/**
    + * An <a href="https://github.com/allenai/openie-standalone">OpenIE</a> 
    + * extractor able to generate <i>RDF</i> statements from 
    + * sentences representing relations in the text.
    + */
    +public class OpenIEExtractor implements Extractor.TagSoupDOMExtractor {
    +
    +    private static final Logger LOG = LoggerFactory.getLogger(OpenIEExtractor.class);
    +
    +    private IRI documentRoot;
    +
    +    /**
    +     * default constructor
    +     */
    +    public OpenIEExtractor() {
    +        // default constructor
    +    }
    +
    +    /**
    +     * @see org.apache.any23.extractor.Extractor#getDescription()
    +     */
    +    @Override
    +    public ExtractorDescription getDescription() {
    +        return OpenIEExtractorFactory.getDescriptionInstance();
    +    }
    +
    +    @Override
    +    public void run(ExtractionParameters extractionParameters,
    +            ExtractionContext context, Document in, ExtractionResult out)
    +                    throws IOException, ExtractionException {
    +
    +        IRI documentIRI = context.getDocumentIRI();
    +        documentRoot = RDFUtils.iri(documentIRI.toString() + "root");
    +        out.writeNamespace(RDF.PREFIX, RDF.NAMESPACE);
    +        out.writeNamespace(RDFS.PREFIX, RDFS.NAMESPACE);
    +        LOG.debug("Processing: {}", documentIRI.toString());
    +
    +        OpenIE openIE = new OpenIE(
    +                new ClearParser(
    +                        new ClearPostagger(
    +                                new ClearTokenizer())), new ClearSrl(), false, false);
    +
    +        Seq<Instance> extractions = null;
    +        Tika tika = new Tika();
    +        try {
    +            extractions = openIE.extract(tika.parseToString(StreamUtils.documentToInputStream(in)));
    +        } catch (TransformerConfigurationException | TransformerFactoryConfigurationError e) {
    +            LOG.error("Encountered error during OpenIE extraction.", e);
    +        } catch (TikaException e) {
    +            LOG.error("Encountered error whilst parsing InputStream with Tika.", e);
    +        }
    +
    +        List<Instance> listExtractions = JavaConversions.seqAsJavaList(extractions);
    +        // for each extraction instance we can obtain a number of extraction elements
    +        // instance.confidence() - a confidence value for the extraction itself
    +        // instance.extr().context() - an optional representation of the context for this extraction
    +        // instance.extr().arg1().text() - subject
    +        // instance.extr().rel().text() - predicate
    +        // instance.extr().arg2s().text() - object
    +        for(Instance instance : listExtractions) {
    +            final Configuration immutableConf = DefaultConfiguration.singleton();
    +            if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
    +                List<Argument> listArg2s = JavaConversions.seqAsJavaList(instance.extr().arg2s());
    --- End diff --
    
    I am not very familiar with Scala, but does this bring an Iterator-like element into memory completely instead of processing it in a streaming fashion? (Just trying to understand why the OOMs are occurring.)



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    PING... is anyone able to provide a review? It would be very much appreciated.



[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

Posted by lewismc <gi...@git.apache.org>.

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/34
  
    Hi @ansell, yes, this is a separate module; however, it currently always builds with the CLI module. I'm going to push an update which disables the module tests by default.



[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on a diff in the pull request:

    https://github.com/apache/any23/pull/34#discussion_r103806632
  
    --- Diff: openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.any23.openie;
    +
    +import java.io.ByteArrayOutputStream;
    +import java.io.IOException;
    +
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.apache.any23.extractor.ExtractionException;
    +import org.apache.any23.extractor.ExtractionParameters;
    +import org.apache.any23.extractor.ExtractionResult;
    +import org.apache.any23.extractor.ExtractionResultImpl;
    +import org.apache.any23.extractor.openie.OpenIEExtractor;
    +import org.apache.any23.rdf.RDFUtils;
    +import org.apache.any23.util.StreamUtils;
    +import org.apache.any23.writer.RDFXMLWriter;
    +import org.apache.any23.writer.TripleHandler;
    +import org.apache.any23.writer.TripleHandlerException;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.junit.After;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +/**
    + * @author lewismc
    + *
    + */
    +public class OpenIEExtractorTest {
    +
    +    private static final Logger logger = LoggerFactory.getLogger(OpenIEExtractorTest.class);
    +
    +    private OpenIEExtractor extractor;
    +
    +    @Before
    +    public void setUp() throws Exception {
    +        extractor = new OpenIEExtractor();
    +    }
    +
    +    @After
    +    public void tearDown() throws Exception {
    +        extractor = null;
    +    }
    +
    +    //@Ignore("This typically results in a JVM crash... disabled for the time being.")
    +    @Test
    +    public void testExtractFromHTMLDocument() 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +        final IRI uri = RDFUtils.iri("http://podaac.jpl.nasa.gov/aquarius");
    +        extract(uri, "/org/apache/any23/extractor/openie/example-openie.html");
    +    }
    +    
    +    public void extract(IRI uri, String filePath) 
    +      throws IOException, ExtractionException, TripleHandlerException {
    +      ByteArrayOutputStream baos = new ByteArrayOutputStream();
    --- End diff --
    
    Writing to a file instead of ByteArrayOutputStream may alleviate some of the memory pressures.
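
A minimal sketch of that suggestion, not the actual test code: `writeToTempFile` is a hypothetical helper, and the written lines stand in for whatever the triple writer would emit. The output goes through a buffered stream to a temporary file, so the full serialisation never has to sit on the heap the way it does with a ByteArrayOutputStream.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class FileSinkSketch {

    /** Writes each line, newline-terminated, to a new temporary file and returns its path. */
    public static Path writeToTempFile(Iterable<String> lines) throws IOException {
        Path target = Files.createTempFile("openie-sketch", ".out");
        // try-with-resources flushes and closes the stream even if a write fails
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            for (String line : lines) {
                out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path written = writeToTempFile(Arrays.asList("first", "second"));
        System.out.println("Wrote " + Files.size(written) + " bytes to " + written);
        Files.delete(written); // clean up the temporary file
    }
}
```

In a test, the same stream could be handed to the writer in place of the in-memory buffer, with the file deleted in the tear-down step.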

