Posted to dev@any23.apache.org by jgrzebyta <gi...@git.apache.org> on 2018/09/06 12:21:29 UTC

[GitHub] any23 pull request #121: Any23 396: Add ability to run extractors in flow

GitHub user jgrzebyta opened a pull request:

    https://github.com/apache/any23/pull/121

    Any23 396: Add ability to run extractors in flow 

    - Add unit test with example. The test contains expected model.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jgrzebyta/any23 ANY23-396

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/121.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #121
    
----
commit 02259062d36f64ceb08ea5183c5136767f7ca873
Author: Jacek Grzebyta <jg...@...>
Date:   2018-09-06T11:58:34Z

    Ref ANY23-392
    
    - add unit test with expected solution

commit d68426252576e936f7216bf5d4c409e85bbbd861
Author: Jacek Grzebyta <jg...@...>
Date:   2018-09-06T12:18:56Z

    Ref ANY23-392
    
    - add --workflows command line argument

----


---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/121
  
    Aside from the comments I've made on this PR, I'm still not convinced that having a ModelExtractor is a good idea in the first place. Why not just create a ModelWriter (as in ANY23-397) or an equivalent "collecting" TripleHandler, and then allow the end user to transform the collected statements however they wish?
    
    Having a ModelExtractor creates additional questions & complexities: in what order are the extractors executed? (Certainly the ModelExtractors would have to be executed last in order to have access to all previously collected statements.) What if multiple ModelExtractors are declared? Which ones have higher precedence in the extraction order?
    
    I'm not sure that having a dedicated ModelExtractor is worth the trouble of dealing with these complexities, when a user could accomplish the same thing by simply transforming the statements collected by a ModelWriter or equivalent, or defining their own filtering and/or mapping TripleHandler.


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217159830
  
    --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
    @@ -0,0 +1,161 @@
    +package org.apache.any23.writer;
    +
    +import com.google.common.base.Throwables;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Model;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
    +import org.eclipse.rdf4j.model.impl.TreeModelFactory;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.Map;
    +import java.util.Stack;
    +import java.util.TreeMap;
    +
    +/**
    + * Collects all statements until end document.
    + *
    + * All statements are kept within {@link Model}.
    + *
    + * @author Jacek Grzebyta (jgrzebyta@apache.org)
    + */
    +public class BufferedTripleHandler implements TripleHandler {
    +
    +    private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class);
    +    private TripleHandler underlying;
    +    private static boolean isDocumentFinish = false;
    +
    +    private static class ContextHandler {
    +        ContextHandler(ExtractionContext ctx, Model m) {
    +            extractionContext = ctx;
    +            extractionModel = m;
    +        }
    +        ExtractionContext extractionContext;
    +        Model extractionModel;
    +    }
    +
    +    private static class WorkflowContext {
    +        WorkflowContext(TripleHandler underlying) {
    +            this.rootHandler = underlying;
    +        }
    +
    +
    +        Stack<String> extractors = new Stack<>();
    +        Map<String, ContextHandler> modelMap = new TreeMap<>();
    +        IRI documentIRI = null;
    +        TripleHandler rootHandler ;
    +    }
    +
    +    public BufferedTripleHandler(TripleHandler underlying) {
    +        this.underlying = underlying;
    +
    +        // hide model in the thread
    +        WorkflowContext wc = new WorkflowContext(underlying);
    +        BufferedTripleHandler.workflowContext.set(wc);
    +    }
    +
    +    private static final ThreadLocal<WorkflowContext> workflowContext = new ThreadLocal<>();
    +
    +    /**
    +     * Returns model which contains all other models.
    +     * @return
    +     */
    +    public static Model getModel() {
    +        return BufferedTripleHandler.workflowContext.get().modelMap.values().stream()
    +                .map(ch -> ch.extractionModel)
    +                .reduce(new LinkedHashModelFactory().createEmptyModel(), (mf, exm) -> {
    +                    mf.addAll(exm);
    +                    return mf;
    +                });
    +    }
    +
    +    @Override
    +    public void startDocument(IRI documentIRI) throws TripleHandlerException {
    +        BufferedTripleHandler.workflowContext.get().documentIRI = documentIRI;
    +    }
    +
    +    @Override
    +    public void openContext(ExtractionContext context) throws TripleHandlerException {
    +        //
    +    }
    +
    +    @Override
    +    public void receiveTriple(Resource s, IRI p, Value o, IRI g, ExtractionContext context) throws TripleHandlerException {
    +        getModelForContext(context).add(s,p,o,g);
    +    }
    +
    +    @Override
    +    public void receiveNamespace(String prefix, String uri, ExtractionContext context) throws TripleHandlerException {
    +        getModelForContext(context).setNamespace(prefix, uri);
    +    }
    +
    +    @Override
    +    public void closeContext(ExtractionContext context) throws TripleHandlerException {
    +        //
    +    }
    +
    +    @Override
    +    public void endDocument(IRI documentIRI) throws TripleHandlerException {
    +        BufferedTripleHandler.isDocumentFinish = true;
    +    }
    +
    +    @Override
    +    public void setContentLength(long contentLength) {
    +        underlying.setContentLength(contentLength);
    +    }
    +
    +    @Override
    +    public void close() throws TripleHandlerException {
    +        underlying.close();
    +    }
    +
    +    /**
    +     * Releases content of the model into underlying writer.
    +     */
    +    public static void releaseModel() throws TripleHandlerException {
    +        if(!BufferedTripleHandler.isDocumentFinish) {
    +            throw new RuntimeException("Before releasing document should be finished.");
    +        }
    +
    +        WorkflowContext workflowContext = BufferedTripleHandler.workflowContext.get();
    +
    +        String lastExtractor = ((Stack<String>) workflowContext.extractors).peek();
    --- End diff --
    
    @jgrzebyta IMHO, it would be vastly more straightforward to simply have the user extend the [`CompositeTripleHandler`](https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/writer/CompositeTripleHandler.java) class to filter and transform triples into a domain-specific rdf graph of their choosing, before delegating the final domain-specific triple outputs to the wrapped `TripleHandler` instance(s) by calling `super.receiveTriple( [modified triple] )`.
    
    (Analogous in concept to Java's own [`FilterOutputStream`](https://docs.oracle.com/javase/8/docs/api/java/io/FilterOutputStream.html) class.)
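
    A minimal sketch of that filter-style idea, not Any23's actual API. `SimpleTripleHandler` is a hypothetical, simplified stand-in for `TripleHandler` (the real interface has more methods and works with RDF4J `Resource`/`IRI`/`Value` types), and the `example.org` predicates are made up for illustration. To stay self-contained it wraps a delegate directly rather than extending `CompositeTripleHandler`.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical, simplified stand-in for Any23's TripleHandler.
    interface SimpleTripleHandler {
        void receiveTriple(String s, String p, String o);
    }

    // Terminal handler that records whatever reaches it.
    class RecordingHandler implements SimpleTripleHandler {
        final List<String> triples = new ArrayList<>();

        @Override
        public void receiveTriple(String s, String p, String o) {
            triples.add(s + " " + p + " " + o);
        }
    }

    // Wraps a delegate, drops unwanted triples and rewrites one predicate
    // before forwarding, in the spirit of java.io.FilterOutputStream.
    class DomainFilterHandler implements SimpleTripleHandler {
        private final SimpleTripleHandler delegate;

        DomainFilterHandler(SimpleTripleHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void receiveTriple(String s, String p, String o) {
            if (p.startsWith("http://example.org/internal/")) {
                return; // filtered out: never reaches the delegate
            }
            // Map (made-up) predicate a to predicate b; pass everything else through.
            String mapped = p.equals("http://example.org/a") ? "http://example.org/b" : p;
            delegate.receiveTriple(s, mapped, o);
        }
    }

    class FilterDemo {
        static List<String> run() {
            RecordingHandler sink = new RecordingHandler();
            SimpleTripleHandler handler = new DomainFilterHandler(sink);
            handler.receiveTriple("ex:s1", "http://example.org/a", "ex:o1");
            handler.receiveTriple("ex:s2", "http://example.org/internal/debug", "ex:o2");
            handler.receiveTriple("ex:s3", "http://example.org/c", "ex:o3");
            return sink.triples;
        }

        public static void main(String[] args) {
            run().forEach(System.out::println);
        }
    }
    ```

    The wrapper sees one triple at a time and keeps no state, so nothing is buffered and no ordering between extractors has to be defined.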



---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/121
  
    @jgrzebyta One easy solution to my above comment that I can think of right off the bat is as follows:
    
    First, we could extend the WriterFactory interface as follows (or similar):
    
    ```
    interface DelegatingWriterFactory extends WriterFactory {
        TripleHandler getWriter(TripleHandler delegate);
    }
    ```
    
    Second, in the rover `--format` flag (which actually accepts a WriterFactory *id*, not necessarily a format name), we could simply allow a comma-separated *list* of WriterFactory ids rather than a single id. Then, to construct the final writer, we'd compose each writer in the list with the previous one, i.e.:
    
    ```
    Collections.reverse(listOfIds);
    
    tripleHandler = getWriterFactoryForId(listOfIds.get(0)).getRdfWriter(outputStream);
    
    for (String id : listOfIds.subList(1, listOfIds.size())) {
        tripleHandler = ((DelegatingWriterFactory)getWriterFactoryForId(id)).getWriter(tripleHandler);
    }
    ```
    This is just one initial idea, but food for thought. It also seems more in line with your concept of a "flow".
    
    What do you think?
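
    A self-contained sketch of how such a chain could be composed, under stated assumptions: `Sink`, the two registries and the ids `tag`, `upper` and `list` are all hypothetical stand-ins (for `TripleHandler`, the writer-factory lookup and real WriterFactory ids respectively), and triples are plain strings. It is not the rover implementation.

    ```java
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;
    import java.util.function.UnaryOperator;

    // Hypothetical stand-in for a triple handler with a single method.
    interface Sink {
        void accept(String triple);
    }

    class ChainDemo {
        // Hypothetical registry of terminal writers: id -> factory taking the output target.
        static final Map<String, Function<List<String>, Sink>> TERMINALS = new HashMap<>();
        // Hypothetical registry of delegating writers: id -> wrapper around the next handler.
        static final Map<String, UnaryOperator<Sink>> DECORATORS = new HashMap<>();

        static {
            TERMINALS.put("list", out -> t -> out.add(t));
            DECORATORS.put("upper", next -> t -> next.accept(t.toUpperCase()));
            DECORATORS.put("tag", next -> t -> next.accept("[tagged] " + t));
        }

        // Builds the handler chain for ids such as "tag,upper,list".
        // The last id names the terminal writer; each earlier id wraps what follows it.
        static Sink compose(String commaSeparatedIds, List<String> output) {
            List<String> ids = new ArrayList<>();
            for (String id : commaSeparatedIds.split(",")) {
                ids.add(id.trim());
            }
            Collections.reverse(ids);

            Function<List<String>, Sink> terminal = TERMINALS.get(ids.get(0));
            if (terminal == null) {
                throw new IllegalArgumentException("Not a terminal writer id: " + ids.get(0));
            }
            Sink handler = terminal.apply(output);

            for (String id : ids.subList(1, ids.size())) {
                UnaryOperator<Sink> decorator = DECORATORS.get(id);
                if (decorator == null) {
                    throw new IllegalArgumentException("Not a delegating writer id: " + id);
                }
                handler = decorator.apply(handler);
            }
            return handler;
        }

        public static void main(String[] args) {
            List<String> out = new ArrayList<>();
            compose("tag,upper,list", out).accept("a b c");
            System.out.println(out);
        }
    }
    ```

    With this construction the ids read left to right in the order the triples flow: `tag` sees each triple first, `upper` second, and `list` receives the final result.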



---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217041561
  
    --- Diff: core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java ---
    @@ -483,6 +488,14 @@ private SingleExtractionReport runExtractor(
                             documentReport.getDocument(),
                             extractionResult
                     );
    +            } else if (extractor instanceof ModelExtractor) {
    +                final ModelExtractor modelExtractor = (ModelExtractor) extractor;
    +                final Model singleModel = BufferedTripleHandler.getModel();
    --- End diff --
    
    Should not be static.


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217037875
  
    --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
    @@ -0,0 +1,161 @@
    +package org.apache.any23.writer;
    +
    +import com.google.common.base.Throwables;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Model;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
    +import org.eclipse.rdf4j.model.impl.TreeModelFactory;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.Map;
    +import java.util.Stack;
    +import java.util.TreeMap;
    +
    +/**
    + * Collects all statements until end document.
    + *
    + * All statements are kept within {@link Model}.
    + *
    + * @author Jacek Grzebyta (jgrzebyta@apache.org)
    + */
    +public class BufferedTripleHandler implements TripleHandler {
    +
    +    private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class);
    +    private TripleHandler underlying;
    +    private static boolean isDocumentFinish = false;
    +
    +    private static class ContextHandler {
    +        ContextHandler(ExtractionContext ctx, Model m) {
    +            extractionContext = ctx;
    +            extractionModel = m;
    +        }
    +        ExtractionContext extractionContext;
    +        Model extractionModel;
    +    }
    +
    +    private static class WorkflowContext {
    +        WorkflowContext(TripleHandler underlying) {
    +            this.rootHandler = underlying;
    +        }
    +
    +
    +        Stack<String> extractors = new Stack<>();
    +        Map<String, ContextHandler> modelMap = new TreeMap<>();
    +        IRI documentIRI = null;
    +        TripleHandler rootHandler ;
    +    }
    +
    +    public BufferedTripleHandler(TripleHandler underlying) {
    +        this.underlying = underlying;
    +
    +        // hide model in the thread
    +        WorkflowContext wc = new WorkflowContext(underlying);
    +        BufferedTripleHandler.workflowContext.set(wc);
    +    }
    +
    +    private static final ThreadLocal<WorkflowContext> workflowContext = new ThreadLocal<>();
    --- End diff --
    
    Model should not be static, unless there is a very good reason for doing so?


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by jgrzebyta <gi...@git.apache.org>.

Github user jgrzebyta closed the pull request at:

    https://github.com/apache/any23/pull/121


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by jgrzebyta <gi...@git.apache.org>.

Github user jgrzebyta commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217143848
  
    --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
    @@ -0,0 +1,161 @@
    +package org.apache.any23.writer;
    +
    +import com.google.common.base.Throwables;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Model;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
    +import org.eclipse.rdf4j.model.impl.TreeModelFactory;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.Map;
    +import java.util.Stack;
    +import java.util.TreeMap;
    +
    +/**
    + * Collects all statements until end document.
    + *
    + * All statements are kept within {@link Model}.
    + *
    + * @author Jacek Grzebyta (jgrzebyta@apache.org)
    + */
    +public class BufferedTripleHandler implements TripleHandler {
    +
    +    private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class);
    +    private TripleHandler underlying;
    +    private static boolean isDocumentFinish = false;
    +
    +    private static class ContextHandler {
    +        ContextHandler(ExtractionContext ctx, Model m) {
    +            extractionContext = ctx;
    +            extractionModel = m;
    +        }
    +        ExtractionContext extractionContext;
    +        Model extractionModel;
    +    }
    +
    +    private static class WorkflowContext {
    +        WorkflowContext(TripleHandler underlying) {
    +            this.rootHandler = underlying;
    +        }
    +
    +
    +        Stack<String> extractors = new Stack<>();
    +        Map<String, ContextHandler> modelMap = new TreeMap<>();
    +        IRI documentIRI = null;
    +        TripleHandler rootHandler ;
    +    }
    +
    +    public BufferedTripleHandler(TripleHandler underlying) {
    +        this.underlying = underlying;
    +
    +        // hide model in the thread
    +        WorkflowContext wc = new WorkflowContext(underlying);
    +        BufferedTripleHandler.workflowContext.set(wc);
    +    }
    +
    +    private static final ThreadLocal<WorkflowContext> workflowContext = new ThreadLocal<>();
    +
    +    /**
    +     * Returns model which contains all other models.
    +     * @return
    +     */
    +    public static Model getModel() {
    +        return BufferedTripleHandler.workflowContext.get().modelMap.values().stream()
    +                .map(ch -> ch.extractionModel)
    +                .reduce(new LinkedHashModelFactory().createEmptyModel(), (mf, exm) -> {
    +                    mf.addAll(exm);
    +                    return mf;
    +                });
    +    }
    +
    +    @Override
    +    public void startDocument(IRI documentIRI) throws TripleHandlerException {
    +        BufferedTripleHandler.workflowContext.get().documentIRI = documentIRI;
    +    }
    +
    +    @Override
    +    public void openContext(ExtractionContext context) throws TripleHandlerException {
    +        //
    +    }
    +
    +    @Override
    +    public void receiveTriple(Resource s, IRI p, Value o, IRI g, ExtractionContext context) throws TripleHandlerException {
    +        getModelForContext(context).add(s,p,o,g);
    +    }
    +
    +    @Override
    +    public void receiveNamespace(String prefix, String uri, ExtractionContext context) throws TripleHandlerException {
    +        getModelForContext(context).setNamespace(prefix, uri);
    +    }
    +
    +    @Override
    +    public void closeContext(ExtractionContext context) throws TripleHandlerException {
    +        //
    +    }
    +
    +    @Override
    +    public void endDocument(IRI documentIRI) throws TripleHandlerException {
    +        BufferedTripleHandler.isDocumentFinish = true;
    +    }
    +
    +    @Override
    +    public void setContentLength(long contentLength) {
    +        underlying.setContentLength(contentLength);
    +    }
    +
    +    @Override
    +    public void close() throws TripleHandlerException {
    +        underlying.close();
    +    }
    +
    +    /**
    +     * Releases content of the model into underlying writer.
    +     */
    +    public static void releaseModel() throws TripleHandlerException {
    +        if(!BufferedTripleHandler.isDocumentFinish) {
    +            throw new RuntimeException("Before releasing document should be finished.");
    +        }
    +
    +        WorkflowContext workflowContext = BufferedTripleHandler.workflowContext.get();
    +
    +        String lastExtractor = ((Stack<String>) workflowContext.extractors).peek();
    --- End diff --
    
    The idea is to find the extractor which produces the customer's domain-specific RDF graph.


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217044167
  
    --- Diff: cli/src/main/java/org/apache/any23/cli/Rover.java ---
    @@ -172,6 +174,8 @@ protected void configure() {
                                                  defaultns);
             }
     
    +        extractionParameters.setFlag(ExtractionParameters.EXTRACTION_WORKFLOWS_FLAG, workflow);
    --- End diff --
    
    We should not need a separate flag to enable certain extractors. If an extractor is contained within the extractor group we are using, then that should be, on its own, enough to enable itself.


---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/121
  
    @jgrzebyta please check out [this PR](https://github.com/apache/any23/pull/122), which is an implementation of what I've just described.
    
    It seems like a lot simpler and less error-prone way to produce a domain-specific rdf graph.
    
    Eager to know your thoughts!


---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/121
  
    Another thought:
    
    Using the TripleHandler interface (as intended) to transform triples, rather than a separate ModelExtractor, has the added advantage that the triples might not necessarily need to be stored in memory during the transformation process. The user could implement either a "collecting" triple handler which stores statements in memory prior to transforming them, or a "streaming" triple handler for transformation-on-the-fly (e.g., if mapping some predicate A to some other predicate B), or some combination of these two concepts. The "collecting" ability could be easily supplemented with a `ModelWriter` or equivalent, as in [ANY23-397](https://issues.apache.org/jira/browse/ANY23-397).
    
    But adding a separate "ModelExtractor" concept only muddles this already-existing ability to transform triples with TripleHandlers by introducing a redundant construct of more limited abstraction power than what already exists.
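
    To illustrate the "collecting" variant, here is a rough sketch under simplifying assumptions: `DocHandler` is a hypothetical two-method stand-in for `TripleHandler`, triples are plain strings, and the rule "only forward subjects that have an `rdf:type` triple" is an invented example of a transformation that needs the whole graph. The buffer is an instance field, so each handler carries its own state.

    ```java
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical, simplified stand-in for Any23's TripleHandler.
    interface DocHandler {
        void receiveTriple(String s, String p, String o);
        void endDocument();
    }

    // Terminal handler that records what it receives.
    class ListSink implements DocHandler {
        final List<String> lines = new ArrayList<>();

        @Override
        public void receiveTriple(String s, String p, String o) {
            lines.add(s + " " + p + " " + o);
        }

        @Override
        public void endDocument() {
            lines.add("END");
        }
    }

    // Buffers every triple, then forwards only those whose subject has an rdf:type.
    class TypedSubjectsOnly implements DocHandler {
        private final DocHandler delegate;
        private final List<String[]> buffer = new ArrayList<>(); // per-instance state

        TypedSubjectsOnly(DocHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void receiveTriple(String s, String p, String o) {
            buffer.add(new String[] {s, p, o});
        }

        @Override
        public void endDocument() {
            Set<String> typed = new HashSet<>();
            for (String[] t : buffer) {
                if (t[1].equals("rdf:type")) {
                    typed.add(t[0]);
                }
            }
            for (String[] t : buffer) {
                if (typed.contains(t[0])) {
                    delegate.receiveTriple(t[0], t[1], t[2]);
                }
            }
            buffer.clear();
            delegate.endDocument();
        }
    }

    class CollectDemo {
        static List<String> run() {
            ListSink sink = new ListSink();
            DocHandler handler = new TypedSubjectsOnly(sink);
            handler.receiveTriple("ex:s1", "ex:name", "Alice");
            handler.receiveTriple("ex:s2", "ex:name", "Bob");
            handler.receiveTriple("ex:s1", "rdf:type", "ex:Person");
            if (!sink.lines.isEmpty()) {
                throw new IllegalStateException("nothing should be forwarded before endDocument");
            }
            handler.endDocument();
            return sink.lines;
        }

        public static void main(String[] args) {
            run().forEach(System.out::println);
        }
    }
    ```

    The trade-off is memory: this handler holds the whole document's triples until `endDocument`, whereas a streaming handler forwards each triple as it arrives.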
    
    So for me:
    -1 for ANY23-396
    +1 for ANY23-397
    
    @lewismc any thoughts?


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217044600
  
    --- Diff: api/src/main/resources/default-configuration.properties ---
    @@ -76,3 +76,6 @@ any23.extraction.csv.comment=#
     # A confidence threshold for the OpenIE extractions
     # Any extractions below this value will not be processed.
     any23.extraction.openie.confidence.threshold=0.5
    +
    +# Allows to enable(on)/disable(off) the workflow feature
    +any23.extraction.workflows=off
    --- End diff --
    
    No extra flag should be needed for this.


---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by jgrzebyta <gi...@git.apache.org>.

Github user jgrzebyta commented on the issue:

    https://github.com/apache/any23/pull/121
  
    @HansBrende ok. Thanks.


---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/121
  
    @jgrzebyta But as far as rover goes, you're right: we currently don't have support for using an arbitrary triple handler. Looks like it expects an RDFFormat and then finds a triple handler based on that. 
    
    I wonder if it would be possible to allow a more flexible way to specify triple handlers as rover arguments to fix this problem? While I don't think that creating a `ModelExtractor` as currently defined in this PR is the way to go, I do think that rover needs to be improved in this respect.
    
    I will think on this.


---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by jgrzebyta <gi...@git.apache.org>.

Github user jgrzebyta commented on the issue:

    https://github.com/apache/any23/pull/121
  
    I guess I have finished the first version. Any comments? 


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by jgrzebyta <gi...@git.apache.org>.

Github user jgrzebyta commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217142024
  
    --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
    @@ -0,0 +1,161 @@
    +package org.apache.any23.writer;
    +
    +import com.google.common.base.Throwables;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Model;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
    +import org.eclipse.rdf4j.model.impl.TreeModelFactory;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.Map;
    +import java.util.Stack;
    +import java.util.TreeMap;
    +
    +/**
    + * Collects all statements until end document.
    + *
    + * All statements are kept within {@link Model}.
    + *
    + * @author Jacek Grzebyta (jgrzebyta@apache.org)
    + */
    +public class BufferedTripleHandler implements TripleHandler {
    +
    +    private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class);
    +    private TripleHandler underlying;
    +    private static boolean isDocumentFinish = false;
    +
    +    private static class ContextHandler {
    +        ContextHandler(ExtractionContext ctx, Model m) {
    +            extractionContext = ctx;
    +            extractionModel = m;
    +        }
    +        ExtractionContext extractionContext;
    +        Model extractionModel;
    +    }
    +
    +    private static class WorkflowContext {
    +        WorkflowContext(TripleHandler underlying) {
    +            this.rootHandler = underlying;
    +        }
    +
    +
    +        Stack<String> extractors = new Stack<>();
    +        Map<String, ContextHandler> modelMap = new TreeMap<>();
    +        IRI documentIRI = null;
    +        TripleHandler rootHandler ;
    +    }
    +
    +    public BufferedTripleHandler(TripleHandler underlying) {
    +        this.underlying = underlying;
    +
    +        // hide model in the thread
    +        WorkflowContext wc = new WorkflowContext(underlying);
    +        BufferedTripleHandler.workflowContext.set(wc);
    +    }
    +
    +    private static final ThreadLocal<WorkflowContext> workflowContext = new ThreadLocal<>();
    --- End diff --
    
    Yes, I agree with you. The idea is that these models should be presented later on (inside SingleDocumentWriter) to the ModelExtractor and contain the parsing outcome of the previous extractors. Access to those models is not propagated further down from Rover, at least not without changing the API. I thought about adding them to the extraction parameters.


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217041300
  
    --- Diff: core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java ---
    @@ -295,6 +294,12 @@ public SingleDocumentExtractionReport run(ExtractionParameters extractionParamet
             } finally {
     	        try {
     	            output.endDocument(documentIRI);
    +
    +	            // in case of workflow flag release data from model
    +                if (extractionParameters.getFlag(ExtractionParameters.EXTRACTION_WORKFLOWS_FLAG)) {
    +                    BufferedTripleHandler.releaseModel();
    --- End diff --
    
    This should not be a static call.


---

[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/121#discussion_r217042421
  
    --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
    @@ -0,0 +1,161 @@
    +package org.apache.any23.writer;
    +
    +import com.google.common.base.Throwables;
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Model;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
    +import org.eclipse.rdf4j.model.impl.TreeModelFactory;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.Map;
    +import java.util.Stack;
    +import java.util.TreeMap;
    +
    +/**
    + * Collects all statements until end document.
    + *
    + * All statements are kept within {@link Model}.
    + *
    + * @author Jacek Grzebyta (jgrzebyta@apache.org)
    + */
    +public class BufferedTripleHandler implements TripleHandler {
    +
    +    private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class);
    +    private TripleHandler underlying;
    +    private static boolean isDocumentFinish = false;
    +
    +    private static class ContextHandler {
    +        ContextHandler(ExtractionContext ctx, Model m) {
    +            extractionContext = ctx;
    +            extractionModel = m;
    +        }
    +        ExtractionContext extractionContext;
    +        Model extractionModel;
    +    }
    +
    +    private static class WorkflowContext {
    +        WorkflowContext(TripleHandler underlying) {
    +            this.rootHandler = underlying;
    +        }
    +
    +
    +        Stack<String> extractors = new Stack<>();
    +        Map<String, ContextHandler> modelMap = new TreeMap<>();
    +        IRI documentIRI = null;
    +        TripleHandler rootHandler ;
    +    }
    +
    +    public BufferedTripleHandler(TripleHandler underlying) {
    +        this.underlying = underlying;
    +
    +        // hide model in the thread
    +        WorkflowContext wc = new WorkflowContext(underlying);
    +        BufferedTripleHandler.workflowContext.set(wc);
    +    }
    +
    +    private static final ThreadLocal<WorkflowContext> workflowContext = new ThreadLocal<>();
    +
    +    /**
    +     * Returns model which contains all other models.
    +     * @return
    +     */
    +    public static Model getModel() {
    +        return BufferedTripleHandler.workflowContext.get().modelMap.values().stream()
    +                .map(ch -> ch.extractionModel)
    +                .reduce(new LinkedHashModelFactory().createEmptyModel(), (mf, exm) -> {
    +                    mf.addAll(exm);
    +                    return mf;
    +                });
    +    }
    +
    +    @Override
    +    public void startDocument(IRI documentIRI) throws TripleHandlerException {
    +        BufferedTripleHandler.workflowContext.get().documentIRI = documentIRI;
    +    }
    +
    +    @Override
    +    public void openContext(ExtractionContext context) throws TripleHandlerException {
    +        //
    +    }
    +
    +    @Override
    +    public void receiveTriple(Resource s, IRI p, Value o, IRI g, ExtractionContext context) throws TripleHandlerException {
    +        getModelForContext(context).add(s,p,o,g);
    +    }
    +
    +    @Override
    +    public void receiveNamespace(String prefix, String uri, ExtractionContext context) throws TripleHandlerException {
    +        getModelForContext(context).setNamespace(prefix, uri);
    +    }
    +
    +    @Override
    +    public void closeContext(ExtractionContext context) throws TripleHandlerException {
    +        //
    +    }
    +
    +    @Override
    +    public void endDocument(IRI documentIRI) throws TripleHandlerException {
    +        BufferedTripleHandler.isDocumentFinish = true;
    +    }
    +
    +    @Override
    +    public void setContentLength(long contentLength) {
    +        underlying.setContentLength(contentLength);
    +    }
    +
    +    @Override
    +    public void close() throws TripleHandlerException {
    +        underlying.close();
    +    }
    +
    +    /**
    +     * Releases content of the model into underlying writer.
    +     */
    +    public static void releaseModel() throws TripleHandlerException {
    +        if(!BufferedTripleHandler.isDocumentFinish) {
    +            throw new RuntimeException("Before releasing document should be finished.");
    +        }
    +
    +        WorkflowContext workflowContext = BufferedTripleHandler.workflowContext.get();
    +
    +        String lastExtractor = ((Stack<String>) workflowContext.extractors).peek();
    --- End diff --
    
    Feels hacky... what if not all of the triples came from the same extractor?


---

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

Posted by HansBrende <gi...@git.apache.org>.

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/121
  
    Now that ANY23-396 has been implemented in #122 and merged into master, can we close this PR? @lewismc ? @jgrzebyta ? I don't have the required permissions to close issues myself.


---