You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@any23.apache.org by HansBrende <gi...@git.apache.org> on 2018/09/14 15:51:24 UTC

[GitHub] any23 pull request #122: ANY23-396 allow mapping/filtering TripleHandlers in...

GitHub user HansBrende opened a pull request:

    https://github.com/apache/any23/pull/122

    ANY23-396 allow mapping/filtering TripleHandlers in Rover

    Here is one possible alternative to the existing PR for ANY23-396.
    
    **Pros:**
    
    1. Fully backwards compatible
    2. Extends `WriterFactory` with new `DelegatingWriterFactory` interface, which, rather than writing a `TripleHandler` to an output stream, writes a `TripleHandler` to another `TripleHandler`. This will allow users to produce a final domain-specific RDF graph of their choosing in Rover by implementing mapping/filtering `DelegatingWriterFactory` implementations. 
    3. the `--format` flag in rover now represents a list of WriterFactory ids, rather than a single WriterFactory id. Each id in the list is composed with the one previous to it to construct the final `TripleHandler`. All writers in the list, except the last, are required to implement `DelegatingTripleHandler`.
    
    **Cons:** 
    1. this solution requires deprecating 3 methods in the `WriterFactory` interface (and then un-deprecating them in the extending `FormatWriterFactory` interface.) However, this drawback does not affect backwards compatibility. 
    
    ## ALTERNATIVE
    
    In order to avoid the single "con" I have listed, the alternative to this solution would be, rather than extending the `WriterFactory` interface with `DelegatingWriterFactory`, to keep these two interfaces completely separate and define a new `DelegatingWriterFactoryRegistry` (analogous to the `WriterFactoryRegistry`) with a different `ServiceLoader` in order to load `DelegatingWriterFactory` implementations.
    
    @jgrzebyta @lewismc Thoughts? 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HansBrende/any23 ANY23-396

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #122
    
----
commit cb293e22b9352652d91474cd3e35233e75dc9fb9
Author: Hans <fi...@...>
Date:   2018-09-14T15:29:33Z

    ANY23-396 allow mapping/filtering TripleHandlers in Rover

----


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc , great to hear!
    
    After I do a bit of last-minute cleanup, I will merge this PR.
    
    Cleanup item 1: I'm renaming `TripleFormat.ExtendedCapabilities` to `TripleFormat.FineCapabilities`, as I think the first name is a slightly misleading. 
    
    Cleanup item 2: For now, I'm removing the newly added support for configuring a writer's charset via `Settings` (although we could take another look at doing this in a future issue), for 3 reasons:  
    
    1. Some XML-based writers hard-code a "encoding=utf8" declaration which might then conflict with user-supplied charset and produce an invalid document.
    2. If the user sets a writer's charset to US-ASCII or similar, that could create a problem if the writer doesn't support escaping non-ascii characters. (To my knowledge, only the `NTriplesWriter` and `NQuadsWriter` support this.)
    3. The default charset for every existing writer is already UTF-8, and I can't think of a good reason to support anything else.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc One quick question for you before I finish up here: according to RDF specifications, graph names are allowed to be blank nodes, but it appears that Any23 only supports graph names that are IRIs. (Whereas RDF4J supports graph names that are blank nodes *or* IRIs. It appears Any23 silently drops any parsed graph names that are blank nodes rather than IRIs.)
    
    Is there a reason for this? Any historical context you can give me on why Any23 opted to not support BNode graph names? Should we lean towards supporting this in the future?


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc I just added some unit tests & javadoc for the `Settings` API. Let me know your thoughts!


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    On second thought, I'm thinking about making `TripleWriter` a class in `core` rather than an interface in `api`. It would have the same effect as before, except it would allow more freedom of implementation from the api perspective.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    For the time being, I've opted for the third option for (4) (making the public methods I added to `WriterFactoryRegistry` become `private static` methods in `Rover`) so that I don't have to deal with that naming issue in this PR. If we want to add extra utility methods in `WriterFactoryRegistry`, that can be the subject of a different JIRA issue.
    
    However, I did have to fix a couple of synchronization issues in `WriterFactoryRegistry` to accomplish this: I noticed that iterating through the list of writer factories returned by `WriterFactoryRegistry.getWriters()` could potentially throw a `ConcurrentModificationException` even though that method was marked `synchronized` (because, unless I am mistaken, the underlying list implementation *can* be modified after access to the list is given to a caller and the method returns). To fix this problem, I changed the implementation of the backing list of writers from `ArrayList` to `CopyOnWriteArrayList`, which guarantees thread safety for iterators. Since writes to `CopyOnWriteArrayList` are relatively expensive, I also changed the logic around a bit to use *batch writing*, i.e., registering all `WriterFactory` instances at once in a `registerAll()` method, rather than through consecutive invocations of the `register()` method. Similar issues existed for the methods to retrieve id
 entifiers and mime types, which I fixed in the same manner.
    
    With this last commit, I am now satisfied, personally, with my implementation of ANY23-396.
    
    Anything else, @lewismc @jgrzebyta ?


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/122
  
    > The only reason I've hesitated so far about merging this PR is that once new interfaces are introduced as part of the core API, I'd prefer to never change them again--so I want to get it right the first time!
    
    Totally understandable
    
    I really like the idea of improving the interoperability logic here. 
    
    > The reason being: shouldn't all return types...be, preferably, part of our own API, rather than RDF4J's?
    
    Yes that is a very fair statement. What, for example happens if and when the RDF4J project ceases to exist? Then of course we would end up implementing exactly what you are suggesting. 
    
    I am certainly +1 to a TripleFormat API which is somewhat analogous and certainly interoperable with RDF4J's [RDFFormat](http://docs.rdf4j.org/javadoc/latest/index.html?org/eclipse/rdf4j/rio/RDFFormat.html). Very nice @HansBrende 
    



---

[GitHub] any23 pull request #122: ANY23-396 allow mapping/filtering TripleHandlers in...

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on a diff in the pull request:

    https://github.com/apache/any23/pull/122#discussion_r226702275
  
    --- Diff: api/src/main/java/org/apache/any23/configuration/Setting.java ---
    @@ -118,7 +128,22 @@ private Type getValueType() {
                 }
             }
     
    -        protected abstract V checkedValue(Setting<V> original, V newValue) throws Exception;
    +        /**
    +         * Subclasses may override this method to check that new settings for this key are valid,
    +         * and/or to decorate new setting values, using, for example, {@link Collections#unmodifiableList(List)}.
    +         * The default implementation of this method throws a {@link NullPointerException} if the new value is null and the initial value was non-null.
    +         *
    +         * @param initial the setting containing the initial value for this key, or null if the setting has not yet been initialized
    +         * @param newValue the new value for this setting
    +         * @return the new value for this setting
    +         * @throws Exception if the new value for this setting was invalid
    +         */
    +        protected V checkedValue(Setting<V> initial, V newValue) throws Exception {
    +            if (newValue == null && initial != null && initial.value != null) {
    +                throw new NullPointerException();
    +            }
    +            return newValue;
    +        }
    --- End diff --
    
    Actually, we should not allow keys to decorate values. Consider the following scenario: user copies the value from one setting into another setting. Now the key is decorating a value that has *already been decorated*. This could lead to an unfortunate chain of, e.g., 
    ```
    Collections.unmodifiableList(Collections.unmodifiableList(Collections.unmodifiableList(... )))
    ```
    
    Therefore, any decorating should happen *before* the setting is created, and if the value is not appropriately decorated, the key should throw an exception in the value check. 
    
    Also we should change this method's signature to:
    ```
    protected void checkValue(Setting<V> initial, V newValue) throws Exception
    ```


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc the reason I ask is that my **cleanup item 4** allows me to specify a new method in `TripleWriter` which accepts a group as a `Resource` rather than as an `IRI`. That's what I've done in my latest cleanup commit. 
    
    I'll leave this PR open for at least another day before merging to master just in case anyone comes up with any further comments or concerns.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    While we're deprecating things anyway, there are a few notes of interest which we should mull over before any merges into master happen here:
    
    1. The method `WriterFactory.getMimeType()` appears to be redundant, as there also exists the `WriterFactory.getRdfFormat().getDefaultMimeType()`.
    
    2. Also note the presence of `FileFormat`, the superclass of `RDFFormat`, which we could possibly use to make the `FormatWriterFactory` interface more generic (possibly helpful for the `URIListWriterFactory`, and other writer factories which similarly do not print RDF triples as output).
    
    3. The method `WriterFactory.getRdfFormat()` is never actually used anywhere in the Any23 project.
    
    4. `JSONWriterFactory` and `URIListWriterFactory` both throw `RuntimeException` in the `getRdfFormat()` method (the former questionably so, since there exists the `RDFFormat.RDFJSON` file format).


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    As per part 4 of my last comment, in my most recent commit I've allowed `FormatWriterFactory` and `DelegatingWriterFactory` to extend the same (package-private) base interface specifying a single generic method:
    
    ```java
    interface BaseWriterFactory<Output> extends WriterFactory {
        TripleHandler getTripleWriter(Output o);
    }
    ```
    
    I could have added this method directly to `WriterFactory` (with a default implementation of throwing `UnsupportedOperationException`), but since all instances of this interface *must* be instances of either `FormatWriterFactory` or `DelegatingWriterFactory` (since the interface is package-private), and all interaction with this method will be done by casting to one of these two interfaces, adding generic arguments to `WriterFactory` itself would have only added unnecessary verbosity (e.g., always having to specify `WriterFactory<?>` instead of `WriterFactory` to avoid rawtypes warnings).
    
    @lewismc any comments?


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    My last commit reflects the notes of interest I mentioned in my last comment.
    1. Since `WriterFactory.getMimeType()` is redundant and I had to deprecate it anyway to make this PR work, I've simply opted to *not* un-deprecate it in the extending `FormatWriterFactory`. To retrieve the MIME type of a `FormatWriterFactory` instance, simply call `getFormat().getDefaultMIMEType()`. However, to keep new implementations of `FormatWriterFactory` backwards compatible with the older behavior, I've simply added the following default implementation of `getMimeType()` in the `FormatWriterFactory` interface:
    
    ```java
    @Override
    @Deprecated
    default String getMimeType() {
        return getFormat().getDefaultMIMEType();
    }
    ```
    
    2. Since not all implementations of `FormatWriterFactory` print RDF triples (case in point: `URIListWriterFactory`), the deprecation of `WriterFactory.getRdfFormat()` presents us with the perfect opportunity to make the return type of `getRdfFormat()` more generic in `FormatWriterFactory` (namely, using `FileFormat`, the superclass of `RDFFormat`, instead of `RDFFormat`). To accomplish this, I've simply opted to *not* un-deprecate the `getRdfFormat()` method in the `FormatWriterFactory` interface, and instead, add the following method:
    
    ```java
    FileFormat getFormat();
    ```
    
    To keep everything backwards compatible with the previous behavior, I've added the following default implementation of `getRdfFormat()` to the `FormatWriterFactory` interface:
    
    ```java
    @Override
    @Deprecated
    default RDFFormat getRdfFormat() {
        FileFormat f = getFormat();
        if (f instanceof RDFFormat) {
            return (RDFFormat)f;
        } else {
            throw new UnsupportedOperationException("This class does not print RDF triples.");
        }
    }
    ```
    Now the `URIListWriterFactory` can utilize the method `getFormat()`, instead of its previous behavior of throwing a `RuntimeException`. To that effect, I've opted to return the following `FileFormat` from `URIListWriterFactory.getFormat()`:
    
    ```java
    private static final FileFormat FORMAT = new FileFormat("PLAINTEXT", "text/plain", 
                                               StandardCharsets.UTF_8, "txt");
    @Override
    public FileFormat getFormat() {
        return FORMAT;
    }
    ```
    
    3. Since the `FormatWriterFactory` interface is now not only tasked with `RDFFormat`s, but also arbitrary `FileFormat`s, deprecating the `WriterFactory.getRdfWriter(OutputStream)` method presents us with the perfect opportunity to choose a more appropriate name for this method in the subinterface `FormatWriterFactory`. To this effect, I've opted to simply *not* un-deprecate the `FormatWriterFactory.getRdfWriter(OutputStream)` method, and instead choose a more appropriate name. The name I've provisionally opted for is:
    
    ```java
    FormatWriter getFormatWriter(OutputStream);
    ```
    To keep everything backwards compatible, I've added the following default implementation of `FormatWriterFactory.getRdfWriter(OutputStream)`:
    
    ```java
    @Override
    @Deprecated
    default FormatWriter getRdfWriter(OutputStream os) {
        return getFormatWriter(os);
    }
    ```
    
    4. Finally, I have one further question for discussion:
    We could use this deprecation opportunity to further genericize the `FormatWriterFactory.getFormatWriter(OutputStream)` method, replacing:
    ```java
    FormatWriter getFormatWriter(OutputStream);
    ```
    with:
    ```java
    TripleHandler getWriter(OutputStream);
    ```
    which would allow `FormatWriterFactory` implementations to return arbitrary `TripleHandler`s instead of forcing them to return the more specific (but arguably *not* more useful) `FormatWriter` implementations. Where behavior specific to `FormatWriter` is actually needed, e.g. `FormatWriter.isAnnotated()` (a method which is actually *never* used anywhere in Any23), a check could be added as follows: 
    ```java
    boolean isAnnotated(TripleHandler writer) {
        return writer instanceof FormatWriter ? ((FormatWriter)writer).isAnnotated() : false;
    }
    ```
    
    One additional benefit of doing this would be that `DelegatingWriterFactory` and `FormatWriterFactory` could both then extend some base interface as follows:
    ```
    interface BaseWriterFactory<Output> {
        TripleHandler getWriter(Output);
    }
    interface FormatWriterFactory extends BaseWriterFactory<OutputStream> {
        ...
    }
    interface DelegatingWriterFactory extends BaseWriterFactory<TripleHandler> {
        ...
    }
    ```
    
    @lewismc Any comments?


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc do you have any concerns or questions regarding my latest commit? Would love to hear your thoughts.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    **Implementation note:** I considered using the existing `Configuration` interface to construct `TripleWriter` instances, but it seemed rather limited, in that settings are only validated when they are first used, rather than *failing fast*, and they are all stored as strings rather than the actual parsed objects they represent. This is good for settings imported from a config file or loaded from the command line, but not very easy, type-safe, or performant for programmatic configuration.
    
    So instead, I created `Settings`, which could be considered a type-safe version of `Configuration`, or a *parsed* configuration. In the future, we could add the ability to create a `Settings` object *from* a `Configuration` object, given a set of supported settings and a configuration parser. In a future PR, I'm planning to implement a similar concept for `Rover`, so that a `Settings` object can be parsed from the command line for each writer. E.g., instead of having, simply:
    
    ```
    --format mycustomdecorator,notrivial,turtle
    ```
    we could do something like:
    ```
    --format mycustomdecorator,notrivial;alwayssuppresscsstriples=true,turtle;prettyprint=true
    ```


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc I've implemented a few new things here for the new `TripleWriterFactory` (what I used to call `FormatWriterFactory`). Most important of these, in my opinion, being `Settings`, which allows you to configure writers. (Analogous to rdf4j's `RioSetting` api, but with several improvements.) The `Settings` capability will be able to replace the existing solution for ANY23-388 (PR #117 ). Also, we'll finally be able to allow users to turn off pretty printing if they so choose, or any other configuration option they desire. (E.g., when we upgrade to rdf4j 2.4.0, we can add a "hierarchical" settings option for the new hierarchical JSON-LD printing ability.)
    
    Then there's the new `TripleFormat` class, analogous to rdf4j's `RDFFormat` class with a few improvements (one being a "characteristics" flag which allows a much broader range of boolean characteristics to be specified than the 2 in `RDFFormat`.)
    
    I'm also deprecating the `FormatWriter` interface (which is nearly useless as it stands--and could be replaced in the future with a simple `AnnotatingDelegatingWriter`) in favor of the new `TripleWriter` interface (which extends `FormatWriter` for backwards compatibility, but introduces methods that are more useful).
    
    Let me know what you think!


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    Thank you, @lewismc .
    
    The only item of concern left from my perspective is *naming*. Should any of the new `public` interfaces/methods I have created be named differently, or are they adequately descriptive as they currently stand? This decision should be made now, as there is no going back.
    
    Here follows the names of all the new `public` methods/interfaces I have created in this PR:
    1. `public interface `**`FormatWriterFactory`**
    > Is this descriptive enough? It does specify a `FileFormat getFormat()` method, returning the format which will be written to the output stream, so the name does still make sense even though we now return a `TripleHandler` rather than a `FormatWriter` from the `getTripleWriter(OutputStream)` method. On the other hand, we could also call it `ContentWriterFactory` in line with the existing `ContentExtractor` interface (although I'm not sure if that would make it any more descriptive). Another possibility would be `OutputStreamWriterFactory`.
    2. `public interface`**`DelegatingWriterFactory`**
    > Alternatives include `CompositeWriterFactory` or `FilterWriterFactory` (similar to `java.io.FilterOutputStream`).
    3. `TripleHandler`**`getTripleWriter(Output)`** (specified in the `BaseWriterFactory<Output>` interface)
    > Alternatives include `getTripleHandler` or simply `getWriter`. I chose `getTripleWriter` over `getWriter` because it seemed more descriptive, and to avoid confusion with the `java.io.Writer` class.
    4. `TripleHandler`**`getWriter(id, output)`** and `TripleHandler`**`getDefaultWriter(OutputStream)`** (specified in `WriterFactoryRegistry`).
    > This one is confused by the fact that `WriterFactoryRegistry` already uses the term "writer" to refer to *`WriterFactory`* instances (e.g. `List<WriterFactory> getWriters()` and `WriterFactory getWriterByIdentifier(String id)`). An easy alternative would be to take a hint from the existing, now-deprecated method `FormatWriter getWriterInstanceByIdentifier(id, output)` and use "**writerInstance**" to refer to a triple handler, i.e., `TripleHandler getWriterInstance(id, output)` and `getDefaultWriterInstance(OutputStream)`. Alternatively, we could use `getTripleWriter(id, output)` and `getDefaultTripleWriter(OutputStream)`.
    
    Any suggestions, or better names that I haven't thought of, @lewismc ? @jgrzebyta ?


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/122
  
    > would you prefer I squash my commits before merging, or just merge everything in as-is?
    
    squash would be great :)


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by jgrzebyta <gi...@git.apache.org>.
Github user jgrzebyta commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @HansBrende  Currently my other project which uses ANY23 is hanged and I am busy now with other thing. You have +1 from me now (after adding documentation) and I will add my unit in separate ticket.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc If you don't have any more comments, I'm ready to merge this in.
    
    One question: would you prefer I squash the commits before merging, or not?


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @lewismc One possibility that I'm considering right now is using this opportunity to define our own `TripleFormat` class analogous to rdf4j's `RDFFormat` (such that any `TripleFormat` could be converted to a `RDFFormat` if desired), and then changing the method signature of `RDFFormat getFormat()` to `TripleFormat getFormat()`.
    
    The reason being: shouldn't all return types of methods (aside from the ubiquitous `IRI`, `BNode`, etc.) in new interfaces (e.g. `FormatWriterFactory`) be, preferably, part of our own API, rather than RDF4J's? Having our own `TripleFormat` class would give us more control over our own API. For example, suppose we were to add the following default method to the `TripleHandler` interface:
    
    ```java
    default void handleComment(String comment, ExtractionContext context) {
        //default implementation = do nothing
    }
    ```
    And then we wanted to add a `supportsComments` flag to the format returned by `FormatWriterFactory.getFormat()` (which we could set to `true` for, e.g., the `TurtleWriter`). Well, if we're using RDF4J's `RDFFormat` class, we could log an issue in RDF4J asking them to add that additional parameter, but we're pretty much at their mercy. However, if we had our own `TripleFormat` class, we could add an additional `TripleFormat` constructor with a `boolean supportsComments` parameter (and a default value of `false`).
    
    What do you think about this? 
    
    (The only reason I've hesitated so far about merging this PR is that once new interfaces are introduced as part of the core API, I'd prefer to never change them again--so I want to get it right the first time!)


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    A third option for (4) is to simply defer this decision to another day by removing the methods I added to `WriterFactoryRegistry` and adding them directly to `Rover` as private methods. This option is also tempting.


---

[GitHub] any23 pull request #122: ANY23-396 Overhaul WriterFactory API

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/any23/pull/122


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    @jgrzebyta thanks for your +1. Is the improved state of the javadoc to your liking?
    
    > I will add my unit in a separate ticket.
    
    I'm confused: are you referring to your existing PR #121 ? This PR is meant to be an alternative to that one. In my last commit, I have also added your [`ExtractorsFlowTest`](https://github.com/apache/any23/blob/f95f23865c0a7088e4ab1cbe507b8457fc90dda5/cli/src/test/java/org/apache/any23/cli/ExtractorsFlowTest.java) proof-of-concept to this PR to clarify that this PR provides at least as much functionality as #121 does. The only difference being: this PR uses the new `DelegatingWriterFactory` to accomplish the same behavior previously provided by `ModelExtractor` in #121. So if you approve this PR, my assumption would be that you prefer it over #121, and that #121 would be discarded.
    
    Any additional comments or concerns? Do I still have your +1?



---

[GitHub] any23 pull request #122: ANY23-396 allow mapping/filtering TripleHandlers in...

Posted by jgrzebyta <gi...@git.apache.org>.
Github user jgrzebyta commented on a diff in the pull request:

    https://github.com/apache/any23/pull/122#discussion_r217989655
  
    --- Diff: api/src/main/java/org/apache/any23/writer/FormatWriterFactory.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *  http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.any23.writer;
    +
    +import org.apache.any23.extractor.ExtractionContext;
    +import org.eclipse.rdf4j.common.lang.FileFormat;
    +import org.eclipse.rdf4j.model.IRI;
    +import org.eclipse.rdf4j.model.Resource;
    +import org.eclipse.rdf4j.model.Value;
    +import org.eclipse.rdf4j.rio.RDFFormat;
    +
    +import java.io.OutputStream;
    +
    +/**
    --- End diff --
    
    @HansBrende Please add some java doc. 


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/122
  
    I've reviewed this patch a few times now and I think it is looking great. The above thread is very helpful to see how these API's changes cam about so thank you for the in-depth thought process you've engaged in. It makes a huge difference.
    Pull locally, tested, evaluated again.... I am +1 to merge.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/122
  
    I think this looks great. I have no further comments for improvement. I think the additions looks great.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    Another (more drastic) option for (4) would be to deprecate `getWriters()` and `getWriterByIdentifier(String id)`, and create the replacement methods `getWriterFactories()` and `getWriterFactoryByIdentifier(String id)` (or simply, `getWriterFactory(String id)`.)
    
    Then we would be free to call writer instances "writers", and could leave the method names how they currently stand, namely: `getWriter(id, output)` and `getDefaultWriter(output)`


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    I've taken another look at the `RDFFormat` class, and it turns out that we don't really need the new method: `FileFormat getFormat()` because any `RDFFormat` can be converted from a `FileFormat` by constructing it with a `null` standard URI, and setting both "supports namespaces" and "supports contexts" to `false`. This should be applicable to any writer, even those that don't print out a standardized RDF format. E.g., in the `URIListWriter` class, "supports namespaces" and "supports contexts" are clearly false since the class only writes out subjects; but does not write out predicates, objects, namespaces, or contexts.
    
    Therefore, I think I'm going to drop the new `FileFormat getFormat()` method and retain the `RDFFormat getRdfFormat()` method.


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    > @HansBrende.  Sorry, I did not follow your latest commits. My proof of
    concept was done for different approach so I thought it would be useless in
    your one. I had thought I would need to write the new one. But if you
    reused so I am happy of that. You have still my +1.
    
    @jgrzebyta glad to hear I still have your +1! Yes, although I used your original unit tests, I did have to modify the way they were implemented. Here are the implementation changes I made:
    1. `ExtractorsFlowTest` diff: https://www.diffchecker.com/pPGAQxE6
    2. `PeopleExtractorFactory` diff: https://www.diffchecker.com/Mn7XTZOB
    3. `PeopleExtractor` diff: https://www.diffchecker.com/x4du9RqE



---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    I've added a proof-of-concept unit test, which deprecates the `--notrivial` flag and instead makes that the identifier for a `DelegatingWriterFactory`. Now you can simply specify:
    ```shell
    --format notrivial,nquads
    ```
    
    @lewismc any additional comments or concerns?
    
    @jgrzebyta can you please verify whether or not this PR will satisfy your use-case for [ANY23-396](https://issues.apache.org/jira/browse/ANY23-396)? Any additional comments or concerns?



---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by HansBrende <gi...@git.apache.org>.
Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/122
  
    Alright, as far as I'm concerned, all my cleanup is done here.
    
    @lewismc if you have no further comments, would you prefer I squash my commits before merging, or just merge everything in as-is?


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/122
  
    I like this.
    As you've pointed out the deprecation is a bit harsh but it is certainly something we can live with. I would however urge you to add additional documentation e.g.
    ```
    /**
         * @deprecated
         * explanation of why function was deprecated, if possible include what 
         * should be used.
         */
    ```
    This looks really nice. 


---

[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/122
  
    +1 from me


---