You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@any23.apache.org by lewismc <gi...@git.apache.org> on 2016/04/06 21:50:17 UTC

[GitHub] any23 pull request: Initial move towards addressing ANY23-280 Refa...

GitHub user lewismc opened a pull request:

    https://github.com/apache/any23/pull/24

    Initial move towards addressing ANY23-280 Refactor ContentExtractor to improve extraction flexibility

    Hi Folks,
    This is an initial crack at addressing https://issues.apache.org/jira/browse/ANY23-280
    Essentially, the main API difference is the complete removal of ```public interface ContentExtractor extends Extractor<InputStream>``` from the Extractor interface in the api module.
    This patch has a long way to go with numerous failing tests however I wanted to post it for feedback.
    Although Any23 still builds with -DskipTests, without that flag the failing tests are as follows
    ```
    Results :
    
    Failed tests:
      Any23Test.testDemoCodeSnippet1:201
      Any23Test.testN3Detection1:92->assertDetection:661
      Any23Test.testN3Detection2:97->assertDetection:661
      Any23Test.testTTLDetection:87->assertDetection:661
      RoverTest.testRunMultiURLs:104->runWithMultiSourcesAndVerify:134 Unexpected number of statements.
    Tests in error:
      Any23Test.testProgrammaticExtraction:279 » NullPointer
    CSVExtractorTest.testExtractionCommaSeparated:49->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
    CSVExtractorTest.testExtractionEmptyValue:112->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
    CSVExtractorTest.testExtractionSemicolonSeparated:64->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
    CSVExtractorTest.testExtractionTabSeparated:79->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
    CSVExtractorTest.testTypeManagement:94->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
    RDFa11ExtractorTest>AbstractRDFaExtractorTestCase.testDrupalTestPage:124->AbstractExtractorTestCase.assertExtract:217->AbstractExtractorTestCase.assertExtract:200->AbstractExtractorTestCase.extract:185 » NullPointer
    RDFaExtractorTest>AbstractRDFaExtractorTestCase.testDrupalTestPage:124->AbstractExtractorTestCase.assertExtract:217->AbstractExtractorTestCase.assertExtract:200->AbstractExtractorTestCase.extract:185 » NullPointer
    Tests run: 403, Failures: 5, Errors: 8, Skipped: 11
    ```
    You will see that some of the tests concern https://issues.apache.org/jira/browse/ANY23-267 as well.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lewismc/any23 ANY23-280

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/24.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #24
    
----
commit 801f2f93967bfd1295700223085eef3f54181517
Author: Lewis John McGibbney <le...@jpl.nasa.gov>
Date:   2016-04-06T19:44:35Z

    Initial move towards addressing ANY23-280 Refactor ContentExtractor to improve extraction flexibility

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/any23/pull/24#discussion_r134831365
  
    --- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
    @@ -39,22 +38,6 @@
     
         /**
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link java.io.InputStream} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor<InputStream> {
    --- End diff --
    
    @jgrzebyta yes this is correct... we do not always wish to assume that the input is structured in XML or a subset thereof... syntax-strict extractors are prone to breakage. Our aim in Any23 should be to provide flexibility in the extraction logic rather than a strict, fragile extraction logic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 issue #24: Initial move towards addressing ANY23-280 Refactor ContentE...

Posted by ansell <gi...@git.apache.org>.
Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/24
  
    If we are going to be modifying the public API we probably should be aiming for a 2.0 release, otherwise the version numbers are arbitrary


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 issue #24: Initial move towards addressing ANY23-280 Refactor ContentE...

Posted by jgrzebyta <gi...@git.apache.org>.
Github user jgrzebyta commented on the issue:

    https://github.com/apache/any23/pull/24
  
    @lewismc  Regarding new low level interface is it planned any higher level interface? I mean something what might be useful to create RDF graph fulfilling a custom ontology from raw rdf graph. For example there is csv -> rdf extractor. But in practice that low level rdf should be converted to the final one using at least one construct type SPARQL query. I thought it might be possible to process that using programmable API. Unfortunately RDF4J QueryBuilder API supports only simple queries.


---

[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc closed the pull request at:

    https://github.com/apache/any23/pull/24


---

[GitHub] any23 issue #24: Initial move towards addressing ANY23-280 Refactor ContentE...

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/24
  
    > There are a large number of whitespace modifications to change from tabs to 2-space indentation. Is 2-space indentation what Any23 is aiming for, given that most java code is either tab or 4-space indentation.
    
    I'll revert these changes to 4 spaces as per remainder of codebase and force an update to this PR.
    
    > If we are going to be modifying the public API we probably should be aiming for a 2.0 release, otherwise the version numbers are arbitrary
    
    I would have no issues with this as all... it is a v good suggestion.
    
    > Given how broad this pull request is, it needs to be completed before I can work on some of the issues I have assigned to me.
    
    Agreed. I'll put some time in to it this week and see if I can complete it, stabilize tests and update the PR for review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 pull request: Initial move towards addressing ANY23-280 Refa...

Posted by lewismc <gi...@git.apache.org>.
Github user lewismc commented on the pull request:

    https://github.com/apache/any23/pull/24#issuecomment-212761670
  
    Anyone get a chance to take a look at this? This is THE critical issue for Any23 to address right now IMHO.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...

Posted by jgrzebyta <gi...@git.apache.org>.
Github user jgrzebyta commented on a diff in the pull request:

    https://github.com/apache/any23/pull/24#discussion_r134738201
  
    --- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
    @@ -39,22 +38,6 @@
     
         /**
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link java.io.InputStream} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor<InputStream> {
    --- End diff --
    
    @lewismc Why do you remove `ContentExtractor`? I assume that In case if content is neither html nor xml type that developer should create new extractor extending `Exctractor<Input>` directly. Am I right? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...

Posted by ansell <gi...@git.apache.org>.
Github user ansell commented on a diff in the pull request:

    https://github.com/apache/any23/pull/24#discussion_r67792344
  
    --- Diff: plugins/office-scraper/src/test/java/org/apache/any23/plugin/officescraper/ExcelExtractorTest.java ---
    @@ -40,6 +42,8 @@
     import org.slf4j.Logger;
     import org.slf4j.LoggerFactory;
     
    +import com.fasterxml.jackson.databind.introspect.WithMember;
    --- End diff --
    
    This import seems to be unused


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 issue #24: Initial move towards addressing ANY23-280 Refactor ContentE...

Posted by ansell <gi...@git.apache.org>.
Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/24
  
    Given how broad this pull request is, it needs to be completed before I can work on some of the issues I have assigned to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] any23 issue #24: Initial move towards addressing ANY23-280 Refactor ContentE...

Posted by ansell <gi...@git.apache.org>.
Github user ansell commented on the issue:

    https://github.com/apache/any23/pull/24
  
    There are a large number of whitespace modifications to change from tabs to 2-space indentation. Is 2-space indentation what Any23 is aiming for, given that most java code is either tab or 4-space indentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---