You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jena.apache.org by osma <gi...@git.apache.org> on 2016/03/30 10:35:58 UTC

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

GitHub user osma opened a pull request:

    https://github.com/apache/jena/pull/131

    JENA-1134: support AnalyzingQueryParser in jena-text

    This PR makes it possible to select either the standard Lucene QueryParser or the AnalyzingQueryParser using jena-text configuration like this:
    
    ```
    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:queryParser text:AnalyzingQueryParser ;
        text:queryAnalyzer [
            a text:ConfigurableAnalyzer ;
            text:tokenizer text:KeywordTokenizer ;
            text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
        ] 
        text:entityMap <#entMap> ;
    ```
    
    The main difference between these query parsers is that AnalyzingQueryParser performs analysis also for wildcard queries. For example, if you use ASCIIFoldingFilter as above, if you want a search for `édu*` to match `éducation` you need AnalyzingQueryParser.
    
    One problem I had with the implementation is that the query parser needs to be constructed dynamically for every query, so I need to store the information about which query parser to use instead of just storing the QueryParser/AnalyzingQueryParser instance directly. I solved this by simply storing the type of query parser as a string, i.e. either `"QueryParser"` or `"AnalyzingQueryParser"`, and then dynamically construct the correct type of parser based on this information. I'm sure there are more elegant ways of doing this, e.g. creating Factories for each parser type and saving the correct kind of Factory, but I don't want to overengineer. Opinions?
    
    This could rather easily be extended to other query parser types supported by Lucene, though I'm unsure how useful that would be in practice. ComplexPhraseQueryParser and/or PrecedenceQueryParser could perhaps be useful to somebody.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/osma/jena jena-text-queryparser

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #131
    
----
commit 547d4ac64e4331e45cba96e045345a5f3ab214a7
Author: Osma Suominen <os...@apache.org>
Date:   2016-03-29T14:23:27Z

    simplify parseQuery and preParseQuery: get rid of primaryField argument as it is always the same

commit 22a81f8cbc9498cbe4f1970115aa32f9c21fb239
Author: Osma Suominen <os...@apache.org>
Date:   2016-03-30T08:09:02Z

    JENA-1134: basic support for AnalyzingQueryParser

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/131#discussion_r57891059
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestDatasetWithAnalyzingQueryParser.java ---
    @@ -0,0 +1,64 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import java.util.Set ;
    +
    +import org.apache.jena.atlas.lib.StrUtils ;
    +import org.apache.jena.ext.com.google.common.collect.Sets ;
    +import org.junit.Before ;
    +import org.junit.Test ;
    +
    +/**
    + * This class defines a setup configuration for a dataset that uses an ASCII folding lowercase keyword analyzer with a Lucene index.
    + */
    +public class TestDatasetWithAnalyzingQueryParser extends TestDatasetWithConfigurableAnalyzer {
    +    @Override
    +    @Before
    +    public void before() {
    +        init(StrUtils.strjoinNL(
    +            "text:ConfigurableAnalyzer ;",
    +            "text:tokenizer text:KeywordTokenizer ;",
    +            "text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)"
    +        ), "text:AnalyzingQueryParser");
    +    }    
    +    
    +    @Test
    +    public void testAnalyzingQueryParserAnalyzesWildcards() {
    +        final String testName = "testAnalyzingQueryParserAnalyzesWildcards";
    +        final String turtle = StrUtils.strjoinNL(
    +                TURTLE_PROLOG,
    +                "<" + RESOURCE_BASE + testName + ">",
    +                "  rdfs:label 'éducation'@fr",
    +                ".",
    +                "<" + RESOURCE_BASE + "irrelevant>",
    +                "  rdfs:label 'déjà vu'@fr",
    +                "."
    +                );
    +        String queryString = StrUtils.strjoinNL(
    +                QUERY_PROLOG,
    +                "SELECT ?s",
    +                "WHERE {",
    +                "    ?s text:query ( rdfs:label 'édu*' 10 ) .",
    +                "}"
    +                );
    +        Set<String> expectedURIs = Sets.newHashSet(RESOURCE_BASE + testName);
    --- End diff --
    
    This was copied from another, similar unit test (TestDatasetWithConfigurableAnalyzer.java). I could change both, of course, but I doubt this makes a big difference for readability.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by joelkuiper <gi...@git.apache.org>.

Github user joelkuiper commented on the pull request:

    https://github.com/apache/jena/pull/131#issuecomment-217628988
  
    \U0001f44d I kinda need ComplexPhraseQueryParser for some search, would it be good to add that too?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by osma <gi...@git.apache.org>.

Github user osma closed the pull request at:

    https://github.com/apache/jena/pull/131


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/131#issuecomment-204270064
  
    Any other comments? I will merge this some time next week unless someone objects.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/131#discussion_r57889145
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestDatasetWithAnalyzingQueryParser.java ---
    @@ -0,0 +1,64 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import java.util.Set ;
    +
    +import org.apache.jena.atlas.lib.StrUtils ;
    +import org.apache.jena.ext.com.google.common.collect.Sets ;
    +import org.junit.Before ;
    +import org.junit.Test ;
    +
    +/**
    + * This class defines a setup configuration for a dataset that uses an ASCII folding lowercase keyword analyzer with a Lucene index.
    + */
    +public class TestDatasetWithAnalyzingQueryParser extends TestDatasetWithConfigurableAnalyzer {
    +    @Override
    +    @Before
    +    public void before() {
    +        init(StrUtils.strjoinNL(
    +            "text:ConfigurableAnalyzer ;",
    +            "text:tokenizer text:KeywordTokenizer ;",
    +            "text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)"
    +        ), "text:AnalyzingQueryParser");
    +    }    
    +    
    +    @Test
    +    public void testAnalyzingQueryParserAnalyzesWildcards() {
    +        final String testName = "testAnalyzingQueryParserAnalyzesWildcards";
    +        final String turtle = StrUtils.strjoinNL(
    +                TURTLE_PROLOG,
    +                "<" + RESOURCE_BASE + testName + ">",
    +                "  rdfs:label 'éducation'@fr",
    +                ".",
    +                "<" + RESOURCE_BASE + "irrelevant>",
    +                "  rdfs:label 'déjà vu'@fr",
    +                "."
    +                );
    +        String queryString = StrUtils.strjoinNL(
    +                QUERY_PROLOG,
    +                "SELECT ?s",
    +                "WHERE {",
    +                "    ?s text:query ( rdfs:label 'édu*' 10 ) .",
    +                "}"
    +                );
    +        Set<String> expectedURIs = Sets.newHashSet(RESOURCE_BASE + testName);
    --- End diff --
    
    For a single element like this, I'd prefer `Collectons::singleton`, but that's just me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/131#discussion_r57891292
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestDatasetWithAnalyzingQueryParser.java ---
    @@ -0,0 +1,64 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import java.util.Set ;
    +
    +import org.apache.jena.atlas.lib.StrUtils ;
    +import org.apache.jena.ext.com.google.common.collect.Sets ;
    +import org.junit.Before ;
    +import org.junit.Test ;
    +
    +/**
    + * This class defines a setup configuration for a dataset that uses an ASCII folding lowercase keyword analyzer with a Lucene index.
    + */
    +public class TestDatasetWithAnalyzingQueryParser extends TestDatasetWithConfigurableAnalyzer {
    +    @Override
    +    @Before
    +    public void before() {
    +        init(StrUtils.strjoinNL(
    +            "text:ConfigurableAnalyzer ;",
    +            "text:tokenizer text:KeywordTokenizer ;",
    +            "text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)"
    +        ), "text:AnalyzingQueryParser");
    +    }    
    +    
    +    @Test
    +    public void testAnalyzingQueryParserAnalyzesWildcards() {
    +        final String testName = "testAnalyzingQueryParserAnalyzesWildcards";
    +        final String turtle = StrUtils.strjoinNL(
    +                TURTLE_PROLOG,
    +                "<" + RESOURCE_BASE + testName + ">",
    +                "  rdfs:label 'éducation'@fr",
    +                ".",
    +                "<" + RESOURCE_BASE + "irrelevant>",
    +                "  rdfs:label 'déjà vu'@fr",
    +                "."
    +                );
    +        String queryString = StrUtils.strjoinNL(
    +                QUERY_PROLOG,
    +                "SELECT ?s",
    +                "WHERE {",
    +                "    ?s text:query ( rdfs:label 'édu*' 10 ) .",
    +                "}"
    +                );
    +        Set<String> expectedURIs = Sets.newHashSet(RESOURCE_BASE + testName);
    --- End diff --
    
    Yes, in that case, definitely more important to keep things consistent.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/131#issuecomment-206240991
  
    Merged after rebasing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/131#issuecomment-205917467
  
    Looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-1134: support AnalyzingQueryParser in jena...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/131#issuecomment-217813686
  
    @joelkuiper Please create a new JIRA ticket for ComplexPhraseQueryParser support on issues.apache.org, as this issue/PR is already closed.
    
    It shouldn't be very hard to add support for ComplexPhraseQueryParser because the framework is already done - it's basically adding an extra case to `TextIndexLucene.getQueryParser()` along with a unit test that can be pretty much copied and pasted from `TestDatasetWithAnalyzingQueryParser`, except changing the actual test to something which only ComplexPhraseQueryParser can do. Pull requests also very welcome ;)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---