
Posted to dev@jena.apache.org by anujgandharv <gi...@git.apache.org> on 2017/03/14 15:26:32 UTC

[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

GitHub user anujgandharv opened a pull request:

    https://github.com/apache/jena/pull/227

    JENA-1305 | Elastic search support for Jena Text

    Implemented ES support for Jena Text Indexing capability

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/EaseTech/jena jena-1301-es-support

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/227.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #227
    
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107672239
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexESAssembler.java ---
    @@ -0,0 +1,129 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text.assembler;
    +
    +import org.apache.jena.assembler.Assembler;
    +import org.apache.jena.assembler.Mode;
    +import org.apache.jena.assembler.assemblers.AssemblerBase;
    +import org.apache.jena.query.text.*;
    +import org.apache.jena.rdf.model.RDFNode;
    +import org.apache.jena.rdf.model.Resource;
    +import org.apache.jena.rdf.model.Statement;
    +import org.apache.jena.sparql.util.graph.GraphUtils;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import static org.apache.jena.query.text.assembler.TextVocab.*;
    +
    +public class TextIndexESAssembler extends AssemblerBase {
    +
    +    private static Logger LOGGER      = LoggerFactory.getLogger(TextIndexESAssembler.class) ;
    +
    +    protected static final String COMMA = ",";
    +    protected static final String COLON = ":";
    +    /*
    +    <#index> a :TextIndexES ;
    +        text:serverList "127.0.0.1:9300,127.0.0.2:9400,127.0.0.3:9500" ; #Comma separated list of hosts:ports
    +        text:clusterName "elasticsearch"
    +        text:shards "1"
    +        text:replicas "1"
    +        text:entityMap <#endMap> ;
    +        .
    +    */
    +    
    +    @SuppressWarnings("resource")
    +    @Override
    +    public TextIndex open(Assembler a, Resource root, Mode mode) {
    +        try {
    +            String listOfHostsAndPorts = GraphUtils.getAsStringValue(root, pServerList) ;
    +            if(listOfHostsAndPorts == null || listOfHostsAndPorts.isEmpty()) {
    +                throw new TextIndexException("Mandatory property text:serverList (containing the comma-separated list of host:port) property is not specified. " +
    +                        "An example value for the property: 127.0.0.1:9300");
    +            }
    +            String[] hosts = listOfHostsAndPorts.split(COMMA);
    +            Map<String,Integer> hostAndPortMapping = new HashMap<>();
    +            for(String host : hosts) {
    +                String[] hostAndPort = host.split(COLON);
    +                if(hostAndPort.length < 2) {
    +                    LOGGER.error("Either the host or the port value is missing.Please specify the property in host:port format. " +
    +                            "Both parts are mandatory. Ignoring this value. Moving to the next one.");
    +                    continue;
    +                }
    +                hostAndPortMapping.put(hostAndPort[0], Integer.valueOf(hostAndPort[1]));
    +            }
    +
    +            String clusterName = GraphUtils.getAsStringValue(root, pClusterName);
    +            if(clusterName == null || clusterName.isEmpty()) {
    +                LOGGER.warn("ClusterName property is not specified. Defaulting to 'elasticsearch'");
    +                clusterName = "elasticsearch";
    +            }
    +
    +            String numberOfShards = GraphUtils.getAsStringValue(root, pShards);
    +            if(numberOfShards == null || numberOfShards.isEmpty()) {
    +                LOGGER.warn("shards property is not specified. Defaulting to '1'");
    +                numberOfShards = "1";
    +            }
    +
    +            String replicationFactor = GraphUtils.getAsStringValue(root, pReplicas);
    +            if(replicationFactor == null || replicationFactor.isEmpty()) {
    +                LOGGER.warn("replicas property is not specified. Defaulting to '1'");
    +                replicationFactor = "1";
    +            }
    +
    +            String indexName = GraphUtils.getAsStringValue(root, pIndexName);
    +            if(indexName == null || indexName.isEmpty()) {
    +                LOGGER.warn("index Name property is not specified. Defaulting to 'jena-text'");
    +                indexName = "jena-text";
    +            }
    +
    +            boolean isMultilingualSupport = false;
    +            Statement mlSupportStatement = root.getProperty(pMultilingualSupport);
    +            if (null != mlSupportStatement) {
    +                RDFNode mlsNode = mlSupportStatement.getObject();
    +                if (! mlsNode.isLiteral()) {
    +                    throw new TextIndexException("text:multilingualSupport property must be a string : " + mlsNode);
    +                }
    +                isMultilingualSupport = mlsNode.asLiteral().getBoolean();
    +            }
    +
    +
    +
    +            Resource r = GraphUtils.getResourceValue(root, pEntityMap) ;
    +            EntityDefinition docDef = (EntityDefinition)a.open(r) ;
    +            TextIndexConfig config = new TextIndexConfig(docDef);
    +            config.setMultilingualSupport(isMultilingualSupport);
    --- End diff --
    
    same as above, can be removed
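
The `text:serverList` handling in the quoted assembler code splits on commas and colons and passes the port straight to `Integer.valueOf`, so a non-numeric port would surface as a `NumberFormatException`, and whitespace around entries is not trimmed. A minimal sketch of a more defensive parser follows. It is an illustration only, not code from the pull request: the class name `ServerListParser`, the method `parse`, and the choice to silently skip bad entries are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of the PR): parses "host:port,host:port" lists. */
final class ServerListParser {
    private ServerListParser() {}

    /**
     * Returns a host-to-port map in input order. Entries that are not in
     * host:port form, or whose port is not an integer in 1..65535, are skipped.
     */
    static Map<String, Integer> parse(String serverList) {
        Map<String, Integer> result = new LinkedHashMap<>();
        if (serverList == null) {
            return result;
        }
        for (String entry : serverList.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2 || parts[0].trim().isEmpty()) {
                continue; // host or port missing
            }
            try {
                int port = Integer.parseInt(parts[1].trim());
                if (port >= 1 && port <= 65535) {
                    result.put(parts[0].trim(), port);
                }
            } catch (NumberFormatException e) {
                // Port is not a number: skip this entry, as with other malformed values.
            }
        }
        return result;
    }
}
```

With this sketch, `ServerListParser.parse("127.0.0.1:9300, 127.0.0.2:abc")` keeps only the first entry. Whether malformed entries should be skipped, logged, or rejected outright is a design choice for the assembler; the quoted code logs an error and continues.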



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma @ajs6f I have made the necessary changes to the ES TextIndex based on the changes in #226.
    I want to bring one thing to your attention:
    If the query is `?s text:query ('word' 'lang:en')`, the query method receives the arguments `null, "word", null, "en"` (the property is null), and NOT `RDFS.label.asNode(), "word", null, "en"`.
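
One way an index implementation might cope with the null property described in the comment above is to fall back to a configured primary property before building the search. The sketch below is an assumption about how that could look, not code from the pull request: `QueryPropertyDefaults` and `effectiveProperty` are hypothetical names, and plain strings stand in for Jena `Node` values to keep the example self-contained.

```java
import java.util.Objects;

/** Hypothetical helper (not part of the PR): chooses the property to search on. */
final class QueryPropertyDefaults {
    private QueryPropertyDefaults() {}

    /**
     * Returns the property supplied by the caller, or the index's primary
     * property when the caller passed null (as happens when the query names
     * no property explicitly).
     */
    static String effectiveProperty(String requested, String primary) {
        Objects.requireNonNull(primary, "primary property must be configured");
        return requested != null ? requested : primary;
    }
}
```

For a query such as `?s text:query ('word' 'lang:en')`, a caller would pass `null` as `requested` and the entity definition's primary property as `primary`, so the search still targets a concrete field.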




[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106154731
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    --- End diff --
    
    Agree. Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106153318
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestTextIndexES.java ---
    @@ -0,0 +1,184 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.test.ESIntegTestCase;
    +import org.junit.Assert;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + *
    + * Integration test for {@link TextIndexES} class
    + * ES Integration test depends on security policies that may sometime not be loaded properly.
    + * If you find any issues regarding security set the following VM argument to resolve the issue:
    + * -Dtests.security.manager=false
    + *
    + */
    +@ESIntegTestCase.ClusterScope()
    +public class TestTextIndexES extends ESIntegTestCase {
    --- End diff --
    
    Unfortunately, I do not currently know of a mechanism to suppress the logs, but I will dig a bit deeper.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106151647
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/ESSettings.java ---
    @@ -0,0 +1,177 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +/**
    + * Settings for ElasticSearch based indexing
    + */
    +public class ESSettings {
    +
    +    /**
    +     * Map of hosts and ports. The host could also be an IP Address
    +     */
    +    private Map<String,Integer> hostToPortMapping;
    +
    +    /**
    +     * Name of the Cluster. Defaults to 'elasticsearch'
    +     */
    +    private String clusterName;
    +
    +    /**
    +     * Number of shards. Defaults to '1'
    +     */
    +    private Integer shards;
    +
    +    /**
    +     * Number of replicas. Defaults to '1'
    +     */
    +    private Integer replicas;
    +
    +    /**
    +     * Name of the index. Defaults to 'test'
    +     */
    +    private String indexName;
    +
    +
    +    public Map<String, Integer> getHostToPortMapping() {
    +        return hostToPortMapping;
    +    }
    +
    +    public void setHostToPortMapping(Map<String, Integer> hostToPortMapping) {
    +        this.hostToPortMapping = hostToPortMapping;
    +    }
    +
    +    public ESSettings.Builder builder() {
    +        return new ESSettings.Builder();
    +    }
    +
    +    /**
    +     * Convenient builder class for building ESSettings
    +     */
    +    public static class Builder {
    +
    +        ESSettings settings;
    +
    +        public Builder() {
    +            this.settings = new ESSettings();
    +            this.settings.setClusterName("elasticsearch");
    +            this.settings.setShards(1);
    +            this.settings.setReplicas(1);
    +            this.settings.setHostToPortMapping(new HashMap<>());
    +            this.settings.setIndexName("test");
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106428015
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    --- End diff --
    
    I completely agree, but unfortunately I have no way to test it. That was the main reason I resorted to this. I can't test it because ES has stopped releasing a plugin for Painless scripting, so I cannot write any tests without resorting to some ugly workaround. I have tested the above script locally and it works, which is why I have left it like this.
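
    One part of the script handling in the diff above can be checked without an ES node: the template is filled in with `String.replaceAll`, which treats `$` and `\` in the replacement string specially. The following is a minimal sketch in plain Java, not the PR's code; the class name `ScriptTemplateSketch`, the shortened template and the helper names are hypothetical. It contrasts `replaceAll` with the literal `String.replace`.

```java
public class ScriptTemplateSketch {

    // Shortened, hypothetical stand-in for the script template used in the diff.
    static final String TEMPLATE = "ctx._source.<fieldName>.add('<fieldValue>')";

    /** Fills the template with literal replacement, so '$' and '\' in values are kept as-is. */
    static String fill(String field, String value) {
        return TEMPLATE
                .replace("<fieldName>", field)
                .replace("<fieldValue>", value);
    }

    /** Returns true if replaceAll rejects the value as a replacement string. */
    static boolean replaceAllFails(String value) {
        try {
            TEMPLATE.replaceAll("<fieldValue>", value);
            return false;
        } catch (IndexOutOfBoundsException | IllegalArgumentException e) {
            // "$5" is parsed as a reference to capturing group 5, which does not exist.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(fill("label", "costs $5"));   // ctx._source.label.add('costs $5')
        System.out.println(replaceAllFails("costs $5")); // true
        System.out.println(replaceAllFails("plain"));    // false
    }
}
```

    Literal replacement only addresses the regex issue. A value containing a single quote would still end the script's string literal early, so passing values as script parameters, where the client version in use supports it, is probably the more robust route.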



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106143728
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestTextIndexES.java ---
    @@ -0,0 +1,184 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.test.ESIntegTestCase;
    +import org.junit.Assert;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + *
    + * Integration test for {@link TextIndexES} class
    + * ES Integration test depends on security policies that may sometime not be loaded properly.
    + * If you find any issues regarding security set the following VM argument to resolve the issue:
    + * -Dtests.security.manager=false
    + *
    + */
    +@ESIntegTestCase.ClusterScope()
    +public class TestTextIndexES extends ESIntegTestCase {
    --- End diff --
    
    It's great that you have working unit/integration tests! With your latest commit they are also properly hooked to `mvn test`.
    
    However, the tests produce quite a lot of output that seems unnecessary to me. Is there something you could do to reduce that? Generally tests that run successfully shouldn't produce any output when run via `mvn test`.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I'm just catching up with this PR, so apologies if I turn out to have missed part of the conversation!
    
    If there is difficulty getting tests integrated into the Maven lifecycle within one module, we do now have the option of opening some integration tests in the module that now exists for this purpose (`jena-integration-tests`).
    
    As far as embedded servers go, I'm not as familiar with ES as I would like to be, but my understanding is that [ES is moving in the same direction as Solr](https://www.elastic.co/blog/elasticsearch-the-server) (booooo!) and will not support embedded operation or even a proper WAR (booooo!). So if we want to support it for the long-term, we need to find a pattern that works for running tests against it. On that front, this looks intriguing:
    
    https://github.com/alexcojocaru/elasticsearch-maven-plugin



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106402731
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    --- End diff --
    
    Yes, but the comment here still says it defaults to `test`; it should be updated as well.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106142437
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/ESSettings.java ---
    @@ -0,0 +1,177 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +/**
    + * Settings for ElasticSearch based indexing
    + */
    +public class ESSettings {
    +
    +    /**
    +     * Map of hosts and ports. The host could also be an IP Address
    +     */
    +    private Map<String,Integer> hostToPortMapping;
    +
    +    /**
    +     * Name of the Cluster. Defaults to 'elasticsearch'
    +     */
    +    private String clusterName;
    +
    +    /**
    +     * Number of shards. Defaults to '1'
    +     */
    +    private Integer shards;
    +
    +    /**
    +     * Number of replicas. Defaults to '1'
    +     */
    +    private Integer replicas;
    +
    +    /**
    +     * Name of the index. Defaults to 'test'
    +     */
    +    private String indexName;
    +
    +
    +    public Map<String, Integer> getHostToPortMapping() {
    +        return hostToPortMapping;
    +    }
    +
    +    public void setHostToPortMapping(Map<String, Integer> hostToPortMapping) {
    +        this.hostToPortMapping = hostToPortMapping;
    +    }
    +
    +    public ESSettings.Builder builder() {
    +        return new ESSettings.Builder();
    +    }
    +
    +    /**
    +     * Convenient builder class for building ESSettings
    +     */
    +    public static class Builder {
    +
    +        ESSettings settings;
    +
    +        public Builder() {
    +            this.settings = new ESSettings();
    +            this.settings.setClusterName("elasticsearch");
    +            this.settings.setShards(1);
    +            this.settings.setReplicas(1);
    +            this.settings.setHostToPortMapping(new HashMap<>());
    +            this.settings.setIndexName("test");
    --- End diff --
    
    Should default to "jena-text"
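
    One way to keep the default and its documentation from drifting apart is to define the default once as a named constant and have the builder start from it. This is a minimal sketch only: `IndexSettingsSketch`, `DEFAULT_INDEX_NAME` and the fluent `indexName(...)` method are hypothetical names, not part of the PR.

```java
public class IndexSettingsSketch {

    /** Single source of truth for the default index name. */
    static final String DEFAULT_INDEX_NAME = "jena-text";

    private final String indexName;

    private IndexSettingsSketch(String indexName) {
        this.indexName = indexName;
    }

    String getIndexName() {
        return indexName;
    }

    /** Static factory, so no settings instance is needed to obtain a builder. */
    static Builder builder() {
        return new Builder();
    }

    static class Builder {
        // Starts from the documented default; callers override it only when needed.
        private String indexName = DEFAULT_INDEX_NAME;

        Builder indexName(String name) {
            this.indexName = name;
            return this;
        }

        IndexSettingsSketch build() {
            return new IndexSettingsSketch(indexName);
        }
    }

    public static void main(String[] args) {
        System.out.println(builder().build().getIndexName());                 // jena-text
        System.out.println(builder().indexName("other").build().getIndexName()); // other
    }
}
```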



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106146503
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exists. so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First Search of the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    --- End diff --
    
    Similar to the update operation, this may cause a race condition if multiple deletes happen simultaneously. Would it be possible to do an atomic delete here, scripted in such a way that there is no error thrown even if the field value to delete doesn't exist in the index?
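
    The behaviour being asked for can be modelled in plain Java, independently of ES: a single removal step that succeeds silently when the field or value is missing, so no prior existence check is needed. This is a minimal in-memory sketch, assuming a document is represented as a map from field name to a list of values; the class and method names are hypothetical and it does not call the ES API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdempotentRemoveSketch {

    /**
     * Removes one occurrence of value from the list stored under field.
     * Does nothing if the field is absent, empty, or lacks the value.
     * Returns true only when something was actually removed.
     */
    static boolean removeValue(Map<String, List<String>> source, String field, String value) {
        List<String> values = source.get(field);
        if (values == null || values.isEmpty()) {
            return false;
        }
        // List.remove(Object) removes the first match and returns false if there is none.
        return values.remove(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("label", new ArrayList<>(Arrays.asList("cat", "dog")));

        System.out.println(removeValue(doc, "label", "cat"));  // true
        System.out.println(removeValue(doc, "label", "cat"));  // false, already gone
        System.out.println(removeValue(doc, "comment", "x"));  // false, no such field
        System.out.println(doc.get("label"));                  // [dog]
    }
}
```

    Expressed as one update script, the same check-and-remove logic would run on the server as part of a single update request, which avoids the separate search round trip that opens the race window.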



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106140485
  
    --- Diff: jena-parent/pom.xml ---
    @@ -258,37 +257,92 @@
             <version>${ver.lucene}</version>
           </dependency>
     
    -      <!-- Solr client -->
    -      <!-- Exclusion of slf4j: Necessary so as to pick the version we want. 
    -           solrj->zookeeper has a dependency on slf4j -->
    +      <!-- For jena-spatial -->
    +      <dependency>
    +        <groupId>org.apache.lucene</groupId>
    +        <artifactId>lucene-spatial</artifactId>
    +        <version>${ver.lucene}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.apache.lucene</groupId>
    +        <artifactId>lucene-spatial-extras</artifactId>
    +        <version>${ver.lucene}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.locationtech.spatial4j</groupId>
    +        <artifactId>spatial4j</artifactId>
    +        <version>${ver.spatial4j}</version>
    +      </dependency>
     
    +      <!-- ES dependencies-->
           <dependency>
    -        <artifactId>solr-solrj</artifactId>
    -        <groupId>org.apache.solr</groupId>
    -        <version>${ver.solr}</version>
    +        <groupId>org.elasticsearch</groupId>
    +        <artifactId>elasticsearch</artifactId>
    +        <version>${ver.elasticsearch}</version>
             <exclusions>
               <exclusion>
    -            <groupId>org.slf4j</groupId>
    -            <artifactId>slf4j-api</artifactId>
    +            <groupId>commons-logging</groupId>
    +            <artifactId>commons-logging</artifactId>
               </exclusion>
               <exclusion>
    -            <groupId>org.slf4j</groupId>
    -            <artifactId>slf4j-jdk14</artifactId>
    +            <groupId>org.hamcrest</groupId>
    +            <artifactId>hamcrest-core</artifactId>
               </exclusion>
             </exclusions>
    +
           </dependency>
     
    -      <!-- For jena-spatial -->
    +      <dependency>
    +        <groupId>org.elasticsearch.client</groupId>
    +        <artifactId>transport</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +        <exclusions>
    +          <exclusion>
    +            <groupId>commons-logging</groupId>
    +            <artifactId>commons-logging</artifactId>
    +          </exclusion>
    +          <exclusion>
    +            <groupId>org.hamcrest</groupId>
    +            <artifactId>hamcrest-core</artifactId>
    +          </exclusion>
    +        </exclusions>
    +      </dependency>
    +
    +
           <dependency>
             <groupId>org.apache.lucene</groupId>
    -        <artifactId>lucene-spatial</artifactId>
    +        <artifactId>lucene-test-framework</artifactId>
    --- End diff --
    
    Is the dependency on lucene-test-framework necessary? I commented it out in the POM and the build and tests ran just fine.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Thanks @osma. Can you point me to the SNAPSHOT repo, please?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r108438170
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +138,72 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
             </configuration>
    +          <executions>
    +              <execution>
    +                  <id>unit-tests</id>
    +                  <phase>test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/TS_*.java</include>
    +                      </includes>
    +                      <excludes>
    +                          <exclude>**/*IT.java</exclude>
    +                      </excludes>
    +                  </configuration>
    +              </execution>
    +              <execution>
    +                  <id>integration-tests</id>
    +                  <phase>integration-test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/*IT.java</include>
    +                      </includes>
    +                  </configuration>
    +              </execution>
    +          </executions>
           </plugin>
    +        <plugin>
    +            <groupId>com.github.alexcojocaru</groupId>
    +            <artifactId>elasticsearch-maven-plugin</artifactId>
    +            <!-- REPLACE THE FOLLOWING WITH THE PLUGIN VERSION YOU NEED -->
    +            <version>5.2</version>
    +            <configuration>
    +                <clusterName>elasticsearch</clusterName>
    +                <transportPort>9500</transportPort>
    +                <httpPort>9400</httpPort>
    +            </configuration>
    +            <executions>
    +                <!--
    --- End diff --
    
    If the default bindings are the ones we want, why do we repeat them below?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106097571
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,425 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    --- End diff --
    
    Could you catch exceptions within this class and, if necessary, throw a TextIndexException that wraps the original exception? See how TextIndexLucene does it. That way you don't need a `throws Exception` declaration, which passes the responsibility for exception handling to the caller. (TextIndexException is a subclass of RuntimeException, so it doesn't need a `throws` declaration.)
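
    A minimal sketch of that pattern in plain Java. `SketchTextIndexException` and `SketchIndex` are hypothetical stand-ins, not Jena classes, and URI parsing is only an assumed placeholder for whichever client calls throw checked exceptions in the real constructor.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for TextIndexException, which the comment above
// describes as a subclass of RuntimeException.
class SketchTextIndexException extends RuntimeException {
    SketchTextIndexException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical index class; only the exception-handling shape matters here.
class SketchIndex {
    private final URI endpoint;

    // No "throws" clause: the checked exception is caught here and rethrown
    // as an unchecked exception that keeps the original as its cause.
    SketchIndex(String endpointSpec) {
        try {
            this.endpoint = new URI(endpointSpec);
        } catch (URISyntaxException e) {
            throw new SketchTextIndexException("Invalid endpoint: " + endpointSpec, e);
        }
    }

    URI endpoint() {
        return endpoint;
    }
}
```

    Callers can then construct the index without a try/catch of their own, and the original failure stays reachable through `getCause()`.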



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106236720
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,39 +81,50 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    -    <!-- Solr client -->
    -    <dependency>
    -      <artifactId>solr-solrj</artifactId>
    -      <groupId>org.apache.solr</groupId>
    -    </dependency>
    -
    -    <!-- Embedded server if used for testing
    -    <dependency>
    -      <artifactId>solr-core</artifactId>
    -      <groupId>org.apache.solr</groupId>
    -      <version>${ver.solr}</version>
    -      <type>jar</type>
    -      <scope>test</scope>
    -      <optional>true</optional>
    -      <exclusions>
    -        <exclusion>
    -          <groupId>org.slf4j</groupId>
    -          <artifactId>slf4j-api</artifactId>
    -        </exclusion>
    -        <exclusion>
    -          <groupId>org.slf4j</groupId>
    -          <artifactId>slf4j-jdk14</artifactId>
    -        </exclusion>
    -      </exclusions>
    -    </dependency>
    -
    -    <dependency>
    -      <groupId>javax.servlet</groupId>
    -      <artifactId>servlet-api</artifactId>
    -      <version>2.5</version>
    -      <scope>test</scope>
    -    </dependency>
    -    -->
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.apache.lucene</groupId>
    +          <artifactId>lucene-test-framework</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.test</groupId>
    +          <artifactId>framework</artifactId>
    +      </dependency>
    +
    +      <!-- This is required to by pass ES JAR Hell in test environment-->
    +      <dependency>
    +          <groupId>junit</groupId>
    +          <artifactId>junit</artifactId>
    +          <exclusions>
    +              <exclusion>
    +                  <groupId>org.hamcrest</groupId>
    +                  <artifactId>hamcrest-core</artifactId>
    +              </exclusion>
    +          </exclusions>
    +      </dependency>
    +
    +      <dependency>
    --- End diff --
    
    Ah sorry, missed that.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I suggest you move the existing tests from TestTextIndexES to the integration tests. The tests that rely on an embedded ES have no future anyway, since the ES embedded mode is already crippled and is going away completely soon.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106155260
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    --- End diff --
    
    The way it works in the ES implementation is that we store each value in exactly one field, but during search we query both the language-specific and the language-agnostic fields.
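
    A rough sketch of that idea, independent of the ES client. The `field_lang` naming convention and the `LangFieldSketch` helper are assumptions made for illustration; the naming used in the PR itself may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the "<field>_<lang>" naming is an assumed convention.
class LangFieldSketch {

    // Indexing: each value is written to exactly one field, either the
    // language-specific one or the plain (language-agnostic) one.
    static String storageField(String field, String lang) {
        return (lang == null || lang.isEmpty()) ? field : field + "_" + lang;
    }

    // Searching: when a language is given, look in the language-specific
    // field and also in the language-agnostic field.
    static List<String> searchFields(String field, String lang) {
        List<String> fields = new ArrayList<>();
        if (lang != null && !lang.isEmpty()) {
            fields.add(field + "_" + lang);
        }
        fields.add(field);
        return fields;
    }
}
```

    With this shape a value is never duplicated in the index, and a query with a language tag can still match literals that were stored without one.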



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106145497
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,425 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    --- End diff --
    
    Done



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma Let me try to merge your changes from #226 into my code and see if I can turn it around today.




[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107671133
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +141,73 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
    +            <!--<excludedGroups>org.apache.jena.query.text.IntegrationTest</excludedGroups>-->
             </configuration>
    +          <executions>
    +              <execution>
    +                  <id>unit-tests</id>
    +                  <phase>test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/TS_*.java</include>
    +                      </includes>
    +                      <excludes>
    +                          <exclude>**/*IT.java</exclude>
    +                      </excludes>
    +                  </configuration>
    +              </execution>
    +              <execution>
    +                  <id>integration-tests</id>
    +                  <phase>integration-test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/*IT.java</include>
    +                      </includes>
    +                  </configuration>
    +              </execution>
    +          </executions>
           </plugin>
    +        <plugin>
    +            <groupId>com.github.alexcojocaru</groupId>
    +            <artifactId>elasticsearch-maven-plugin</artifactId>
    +            <!-- REPLACE THE FOLLOWING WITH THE PLUGIN VERSION YOU NEED -->
    +            <version>5.2</version>
    +            <configuration>
    +                <clusterName>elasticsearch</clusterName>
    +                <tcpPort>9300</tcpPort>
    --- End diff --
    
    These are the default ES ports. If an ES instance is already running, the ports will clash, which is quite likely for someone who intends to use jena-text with ES. I suggest switching to other ports, e.g. 9500 and 9400.
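
    To illustrate the clash, here is a small sketch that checks whether a local port can still be bound. `PortCheckSketch` is a hypothetical helper and is not part of the PR or of the elasticsearch-maven-plugin; it only shows why a second instance cannot start on a port that is already taken.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper, not part of the PR.
class PortCheckSketch {

    // Returns true if a listening socket can be bound to the port right now.
    // Binding fails with an IOException when another process (for example an
    // already running ES node) is listening on the same port.
    static boolean isPortFree(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

    On a machine with a default ES node running, `isPortFree(9300)` would be expected to return false, which is the situation the test configuration should avoid.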



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma @ajs6f I have added integration tests for the ES-based indexing strategy. Could you please review them and let me know whether they are fine and whether I missed anything?
    
    I do not have any more pending tasks for ES-based indexing, unless I missed a review comment. Let me know what you think.




[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107670815
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,6 +81,35 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <!-- This is required to by pass ES JAR Hell in test environment-->
    --- End diff --
    
    The comment seems outdated. Wasn't this related to the embedded ES tests that have now been removed?


[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by kinow <gi...@git.apache.org>.

Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I think it can only be closed by the OP or by a commit mentioning the PR number. Some time ago in Commons I think I had to close a PR, and I used an empty commit like this:
    
        git commit --allow-empty -m "This closes #123456"
    
    Not sure if that would work here though. Hope that helps.


[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106198915
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    --- End diff --
    
    Excellent comment. I will rework the code to make it safe and remove the extra REST calls.


[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106626513
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    This is related to the jena-text functionality for [storing literal values](https://jena.apache.org/documentation/query/text-query.html#storing-literal-values). If you use a query like `(?s ?score ?literal) text:query 'word'`, then `?s` will be bound to the entity/subject, `?score` to the score returned by the index, and `?literal` to the original literal value. For the Lucene implementation, storing of literal values has to be explicitly enabled for it to work, but for the ES backend, you need to keep the original values anyway because of the document-per-entity approach. My thought here was that since the ES index should know both the original value and the language tag, it should be possible to return a language-tagged value as the literal.
    
    This would be useful for some applications; e.g. my application Skosmos relies on this feature (including the original language tags) to avoid extra TDB lookups and thus can use simpler SPARQL queries that also perform better than before this feature was implemented in jena-text.
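
As a rough illustration of the idea, the sketch below recovers a value together with its language tag from a document source map. It assumes the `field + "_" + language` naming used for multilingual fields in `addEntity` in the quoted diff; the class, the `TaggedValue` holder, and `firstValue` are hypothetical names, and the code deliberately uses only the JDK rather than the Jena or Elasticsearch APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; not part of the pull request.
class LangTaggedLiteralSketch {

    /** Lexical form plus language tag; the tag is "" when the field carries no language suffix. */
    static final class TaggedValue {
        final String lexical;
        final String lang;

        TaggedValue(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return lang.isEmpty() ? lexical : lexical + "@" + lang;
        }
    }

    /**
     * Returns the first non-empty value stored under baseField or under a
     * language-suffixed variant such as baseField + "_en", or null if none exists.
     */
    static TaggedValue firstValue(Map<String, List<String>> source, String baseField) {
        String prefix = baseField + "_";
        for (Map.Entry<String, List<String>> entry : source.entrySet()) {
            List<String> values = entry.getValue();
            if (values == null || values.isEmpty()) {
                continue; // placeholder fields are stored as empty lists
            }
            String key = entry.getKey();
            if (key.equals(baseField)) {
                return new TaggedValue(values.get(0), "");
            }
            if (key.startsWith(prefix)) {
                // Assumption: everything after "<field>_" is the language tag.
                return new TaggedValue(values.get(0), key.substring(prefix.length()));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> source = new LinkedHashMap<>();
        source.put("label", Collections.<String>emptyList());
        source.put("label_en", Arrays.asList("cat", "feline"));
        System.out.println(firstValue(source, "label")); // prints "cat@en"
    }
}
```

In the actual backend, the recovered pair would presumably be turned into a language-tagged literal node instead of the plain literal created in the quoted `query` method. The suffix convention is ambiguous when one field name is a prefix of another (for example `label` and `label_alt`), so a real implementation might prefer to store the language tag explicitly.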


[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106146265
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    --- End diff --
    
    Updates now consist of two ES operations: first checking whether the entity+field exists, and then doing either an add or an update depending on the result. I wonder if a race condition is possible here, if many additions happen around the same time from multiple threads? I think that a single-operation atomic update would be preferable, if it is possible to implement with ES scripting. It would likely also perform much better, since the overhead of an extra ES HTTP request is likely quite significant considering that we are dealing with individual triples here.
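
The concern above can be sketched with an in-memory analogy. This is not Elasticsearch code: a `ConcurrentHashMap` stands in for the index, and the class and method names are hypothetical. It contrasts a two-step check-then-act update with a single atomic append.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

final class AtomicAppendSketch {

    // Stand-in for the index: "entity id + field" -> stored values.
    private final Map<String, List<String>> index = new ConcurrentHashMap<>();

    // Two steps: check, then add or update. Two threads can both observe
    // "missing" and each put a fresh list, so one of the values is lost.
    void checkThenAct(String key, String value) {
        if (!index.containsKey(key)) {
            List<String> values = new CopyOnWriteArrayList<>();
            values.add(value);
            index.put(key, values);
        } else {
            index.get(key).add(value);
        }
    }

    // One step: the list is created atomically if absent, and the append
    // itself is thread safe, so no value is lost.
    void atomicAppend(String key, String value) {
        index.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(value);
    }

    int count(String key) {
        List<String> values = index.get(key);
        return values == null ? 0 : values.size();
    }

    // Appends perThread values from each of the given number of threads
    // using the atomic variant and returns how many values were stored.
    static int runConcurrently(int threads, int perThread) {
        AtomicAppendSketch sketch = new AtomicAppendSketch();
        String key = "http://example.org/s1|label";
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    sketch.atomicAppend(key, "v" + id + "-" + i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sketch.count(key);
    }

    public static void main(String[] args) {
        System.out.println("stored values: " + runConcurrently(4, 500));
    }
}
```

For ES itself, the analogous single operation would be a scripted update combined with an upsert document, as described in the upsert documentation linked from the diff.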



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106401814
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,6 +81,51 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <dependency>
    --- End diff --
    
    Likewise here, the jena-test-framework dependency seems unnecessary to me.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106402500
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,64 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.query.Dataset;
    +import org.apache.jena.query.DatasetFactory;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition we cant have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    --- End diff --
    
    Your commits seem to remove JenaTextExample.java on which this code is based. Probably you have renamed the file in a commit? Could you bring the original back?



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Thanks @osma. I think the index will become much simpler if we remove the unused methods.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106142603
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
    @@ -27,7 +27,7 @@
     import org.apache.jena.system.JenaSystem ;
     import org.apache.lucene.analysis.Analyzer;
     import org.apache.lucene.store.Directory ;
    -import org.apache.solr.client.solrj.SolrServer ;
    +import org.elasticsearch.indices.IndexCreationException;
    --- End diff --
    
    You should not need this import here. Exceptions in index creation should be thrown as TextIndexExceptions from within TextIndexES.
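
A minimal sketch of the wrapping pattern suggested above, so that callers never see backend-specific exception types. `TextIndexExceptionStandIn` is a hypothetical stand-in for jena-text's `TextIndexException`, and `IndexSetup` is an illustrative interface; neither is real API.

```java
import java.io.IOException;

final class IndexExceptionWrappingSketch {

    // Hypothetical stand-in for the unchecked TextIndexException.
    static final class TextIndexExceptionStandIn extends RuntimeException {
        TextIndexExceptionStandIn(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Represents backend setup work that may throw backend-specific exceptions.
    interface IndexSetup {
        void run() throws Exception;
    }

    // All backend failures are translated here, inside the index class,
    // so the factory that calls it needs no backend imports.
    static void createIndex(IndexSetup setup) {
        try {
            setup.run();
        } catch (Exception e) {
            throw new TextIndexExceptionStandIn("Exception occurred while instantiating the text index", e);
        }
    }

    public static void main(String[] args) {
        try {
            createIndex(() -> { throw new IOException("node unreachable"); });
        } catch (TextIndexExceptionStandIn e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause().getMessage() + ")");
        }
    }
}
```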



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Thanks @ajs6f for the Maven ElasticSearch Plugin link. It looks like this would enable us to spin up a fully functional single-node ES for our integration tests. Can you shed some more light on how I can reuse my test as an integration test in Jena? Is there something Jena-specific, or do I use the standard Maven way of executing integration tests?



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Hi @osma, I need one more favour from you. I need to understand the scenario in which the 'get' method of TextIndex gets called. Can you provide an example SPARQL query that I can run from my JenaESTextExample.java class and that would result in a call to the 'get' method?



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma I have merged the changes from Master into my branch. 
    I am fine with merging the code on Monday/Tuesday. Can you also let me know when 3.3.0 will be released? For now, so that we are not blocked from using the ES functionality, I am maintaining a local branch of Jena into which I have merged the changes from this branch. Obviously, I want to get rid of it ASAP, and for that I need 3.3.0 from Apache Jena's Maven repo. Is there a planned release coming up anytime soon?



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I think that if we are confident that we have solid test coverage and that it runs correctly, we can merge and clean up as time permits. I will take a look at the Maven setup and see if anything worries me.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by afs <gi...@git.apache.org>.

Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106150756
  
    --- Diff: jena-parent/pom.xml ---
    @@ -258,37 +257,92 @@
             <version>${ver.lucene}</version>
           </dependency>
     
    -      <!-- Solr client -->
    -      <!-- Exclusion of slf4j: Necessary so as to pick the version we want. 
    -           solrj->zookeeper has a dependency on slf4j -->
    +      <!-- For jena-spatial -->
    +      <dependency>
    +        <groupId>org.apache.lucene</groupId>
    +        <artifactId>lucene-spatial</artifactId>
    +        <version>${ver.lucene}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.apache.lucene</groupId>
    +        <artifactId>lucene-spatial-extras</artifactId>
    +        <version>${ver.lucene}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.locationtech.spatial4j</groupId>
    +        <artifactId>spatial4j</artifactId>
    +        <version>${ver.spatial4j}</version>
    +      </dependency>
     
    +      <!-- ES dependencies-->
           <dependency>
    -        <artifactId>solr-solrj</artifactId>
    -        <groupId>org.apache.solr</groupId>
    -        <version>${ver.solr}</version>
    +        <groupId>org.elasticsearch</groupId>
    +        <artifactId>elasticsearch</artifactId>
    +        <version>${ver.elasticsearch}</version>
             <exclusions>
               <exclusion>
    -            <groupId>org.slf4j</groupId>
    -            <artifactId>slf4j-api</artifactId>
    +            <groupId>commons-logging</groupId>
    --- End diff --
    
    Jena replaces commons-logging with org.slf4j::jcl-over-slf4j, the SLF4J adapter. As a consequence, we have to exclude it from dependencies that pull it in. This isn't the only case: org.apache.httpcomponents::httpclient and others need the same exclusion.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107681054
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,435 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.commons.lang3.exception.ExceptionUtils;
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.engine.DocumentMissingException;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'jena-text'
    +     */
    +    private final String indexName;
    +
    +    /**
    +     * The parameter representing the cluster name key
    +     */
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    /**
    +     * The parameter representing the number of shards key
    +     */
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    /**
    +     * The parameter representing the number of replicas key
    +     */
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    private static final String DASH = "-";
    +
    +    private static final String UNDERSCORE = "_";
    +
    +    private static final String COLON = ":";
    +
    +    private static final String ASTREIX = "*";
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106428563
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occurred while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    --- End diff --
    
    So the field name cannot be passed as a parameter. The field value can, but I do not have a mechanism for testing it properly, for the same reasons I cited below. Actually, I can test it by running a local instance of the cluster. Let me try that out.
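
    If the field name does have to be spliced into the script text, one option is to check it against a conservative identifier pattern first, so that only the value travels as a parameter. The sketch below is a hedged illustration, not the code in this pull request: the class name, the pattern, and the `params.fieldValue` reference (which assumes the Painless convention for script parameters) are all assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the pull request.
final class ScriptFieldNames {

    // Assumed rule: letters, digits and underscores only, not starting with a digit.
    private static final Pattern SAFE_FIELD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Returns the name unchanged if it is safe to splice into script text, otherwise throws. */
    public static String requireSafeFieldName(String name) {
        if (name == null || !SAFE_FIELD.matcher(name).matches()) {
            throw new IllegalArgumentException("Unsafe field name for script: " + name);
        }
        return name;
    }

    /** Builds the add/update script with only the validated field name spliced in. */
    public static String addUpdateScript(String fieldName) {
        String safe = requireSafeFieldName(fieldName);
        // String.replace substitutes literally, so no regex syntax is interpreted.
        return ("if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) "
                + "{ctx._source.<fieldName>=[params.fieldValue] } "
                + "else {ctx._source.<fieldName>.add(params.fieldValue)}")
                .replace("<fieldName>", safe);
    }

    public static void main(String[] args) {
        System.out.println(addUpdateScript("label_en"));
    }
}
```

    A name such as `label_en` passes the check, while anything containing quotes, dots or parentheses is rejected before it reaches the script.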



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106151618
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/ESSettings.java ---
    @@ -0,0 +1,177 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +/**
    + * Settings for ElasticSearch based indexing
    + */
    +public class ESSettings {
    +
    +    /**
    +     * Map of hosts and ports. The host could also be an IP Address
    +     */
    +    private Map<String,Integer> hostToPortMapping;
    +
    +    /**
    +     * Name of the Cluster. Defaults to 'elasticsearch'
    +     */
    +    private String clusterName;
    +
    +    /**
    +     * Number of shards. Defaults to '1'
    +     */
    +    private Integer shards;
    +
    +    /**
    +     * Number of replicas. Defaults to '1'
    +     */
    +    private Integer replicas;
    +
    +    /**
    +     * Name of the index. Defaults to 'test'
    --- End diff --
    
    Done.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106403936
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occurred while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    --- End diff --
    
    Same comment as for `addEntity`: you should use ES named parameters here instead of `replaceAll`.
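
    To illustrate the difference, here is a minimal sketch in plain Java with no Elasticsearch dependency. It contrasts splicing the value into the script text with keeping the script constant and carrying the value in a separate parameter map. The class name, the hard-coded `label` field and the `params.valueToRemove` reference (Painless convention) are assumptions for illustration. How the map is handed to `Script` depends on the client version and is deliberately not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of the pull request.
final class ScriptParamsSketch {

    // Constant script text: the value is looked up by name when the script runs.
    public static final String REMOVE_SCRIPT =
            "if (ctx._source.label != null && ctx._source.label.indexOf(params.valueToRemove) >= 0) "
          + "{ ctx._source.label.remove(ctx._source.label.indexOf(params.valueToRemove)) }";

    /** Reduced form of the approach in the diff: the raw value becomes part of the script. */
    public static String splicedScript(String valueToRemove) {
        // replaceAll also gives '$' and '\' in the replacement a special meaning.
        return "ctx._source.label.indexOf('<valueToRemove>')"
                .replaceAll("<valueToRemove>", valueToRemove);
    }

    /** Parameter map that would accompany the constant script. */
    public static Map<String, Object> paramsFor(String valueToRemove) {
        Map<String, Object> params = new HashMap<>();
        params.put("valueToRemove", valueToRemove);
        return params;
    }

    public static void main(String[] args) {
        String value = "O'Brien";
        // The apostrophe ends the string literal early inside the spliced script.
        System.out.println(splicedScript(value));
        // The constant script is unaffected by the value's content.
        System.out.println(REMOVE_SCRIPT);
        System.out.println(paramsFor(value));
    }
}
```

    With parameters the script text stays identical across calls, and a label containing a quote character cannot change what the script does.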



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106429441
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,64 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.query.Dataset;
    +import org.apache.jena.query.DatasetFactory;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition, we can't have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    +
    +    public static void main(String[] args) {
    +
    +        queryData(loadData(createAssembler()));
    +    }
    +
    +
    +    private static Dataset createAssembler() {
    +        String assemblerFile = "text-config-es.ttl";
    +        Dataset ds = DatasetFactory.assemble(assemblerFile,
    +                "http://localhost/jena_example/#text_dataset") ;
    +        return ds;
    +    }
    +
    +    private static Dataset loadData(Dataset ds) {
    +        JenaTextExample1.loadData(ds, "data-es.ttl");
    +        return ds;
    +    }
    +
    +    /**
    +     * Query Data
    +     * @param ds
    +     */
    +    private static void queryData(Dataset ds) {
    +        JenaTextExample1.queryData(ds);
    --- End diff --
    
    Actually, since I am loading the ES-specific assembler and building the dataset from it, this is fine.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107681399
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/it/TextIndexESIT.java ---
    @@ -0,0 +1,282 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text.it;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.query.text.Entity;
    +import org.apache.jena.query.text.TextHit;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.TimeUnit;
    +
    +/**
    + * Integration test class for {@link org.apache.jena.query.text.TextIndexES}
    + */
    +public class TextIndexESIT extends BaseESTest {
    +
    +    @Test
    +    public void testAddEntity() {
    +        String labelKey = "label";
    +        String labelValue = "this is a sample Label";
    +        Assert.assertNotNull(classToTest);
    +        Entity entityToAdd = entity("http://example/x3", labelKey, labelValue);
    +        GetResponse response = addEntity(entityToAdd);
    +        Assert.assertTrue(response.getSource().containsKey(labelKey));
    +        Assert.assertEquals(labelValue, ((List)response.getSource().get(labelKey)).get(0));
    +    }
    +
    +    @Test
    +    public void testDeleteEntity() {
    +        testAddEntity();
    +        String labelKey = "label";
    +        String labelValue = "this is a sample Label";
    +        //Now Delete the entity
    +        classToTest.deleteEntity(entity("http://example/x3", labelKey, labelValue));
    +
    +        //Try to find it
    +        GetResponse response = transportClient.prepareGet(INDEX_NAME, DOC_TYPE, "http://example/x3").get();
    +        //It Should Exist
    +        Assert.assertTrue(response.isExists());
    +        //But the field value should now be empty
    +        Assert.assertEquals("http://example/x3", response.getId());
    +        Assert.assertTrue(response.getSource().containsKey(labelKey));
    +        Assert.assertEquals(0, ((List)response.getSource().get(labelKey)).size());
    +    }
    +
    +    @Test
    +    public void testDeleteWhenNoneExists() {
    +
    +        GetResponse response = transportClient.prepareGet(INDEX_NAME, DOC_TYPE, "http://example/x3").get();
    +        Assert.assertFalse(response.isExists());
    +        Assert.assertNotNull(classToTest);
    +        classToTest.deleteEntity(entity("http://example/x3", "label", "doesnt matter"));
    +        response = transportClient.prepareGet(INDEX_NAME, DOC_TYPE, "http://example/x3").get();
    +        Assert.assertFalse(response.isExists());
    +
    +    }
    +
    +    @Test
    +    public void testQuery() {
    +        testAddEntity();
    +        // This will search for value "this" across all the fields in all the documents
    +        List<TextHit> result =  classToTest.query(RDFS.label.asNode(), "this", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(1, result.size());
    +
    +        //This will search for value "this" only in the label field
    +        result =  classToTest.query(RDFS.label.asNode(), "label:this", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(1, result.size());
    +
    +        //This will search for value "this" in the label_en field, if it exists. In this case it doesn't, so we should get zero results
    +        result =  classToTest.query(RDFS.label.asNode(), "label:this AND lang:en", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(0, result.size());
    +
    +    }
    +
    +    @Test
    +    public void testQueryWhenNoneExists() {
    +        List<TextHit> result =  classToTest.query(RDFS.label.asNode(), "this", 1);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(0, result.size());
    +    }
    +
    +    @Test
    +    public void testGet() {
    +        testAddEntity();
    +        //Now Get the same entity
    +        Map<String, Node> response = classToTest.get("http://example/x3");
    +        Assert.assertNotNull(response);
    +        Assert.assertEquals(2, response.size());
    +    }
    +
    +    @Test
    +    public void testGetWhenNoneExists() {
    +        Map<String, Node> response = classToTest.get("http://example/x3");
    +        Assert.assertNotNull(response);
    +        Assert.assertEquals(0, response.size());
    +    }
    +
    +    /**
    +     * This is an elaborate test that does the following:
    +     * 1. Create a Document with ID: "http://example/x3" , label: Germany and lang:en
    +     * 2. Makes sure the document is created successfully and is searchable based on the label
    +     * 3. Next add another label to the same Entity with ID: "http://example/x3", label:Deutschland and lang:de
    +     * 4. Makes sure that the document is searchable both with old (Germany) and new (Deutschland) values.
    +     * 5. Next, it deletes the value: Germany created in step 1.
    +     * 6. Makes sure that document is searchable with value: Deutschland but NOT with value: Germany
    +     * 7. Finally, delete the value: Deutschland
    +     * 8. The document should not be searchable with value: Deutschland
    +     * 9. The document should still exist
    +     */
    +    @Test
    +    public void testMultipleValuesinMultipleLanguages() throws InterruptedException{
    +        addEntity(entity("http://example/x3", "label", "Germany", "en"));
    +        List<TextHit> result =  classToTest.query(RDFS.label.asNode(), "Germany", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(1, result.size());
    +        Assert.assertEquals("http://example/x3", result.get(0).getNode().getURI());
    +        //Next add another label to the same entity
    +        addEntity(entity("http://example/x3", "label", "Deutschland", "de"));
    +        //Query with old value
    +        result =  classToTest.query(RDFS.label.asNode(), "Germany", 10);
    +        Assert.assertEquals(1, result.size());
    +        Assert.assertEquals("http://example/x3", result.get(0).getNode().getURI());
    +
    +        //Query with new value
    +        result =  classToTest.query(RDFS.label.asNode(), "Deutschland", 10);
    +        Assert.assertEquals(1, result.size());
    +        Assert.assertEquals("http://example/x3", result.get(0).getNode().getURI());
    +
    +        //Now lets delete the German label
    --- End diff --
    
    Done
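
For readers following the thread, the nine steps in the Javadoc of testMultipleValuesinMultipleLanguages can be modelled without a running ElasticSearch node. The sketch below is a hypothetical in-memory stand-in, not the PR's TextIndexES: the class and method names are invented for illustration, matching is exact rather than full-text, and the only detail taken from the diff is that a value is stored under a per-language field named field + "_" + lang.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the index; NOT the PR's TextIndexES. */
final class MultiValueIndexSketch {

    // entity URI -> per-language field name (e.g. "label_en") -> values
    private final Map<String, Map<String, List<String>>> docs = new HashMap<>();

    /** Appends a value to the entity's document, creating document and field on demand. */
    void add(String uri, String field, String lang, String value) {
        docs.computeIfAbsent(uri, k -> new HashMap<>())
            .computeIfAbsent(field + "_" + lang, k -> new ArrayList<>())
            .add(value);
    }

    /** Removes one value; the document itself is kept even if it ends up with no values. */
    void remove(String uri, String field, String lang, String value) {
        Map<String, List<String>> doc = docs.get(uri);
        if (doc == null) {
            return;
        }
        List<String> values = doc.get(field + "_" + lang);
        if (values != null) {
            values.remove(value);
        }
    }

    /** Returns URIs whose document holds the exact value under the field in any language. */
    List<String> query(String field, String value) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, List<String>>> doc : docs.entrySet()) {
            for (Map.Entry<String, List<String>> f : doc.getValue().entrySet()) {
                if (f.getKey().startsWith(field + "_") && f.getValue().contains(value)) {
                    hits.add(doc.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    boolean exists(String uri) {
        return docs.containsKey(uri);
    }

    public static void main(String[] args) {
        MultiValueIndexSketch index = new MultiValueIndexSketch();
        String uri = "http://example/x3";
        index.add(uri, "label", "en", "Germany");          // step 1
        index.add(uri, "label", "de", "Deutschland");      // step 3
        System.out.println(index.query("label", "Germany"));
        System.out.println(index.query("label", "Deutschland"));
        index.remove(uri, "label", "en", "Germany");       // step 5
        System.out.println(index.query("label", "Germany"));
        index.remove(uri, "label", "de", "Deutschland");   // step 7
        System.out.println(index.query("label", "Deutschland"));
        System.out.println(index.exists(uri));             // step 9
    }
}
```

The point of the model is the last step: removing every value leaves an empty document behind rather than deleting it, which is the behaviour the test asserts.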



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107672655
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,94 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.atlas.lib.StrUtils;
    +import org.apache.jena.query.*;
    +import org.apache.jena.sparql.util.QueryExecUtils;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition we cant have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    +
    +    public static void main(String[] args) {
    +
    +        queryData(loadData(createAssembler()));
    +    }
    +
    +
    +    private static Dataset createAssembler() {
    +        String assemblerFile = "text-config-es.ttl";
    +        Dataset ds = DatasetFactory.assemble(assemblerFile,
    +                "http://localhost/jena_example/#text_dataset") ;
    +        return ds;
    +    }
    +
    +    private static Dataset loadData(Dataset ds) {
    +        JenaTextExample1.loadData(ds, "data-es.ttl");
    +        return ds;
    +    }
    +
    +    /**
    +     * Query Data
    +     * @param ds
    +     */
    +    private static void queryData(Dataset ds) {
    +//        JenaTextExample1.queryData(ds);
    --- End diff --
    
    Please remove unnecessary lines instead of commenting them out, unless there is a good reason to keep them.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107681126
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexESAssembler.java ---
    @@ -0,0 +1,129 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text.assembler;
    +
    +import org.apache.jena.assembler.Assembler;
    +import org.apache.jena.assembler.Mode;
    +import org.apache.jena.assembler.assemblers.AssemblerBase;
    +import org.apache.jena.query.text.*;
    +import org.apache.jena.rdf.model.RDFNode;
    +import org.apache.jena.rdf.model.Resource;
    +import org.apache.jena.rdf.model.Statement;
    +import org.apache.jena.sparql.util.graph.GraphUtils;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import static org.apache.jena.query.text.assembler.TextVocab.*;
    +
    +public class TextIndexESAssembler extends AssemblerBase {
    +
    +    private static Logger LOGGER      = LoggerFactory.getLogger(TextIndexESAssembler.class) ;
    +
    +    protected static final String COMMA = ",";
    +    protected static final String COLON = ":";
    +    /*
    +    <#index> a :TextIndexES ;
    +        text:serverList "127.0.0.1:9300,127.0.0.2:9400,127.0.0.3:9500" ; #Comma separated list of hosts:ports
    +        text:clusterName "elasticsearch"
    +        text:shards "1"
    +        text:replicas "1"
    +        text:entityMap <#endMap> ;
    +        .
    +    */
    +    
    +    @SuppressWarnings("resource")
    +    @Override
    +    public TextIndex open(Assembler a, Resource root, Mode mode) {
    +        try {
    +            String listOfHostsAndPorts = GraphUtils.getAsStringValue(root, pServerList) ;
    +            if(listOfHostsAndPorts == null || listOfHostsAndPorts.isEmpty()) {
    +                throw new TextIndexException("Mandatory property text:serverList (containing the comma-separated list of host:port) property is not specified. " +
    +                        "An example value for the property: 127.0.0.1:9300");
    +            }
    +            String[] hosts = listOfHostsAndPorts.split(COMMA);
    +            Map<String,Integer> hostAndPortMapping = new HashMap<>();
    +            for(String host : hosts) {
    +                String[] hostAndPort = host.split(COLON);
    +                if(hostAndPort.length < 2) {
    +                    LOGGER.error("Either the host or the port value is missing.Please specify the property in host:port format. " +
    +                            "Both parts are mandatory. Ignoring this value. Moving to the next one.");
    +                    continue;
    +                }
    +                hostAndPortMapping.put(hostAndPort[0], Integer.valueOf(hostAndPort[1]));
    +            }
    +
    +            String clusterName = GraphUtils.getAsStringValue(root, pClusterName);
    +            if(clusterName == null || clusterName.isEmpty()) {
    +                LOGGER.warn("ClusterName property is not specified. Defaulting to 'elasticsearch'");
    +                clusterName = "elasticsearch";
    +            }
    +
    +            String numberOfShards = GraphUtils.getAsStringValue(root, pShards);
    +            if(numberOfShards == null || numberOfShards.isEmpty()) {
    +                LOGGER.warn("shards property is not specified. Defaulting to '1'");
    +                numberOfShards = "1";
    +            }
    +
    +            String replicationFactor = GraphUtils.getAsStringValue(root, pReplicas);
    +            if(replicationFactor == null || replicationFactor.isEmpty()) {
    +                LOGGER.warn("replicas property is not specified. Defaulting to '1'");
    +                replicationFactor = "1";
    +            }
    +
    +            String indexName = GraphUtils.getAsStringValue(root, pIndexName);
    +            if(indexName == null || indexName.isEmpty()) {
    +                LOGGER.warn("index Name property is not specified. Defaulting to 'jena-text'");
    +                indexName = "jena-text";
    +            }
    +
    +            boolean isMultilingualSupport = false;
    +            Statement mlSupportStatement = root.getProperty(pMultilingualSupport);
    +            if (null != mlSupportStatement) {
    +                RDFNode mlsNode = mlSupportStatement.getObject();
    +                if (! mlsNode.isLiteral()) {
    +                    throw new TextIndexException("text:multilingualSupport property must be a string : " + mlsNode);
    +                }
    +                isMultilingualSupport = mlsNode.asLiteral().getBoolean();
    +            }
    +
    +
    +
    +            Resource r = GraphUtils.getResourceValue(root, pEntityMap) ;
    +            EntityDefinition docDef = (EntityDefinition)a.open(r) ;
    +            TextIndexConfig config = new TextIndexConfig(docDef);
    +            config.setMultilingualSupport(isMultilingualSupport);
    --- End diff --
    
    Done
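
For context, the assembler in the diff above reads text:serverList as a comma-separated list of host:port pairs. The following is a minimal, self-contained sketch of that parsing step under stated assumptions: ServerListParser and parse are hypothetical names, not part of the PR. It also differs from the diff in one respect: an entry with a non-numeric or out-of-range port is skipped, whereas the Integer.valueOf call in the diff would throw a NumberFormatException for a non-numeric port.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, NOT part of the PR: parses a text:serverList value. */
final class ServerListParser {

    /**
     * Parses "host1:port1,host2:port2" into an ordered host -> port map.
     * Entries that are not in host:port form, or whose port is not a number
     * in the range 1..65535, are skipped instead of aborting the whole parse.
     * IPv6 literals (which contain extra colons) are not handled and are skipped.
     */
    static Map<String, Integer> parse(String serverList) {
        if (serverList == null || serverList.trim().isEmpty()) {
            throw new IllegalArgumentException("serverList must not be empty");
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String entry : serverList.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2 || parts[0].isEmpty()) {
                continue; // malformed entry: host or port missing
            }
            try {
                int port = Integer.parseInt(parts[1].trim());
                if (port >= 1 && port <= 65535) {
                    result.put(parts[0], port);
                }
            } catch (NumberFormatException e) {
                // non-numeric port: skip this entry
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("127.0.0.1:9300, 127.0.0.2:9400,badentry,host:xyz"));
    }
}
```

Whether a bad entry should be skipped or should fail the whole configuration is a design choice; the diff logs an error and continues for a missing part, so the sketch follows the same lenient approach for bad ports.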



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107701838
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +141,73 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
    +            <!--<excludedGroups>org.apache.jena.query.text.IntegrationTest</excludedGroups>-->
             </configuration>
    +          <executions>
    +              <execution>
    +                  <id>unit-tests</id>
    +                  <phase>test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/TS_*.java</include>
    +                      </includes>
    +                      <excludes>
    +                          <exclude>**/*IT.java</exclude>
    +                      </excludes>
    +                  </configuration>
    +              </execution>
    +              <execution>
    +                  <id>integration-tests</id>
    +                  <phase>integration-test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/*IT.java</include>
    +                      </includes>
    +                  </configuration>
    +              </execution>
    +          </executions>
           </plugin>
    +        <plugin>
    +            <groupId>com.github.alexcojocaru</groupId>
    +            <artifactId>elasticsearch-maven-plugin</artifactId>
    +            <!-- REPLACE THE FOLLOWING WITH THE PLUGIN VERSION YOU NEED -->
    +            <version>5.2</version>
    +            <configuration>
    +                <clusterName>elasticsearch</clusterName>
    +                <tcpPort>9300</tcpPort>
    --- End diff --
    
    Just found a bug in the Maven ES plugin: the TCP port always defaults to 9300, regardless of the value specified in the configuration. I am therefore reverting the TCP port to 9300.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106142321
  
    --- Diff: jena-text/src/main/java/examples/JenaTextExample1.java ---
    @@ -41,9 +41,9 @@
         
         public static void main(String ... argv)
         {
    -        Dataset ds = createCode() ;
    -        //Dataset ds = createAssembler() ;
    -        loadData(ds , "data.ttl") ;
    +//        Dataset ds = createCode() ;
    +        Dataset ds = createAssembler() ;
    --- End diff --
    
    You have changed the existing JenaTextExample1.java, which was an example of how to use jena-text with Lucene. Make a copy instead and leave the original intact, since we need an example of how to configure a Lucene index from code. The new class could be called JenaTextESExample1 or similar.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r108438002
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +138,72 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
             </configuration>
    +          <executions>
    +              <execution>
    +                  <id>unit-tests</id>
    +                  <phase>test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/TS_*.java</include>
    +                      </includes>
    +                      <excludes>
    +                          <exclude>**/*IT.java</exclude>
    +                      </excludes>
    +                  </configuration>
    +              </execution>
    +              <execution>
    +                  <id>integration-tests</id>
    +                  <phase>integration-test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/*IT.java</include>
    +                      </includes>
    +                  </configuration>
    +              </execution>
    +          </executions>
           </plugin>
    +        <plugin>
    +            <groupId>com.github.alexcojocaru</groupId>
    +            <artifactId>elasticsearch-maven-plugin</artifactId>
    +            <!-- REPLACE THE FOLLOWING WITH THE PLUGIN VERSION YOU NEED -->
    --- End diff --
    
    What does this comment mean?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106147746
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exists. so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First Search of the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +            //This means field already exists and therefore we should remove from it
    +            script = "ctx._source." + fieldToRemove+".remove('"+ valueToRemove + "')";
    +        }
    +
    +        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +//        client.prepareDelete(INDEX_NAME, docDef.getEntityField(), entity.getId()).get();
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(INDEX_NAME, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                for (String field: docDef.fields()) {
    +
    +                    GetField fieldResponse = response.getField(field);
    +
    +                    if(fieldResponse == null || fieldResponse.getValue() == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    if(fieldResponse instanceof List<?>) {
    +                        //We are only interested in literal values
    +                        continue;
    +                    }
    +                    //We assume it will always be a String value.
    +                    String fieldValue = (String)fieldResponse.getValue();
    +                    Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    --- End diff --
    
    Would it be possible to return language-tagged values here?
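
One way this could work, sketched under assumptions: since `addEntity` in this diff stores multilingual values under `field + "_" + language`, the language tag can be recovered from the indexed field name. The class and method names below (`LangFieldSketch`, `splitLang`) are hypothetical and not part of the pull request.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public class LangFieldSketch {

    /**
     * Splits an indexed field name such as "label_en" into (base field, language tag).
     * The tag is empty when the name carries no "_<lang>" suffix for the given base field.
     */
    static Map.Entry<String, String> splitLang(String indexedField, String baseField) {
        String prefix = baseField + "_";
        if (indexedField.startsWith(prefix) && indexedField.length() > prefix.length()) {
            return new SimpleImmutableEntry<>(baseField, indexedField.substring(prefix.length()));
        }
        return new SimpleImmutableEntry<>(indexedField, "");
    }

    public static void main(String[] args) {
        Map.Entry<String, String> tagged = splitLang("label_en", "label");
        System.out.println(tagged.getKey() + " @" + tagged.getValue());
    }
}
```

The recovered tag could then presumably be passed as the second argument of `NodeFactoryExtra.createLiteralNode`, which the diff currently sets to `null`.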



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106404596
  
    --- Diff: jena-text/testing/TextQuery/text-config.ttl ---
    @@ -31,6 +31,7 @@ text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
     
     <#indexLucene> a text:TextIndexLucene ;
         text:directory "mem" ;
    +    text:multilingualSupport true ;
    --- End diff --
    
    Is this change necessary?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106141294
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,39 +81,50 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    -    <!-- Solr client -->
    -    <dependency>
    -      <artifactId>solr-solrj</artifactId>
    -      <groupId>org.apache.solr</groupId>
    -    </dependency>
    -
    -    <!-- Embedded server if used for testing
    -    <dependency>
    -      <artifactId>solr-core</artifactId>
    -      <groupId>org.apache.solr</groupId>
    -      <version>${ver.solr}</version>
    -      <type>jar</type>
    -      <scope>test</scope>
    -      <optional>true</optional>
    -      <exclusions>
    -        <exclusion>
    -          <groupId>org.slf4j</groupId>
    -          <artifactId>slf4j-api</artifactId>
    -        </exclusion>
    -        <exclusion>
    -          <groupId>org.slf4j</groupId>
    -          <artifactId>slf4j-jdk14</artifactId>
    -        </exclusion>
    -      </exclusions>
    -    </dependency>
    -
    -    <dependency>
    -      <groupId>javax.servlet</groupId>
    -      <artifactId>servlet-api</artifactId>
    -      <version>2.5</version>
    -      <scope>test</scope>
    -    </dependency>
    -    -->
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.apache.lucene</groupId>
    +          <artifactId>lucene-test-framework</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.test</groupId>
    +          <artifactId>framework</artifactId>
    +      </dependency>
    +
    +      <!-- This is required to by pass ES JAR Hell in test environment-->
    +      <dependency>
    +          <groupId>junit</groupId>
    +          <artifactId>junit</artifactId>
    +          <exclusions>
    +              <exclusion>
    +                  <groupId>org.hamcrest</groupId>
    +                  <artifactId>hamcrest-core</artifactId>
    +              </exclusion>
    +          </exclusions>
    +      </dependency>
    +
    +      <dependency>
    --- End diff --
    
    The log4j dependency is specified twice; one should be enough!



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106144352
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    --- End diff --
    
    A comment clarifying that these are ES parameter names and not the values themselves would be helpful here. Could also consider renaming to, e.g., CLUSTER_NAME_PARAM.
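
A minimal sketch of the suggested renaming, for illustration only. `CLUSTER_NAME_PARAM` comes from the comment above; the `_PARAM` suffix on the other two constants, the class name `EsParamNames`, and the example values are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsParamNames {

    // Names (keys) of Elasticsearch settings, not their configured values.
    static final String CLUSTER_NAME_PARAM = "cluster.name";
    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";

    // Pairs each setting name with an example value.
    static Map<String, Object> exampleSettings() {
        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put(CLUSTER_NAME_PARAM, "elasticsearch");
        settings.put(NUM_OF_SHARDS_PARAM, 1);
        settings.put(NUM_OF_REPLICAS_PARAM, 1);
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(exampleSettings());
    }
}
```

With the suffix, a call such as `settings.put(CLUSTER_NAME_PARAM, clusterName)` reads as key and value rather than as two competing names.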



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106143930
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextAssembler.java ---
    @@ -29,14 +29,15 @@ public static void init()
             AssemblerUtils.registerDataset(TextVocab.textDataset,      new TextDatasetAssembler()) ;
             
             Assembler.general.implementWith(TextVocab.entityMap,        new EntityDefinitionAssembler()) ;
    -        Assembler.general.implementWith(TextVocab.textIndexSolr,    new TextIndexSolrAssembler()) ; 
             Assembler.general.implementWith(TextVocab.textIndexLucene,  new TextIndexLuceneAssembler()) ;
             Assembler.general.implementWith(TextVocab.standardAnalyzer, new StandardAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.simpleAnalyzer,   new SimpleAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.keywordAnalyzer,  new KeywordAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.lowerCaseKeywordAnalyzer, new LowerCaseKeywordAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.localizedAnalyzer, new LocalizedAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.configurableAnalyzer, new ConfigurableAnalyzerAssembler()) ;
    +        Assembler.general.implementWith(TextVocab.textIndexES,  new TextIndexESAssembler()) ;
    --- End diff --
    
    Minor style issue, but please move this next to TextIndexLuceneAssembler a few lines up, because they're similar.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma Can you review now? I have made the changes to the Add and Delete API so that they are executed as a single REST call. I think we already have consensus on the Multilingual aspect. If not, please let me know and we can have a discussion around it.
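
For readers following along, one way to fold the earlier "search, then update" pair into a single update is a script that handles both the missing-field and existing-field cases itself. The sketch below is an assumption about how that could look, not the code actually pushed to the pull request. It builds a Painless script source and a parameter map; the names `UpsertScriptSketch`, `APPEND_OR_CREATE`, and `scriptParams` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class UpsertScriptSketch {

    // Constant script source: create the list if the field is absent, otherwise append.
    // Field name and value are read from script parameters instead of being concatenated in.
    static final String APPEND_OR_CREATE =
            "if (ctx._source[params.field] == null) { ctx._source[params.field] = [params.value]; } "
          + "else { ctx._source[params.field].add(params.value); }";

    // Builds the parameter map that would accompany the script in the update request.
    static Map<String, Object> scriptParams(String field, String value) {
        Map<String, Object> params = new HashMap<>();
        params.put("field", field);
        params.put("value", value);
        return params;
    }

    public static void main(String[] args) {
        System.out.println(APPEND_OR_CREATE);
        System.out.println(scriptParams("label_en", "O'Reilly"));
    }
}
```

Keeping values in parameters also means a literal containing a single quote does not break the script text, which the string-concatenated scripts in the quoted diff would not tolerate.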



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I asked for advice about what to do with those methods on the `dev` list: http://jena.markmail.org/thread/pgeigsya7f5xda6h



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106237495
  
    --- Diff: jena-text/src/main/resources/text-config-es.ttl ---
    @@ -0,0 +1,65 @@
    +    # Licensed to the Apache Software Foundation (ASF) under one
    +    # or more contributor license agreements.  See the NOTICE file
    +    # distributed with this work for additional information
    +    # regarding copyright ownership.  The ASF licenses this file
    +    # to you under the Apache License, Version 2.0 (the
    +    # "License"); you may not use this file except in compliance
    +    # with the License.  You may obtain a copy of the License at
    +    #
    +    #     http://www.apache.org/licenses/LICENSE-2.0
    +    #
    +    # Unless required by applicable law or agreed to in writing, software
    +    # distributed under the License is distributed on an "AS IS" BASIS,
    +    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    # See the License for the specific language governing permissions and
    +    # limitations under the License.
    +
    + ## Example of a TDB dataset and text index for ElasticSearch
    +
    +@prefix :        <http://localhost/jena_example/#> .
    +@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    +@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    +@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    +@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
    +@prefix text:    <http://jena.apache.org/text#> .
    +
    +# TDB
    +[] ja:loadClass "org.apache.jena.tdb.TDB" .
    +tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    +tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    +
    +# Text
    +[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
    +text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    +text:TextIndexES      rdfs:subClassOf   text:TextIndex .
    +
    +## ---------------------------------------------------------------
    +## This URI must be fixed - it's used to assemble the text dataset.
    +
    +:text_dataset rdf:type     text:TextDataset ;
    +    text:dataset   <#dataset> ;
    +    text:index     <#indexES> ;
    +    .
    +
    +<#dataset> rdf:type      tdb:DatasetTDB ;
    +    tdb:location "--mem--" ;
    +    .
    +
    +<#indexES> a text:TextIndexES ;
    +    text:serverList "127.0.0.1:9300" ; # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
    +    text:clusterName "elasticsearch" ; # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
    +    text:shards "1" ;                  # The number of shards for the index. Defaults to 1
    +    text:replicas "1" ;                # The number of replicas for the index. Defaults to 1
    +    text:indexName "jena-text" ;       # Name of the Index. defaults to jena-text
    +    text:multilingualSupport true ;
    +    text:entityMap <#entMap> ;
    +    .
    +
    +<#entMap> a text:EntityMap ;
    +    text:entityField      "intel" ; # Defines the Document Type in the ES Index
    --- End diff --
    
    I see, thanks for the explanation!



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107680990
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,6 +81,35 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <!-- This is required to by pass ES JAR Hell in test environment-->
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106156573
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exists. so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First Search of the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +            //This means field already exists and therefore we should remove from it
    +            script = "ctx._source." + fieldToRemove+".remove('"+ valueToRemove + "')";
    +        }
    +
    +        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +//        client.prepareDelete(INDEX_NAME, docDef.getEntityField(), entity.getId()).get();
    --- End diff --
    
    Done



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    This is the error I am getting:
    ```
    Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 11.253 sec <<< FAILURE! - in org.apache.jena.query.text.it.TextIndexESIT
    org.apache.jena.query.text.it.TextIndexESIT  Time elapsed: 11.253 sec  <<< ERROR!
    java.lang.IllegalStateException: running tests but failed to invoke RandomizedContext#getRandom
    Caused by: java.lang.reflect.InvocationTargetException
    Caused by: java.lang.IllegalStateException: No context information for thread: Thread[id=1, name=main, state=RUNNABLE, group=main]. Is this thread running under a class com.carrotsearch.randomizedtesting.RandomizedRunner runner context? Add @RunWith(class com.carrotsearch.randomizedtesting.RandomizedRunner.class) to your test class. Make sure your code accesses random contexts within @BeforeClass and @AfterClass boundary (for example, static test class initializers are not permitted to access random contexts).
    ```
    
    To rule out any local interference on my side, I have checked in a simple test based on the ES Maven Plugin. @osma, can you try it out on your side and see whether you get the same error as I do?




[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106402651
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,64 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.query.Dataset;
    +import org.apache.jena.query.DatasetFactory;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition we cant have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    +
    +    public static void main(String[] args) {
    +
    +        queryData(loadData(createAssembler()));
    +    }
    +
    +
    +    private static Dataset createAssembler() {
    +        String assemblerFile = "text-config-es.ttl";
    +        Dataset ds = DatasetFactory.assemble(assemblerFile,
    +                "http://localhost/jena_example/#text_dataset") ;
    +        return ds;
    +    }
    +
    +    private static Dataset loadData(Dataset ds) {
    +        JenaTextExample1.loadData(ds, "data-es.ttl");
    +        return ds;
    +    }
    +
    +    /**
    +     * Query Data
    +     * @param ds
    +     */
    +    private static void queryData(Dataset ds) {
    +        JenaTextExample1.queryData(ds);
    --- End diff --
    
    This calls into JenaTextExample1.java which, after you reverted it, uses Lucene. I think you need to have an ES version of that class as well, something like JenaTextESExample1.java, and call that instead here.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106146962
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exist, so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First Search of the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +            //This means field already exists and therefore we should remove from it
    +            script = "ctx._source." + fieldToRemove+".remove('"+ valueToRemove + "')";
    --- End diff --
    
    Am I right that this removes the individual field value, but if it happens to be the last value of the last field for that entity, the ES document for the entity is still left in the index? If so, the entity would never be deleted completely.
    I don't think that's too bad, but it is something to be aware of, since it may lead to slight index growth over time. The Lucene index does clean up documents completely after values are removed, but it works differently because it operates at the triple/quad level.
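    
    To illustrate the cleanup in question, here is a minimal in-memory sketch. It makes no ElasticSearch calls: the index is modelled as a plain map from entity id to fields, and the class and method names are hypothetical, not part of the PR. The idea is that after removing a value, the document is dropped once every field is empty.
    
    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    
    public class EmptyDocumentCleanupSketch {
    
        // Adds one value to a field of the document stored under the entity id.
        static void addValue(Map<String, Map<String, List<String>>> index,
                             String id, String field, String value) {
            index.computeIfAbsent(id, k -> new HashMap<>())
                 .computeIfAbsent(field, k -> new ArrayList<>())
                 .add(value);
        }
    
        // Removes one value from a field. If no field of the document holds any
        // value afterwards, the whole document is removed from the index.
        // Returns true when the document was removed.
        static boolean removeValue(Map<String, Map<String, List<String>>> index,
                                   String id, String field, String value) {
            Map<String, List<String>> doc = index.get(id);
            if (doc == null) {
                return false;
            }
            List<String> values = doc.get(field);
            if (values != null) {
                values.remove(value);
            }
            boolean allEmpty = doc.values().stream().allMatch(List::isEmpty);
            if (allEmpty) {
                index.remove(id);
            }
            return allEmpty;
        }
    
        public static void main(String[] args) {
            Map<String, Map<String, List<String>>> index = new HashMap<>();
            addValue(index, "http://example/c", "label", "first");
            addValue(index, "http://example/c", "label", "second");
            System.out.println(removeValue(index, "http://example/c", "label", "first"));  // false
            System.out.println(removeValue(index, "http://example/c", "label", "second")); // true
            System.out.println(index.containsKey("http://example/c"));                     // false
        }
    }
    ```
    
    In the ES implementation the equivalent step would presumably be a follow-up check after the scripted update, at the cost of an extra round trip per delete.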



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106451444
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    --- End diff --
    
    Also, what happens if the value contains a single quote, e.g. `:c rdfs:label "it's complicated"`?
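    
    A minimal sketch of the concern, using plain Java string handling only (no ElasticSearch calls). The class and method names are hypothetical. The template is copied from the diff above, with `replace` in place of `replaceAll` so that regex metacharacters in the value do not interfere. It shows that a value containing a single quote ends the script's string literal early. The parameterised variant is an assumption: it relies on the scripting language exposing a `params` map, as Painless does, and has not been run against this PR.
    
    ```java
    import java.util.HashMap;
    import java.util.Map;
    
    public class ScriptQuotingSketch {
    
        // Template copied from the diff: the value is pasted between single quotes.
        static final String TEMPLATE =
                "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) "
              + "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    
        static String inlineScript(String fieldName, String fieldValue) {
            return TEMPLATE.replace("<fieldName>", fieldName).replace("<fieldValue>", fieldValue);
        }
    
        // Returns the text between the first two single quotes, i.e. what a
        // script parser would take to be the first string literal.
        static String firstLiteral(String script) {
            int open = script.indexOf('\'');
            int close = script.indexOf('\'', open + 1);
            return script.substring(open + 1, close);
        }
    
        // Assumed alternative: a constant script whose values arrive as parameters,
        // so nothing from the RDF literal is ever spliced into the script source.
        static final String PARAM_SCRIPT =
                "if(ctx._source[params.field] == null || ctx._source[params.field].empty) "
              + "{ctx._source[params.field]=[params.value] } else {ctx._source[params.field].add(params.value)}";
    
        static Map<String, Object> scriptParams(String fieldName, String fieldValue) {
            Map<String, Object> params = new HashMap<>();
            params.put("field", fieldName);
            params.put("value", fieldValue);
            return params;
        }
    
        public static void main(String[] args) {
            System.out.println(firstLiteral(inlineScript("label", "plain")));            // plain
            System.out.println(firstLiteral(inlineScript("label", "it's complicated"))); // it
            System.out.println(scriptParams("label", "it's complicated").get("value"));  // it's complicated
        }
    }
    ```
    
    With the inlined form the literal is cut off at `it`, so the rest of the value would be parsed as script source. Passing the value as a parameter avoids quoting altogether.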


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106146566
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exists. so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First Search of the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +            //This means field already exists and therefore we should remove from it
    +            script = "ctx._source." + fieldToRemove+".remove('"+ valueToRemove + "')";
    +        }
    +
    +        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +//        client.prepareDelete(INDEX_NAME, docDef.getEntityField(), entity.getId()).get();
    --- End diff --
    
    Please remove commented-out lines like this unless they have a clear purpose and explanation.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107437003
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +141,77 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
    +
    +            <!-- Required to bypass Embedded ES security checks, especially JAR Hell-->
    +            <argLine>-Dtests.security.manager=false</argLine>
    --- End diff --
    
    Good catch. I will remove it



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106403179
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    --- End diff --
    
    I don't think that the distinction between multilingual and non-multilingual mode makes sense for the ES backend. You are using a document-per-entity model that requires you to also track language tags (using separate fields), and for that to work properly, what you currently call multilingual mode is necessary (otherwise, you cannot handle the "remove one of the Berlin labels but not all of them" case). Also, this is still different from the multilingual mode in the Lucene backend, which additionally uses language-specific analyzers.
    
    I suggest that you remove the `isMultilingual` attribute and change the code so that it always works as if it were true.
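
A minimal sketch of the "Berlin labels" case, assuming plain Java collections as a stand-in for the per-entity ES document. `LangFieldSketch` and its methods are hypothetical; only the `field + "_" + lang` naming rule is taken from the diff. With one field per language tag, removing the English label leaves the German one in place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LangFieldSketch {
    // One document per entity: field name -> list of values.
    final Map<String, List<String>> doc = new HashMap<>();

    // Same naming rule as the diff: append "_" + lang when a language tag is present.
    static String fieldName(String field, String lang) {
        return (lang == null || lang.isEmpty()) ? field : field + "_" + lang;
    }

    void add(String field, String lang, String value) {
        doc.computeIfAbsent(fieldName(field, lang), k -> new ArrayList<>()).add(value);
    }

    // Removes one occurrence of the value from the language-specific field only.
    void remove(String field, String lang, String value) {
        List<String> values = doc.get(fieldName(field, lang));
        if (values != null) {
            values.remove(value);
        }
    }

    public static void main(String[] args) {
        LangFieldSketch s = new LangFieldSketch();
        s.add("label", "en", "Berlin");
        s.add("label", "de", "Berlin");
        s.remove("label", "en", "Berlin");
        // label_de still holds "Berlin"; label_en is now empty.
        System.out.println(s.doc);
    }
}
```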



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106404542
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestTextIndexES.java ---
    @@ -0,0 +1,184 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.test.ESIntegTestCase;
    +import org.junit.Assert;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + *
    + * Integration test for {@link TextIndexES} class
    + * ES Integration test depends on security policies that may sometime not be loaded properly.
    + * If you find any issues regarding security set the following VM argument to resolve the issue:
    + * -Dtests.security.manager=false
    + *
    + */
    +@ESIntegTestCase.ClusterScope()
    +public class TestTextIndexES extends ESIntegTestCase {
    --- End diff --
    
    I still think this is a problem, could you take a closer look? We can also ask advice on the `dev` list.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106401743
  
    --- Diff: jena-parent/pom.xml ---
    @@ -275,6 +276,75 @@
             <version>${ver.spatial4j}</version>
           </dependency>
     
    +      <!-- ES dependencies-->
    +      <dependency>
    +        <groupId>org.elasticsearch</groupId>
    +        <artifactId>elasticsearch</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +        <exclusions>
    +          <exclusion>
    +            <groupId>commons-logging</groupId>
    +            <artifactId>commons-logging</artifactId>
    +          </exclusion>
    +          <exclusion>
    +            <groupId>org.hamcrest</groupId>
    +            <artifactId>hamcrest-core</artifactId>
    +          </exclusion>
    +        </exclusions>
    +
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.elasticsearch.client</groupId>
    +        <artifactId>transport</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +        <exclusions>
    +          <exclusion>
    +            <groupId>commons-logging</groupId>
    +            <artifactId>commons-logging</artifactId>
    +          </exclusion>
    +          <exclusion>
    +            <groupId>org.hamcrest</groupId>
    +            <artifactId>hamcrest-core</artifactId>
    +          </exclusion>
    +        </exclusions>
    +      </dependency>
    +
    +
    +      <dependency>
    --- End diff --
    
    Is this dependency really needed? I commented it out in both the jena-parent and jena-text pom files and the build and tests still worked fine.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106147470
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exists. so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First Search of the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +            //This means field already exists and therefore we should remove from it
    +            script = "ctx._source." + fieldToRemove+".remove('"+ valueToRemove + "')";
    +        }
    +
    +        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +//        client.prepareDelete(INDEX_NAME, docDef.getEntityField(), entity.getId()).get();
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(INDEX_NAME, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                for (String field: docDef.fields()) {
    +
    +                    GetField fieldResponse = response.getField(field);
    +
    +                    if(fieldResponse == null || fieldResponse.getValue() == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    if(fieldResponse instanceof List<?>) {
    +                        //We are only interested in literal values
    +                        continue;
    +                    }
    +                    //We assume it will always be a String value.
    +                    String fieldValue = (String)fieldResponse.getValue();
    +                    Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                    result.put(field, fieldNode);
    +
    +                }
    +
    +
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, 0);
    --- End diff --
    
    The query limit defaults to 0 here. I think this will return zero results, though I'm not sure whether ES treats 0 as a special value. jena-text/Lucene has a constant MAX_N (defined as 10000), which is used as the default in this situation.
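
    A minimal sketch of such a fallback, assuming a hypothetical helper class `QueryLimits` that is not part of this PR. Only the `MAX_N` value of 10000 comes from the jena-text/Lucene constant mentioned above; treating every non-positive limit as "no limit given" is an assumption made for illustration:

```java
// Hypothetical helper, not part of the PR: maps a missing limit to a default cap.
final class QueryLimits {
    // Same value as the MAX_N constant mentioned above for jena-text/Lucene.
    static final int MAX_N = 10000;

    private QueryLimits() {}

    // Assumption: a limit of 0 or less means the caller did not ask for a specific limit.
    static int effectiveLimit(int limit) {
        return limit <= 0 ? MAX_N : limit;
    }

    public static void main(String[] args) {
        System.out.println(effectiveLimit(0));
        System.out.println(effectiveLimit(25));
    }
}
```

    With something along these lines, the two-argument `query(property, qs)` could pass `MAX_N` instead of 0, or the three-argument overload could normalise the limit before building the ES request.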



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @anujgandharv Reading the documentation of the ES Maven plugin, no, I don't think it's a wrapper around embedded Elasticsearch. Here is a project that appears to use it: https://github.com/dadoonet/spring-elasticsearch
    
    Regardless of how the tests are executed, I think we need tests for the more difficult cases, such as removing one `Berlin` label but leaving others. It is quite possible that doing so will uncover bugs in the implementation. In any case, tests guard against subtle changes further down the line, for example when jena-text is updated or the Elasticsearch dependency is upgraded to a new version.
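
    To make the `Berlin` case concrete, here is a rough sketch of the behaviour such a test would need to assert. It uses an in-memory stand-in (the hypothetical class `LabelDocModel`) rather than a real Elasticsearch index, and the field names `label_en` and `label_de` are assumptions that follow the `field + "_" + language` naming seen in the diff:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for one indexed document: field name -> list of values.
// It only models the expected behaviour and does not talk to Elasticsearch.
final class LabelDocModel {
    private final Map<String, List<String>> fields = new HashMap<>();

    void add(String field, String value) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    // Removes a single occurrence of value from field; other fields and values stay.
    boolean remove(String field, String value) {
        List<String> values = fields.get(field);
        return values != null && values.remove(value);
    }

    List<String> get(String field) {
        return fields.getOrDefault(field, new ArrayList<>());
    }

    public static void main(String[] args) {
        LabelDocModel doc = new LabelDocModel();
        doc.add("label_en", "Berlin");
        doc.add("label_de", "Berlin");
        doc.add("label_en", "Capital of Germany");

        // Remove only the English "Berlin" label.
        doc.remove("label_en", "Berlin");

        System.out.println(doc.get("label_en"));
        System.out.println(doc.get("label_de"));
    }
}
```

    An integration test against the real index would make the same assertions after calling `deleteEntity`: the removed label is gone, while the German label and the other English value are still found.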
    
    Thanks for incorporating the comments. I will do a new review of the code and take a closer look at the current tests.
    
    I think this is looking very close to merging. Do others have comments on the code? @ajs6f? @afs?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106154483
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    --- End diff --
    
    Renamed so that each of them ends with _PARAM.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107681019
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +141,73 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
    +            <!--<excludedGroups>org.apache.jena.query.text.IntegrationTest</excludedGroups>-->
             </configuration>
    +          <executions>
    +              <execution>
    +                  <id>unit-tests</id>
    +                  <phase>test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/TS_*.java</include>
    +                      </includes>
    +                      <excludes>
    +                          <exclude>**/*IT.java</exclude>
    +                      </excludes>
    +                  </configuration>
    +              </execution>
    +              <execution>
    +                  <id>integration-tests</id>
    +                  <phase>integration-test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/*IT.java</include>
    +                      </includes>
    +                  </configuration>
    +              </execution>
    +          </executions>
           </plugin>
    +        <plugin>
    +            <groupId>com.github.alexcojocaru</groupId>
    +            <artifactId>elasticsearch-maven-plugin</artifactId>
    +            <!-- REPLACE THE FOLLOWING WITH THE PLUGIN VERSION YOU NEED -->
    +            <version>5.2</version>
    +            <configuration>
    +                <clusterName>elasticsearch</clusterName>
    +                <tcpPort>9300</tcpPort>
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107671430
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,435 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.commons.lang3.exception.ExceptionUtils;
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.engine.DocumentMissingException;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'jena-text'
    +     */
    +    private final String indexName;
    +
    +    /**
    +     * The parameter representing the cluster name key
    +     */
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    /**
    +     * The parameter representing the number of shards key
    +     */
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    /**
    +     * The parameter representing the number of replicas key
    +     */
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    private static final String DASH = "-";
    +
    +    private static final String UNDERSCORE = "_";
    +
    +    private static final String COLON = ":";
    +
    +    private static final String ASTREIX = "*";
    --- End diff --
    
    Typo: this should be ASTERISK.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106145397
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    --- End diff --
    
    If you don't intend to support graph field indexing (and I believe it would take a lot of effort, not just this code snippet), I suggest you remove the commented-out code.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106152482
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexESAssembler.java ---
    @@ -0,0 +1,129 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text.assembler;
    +
    +import org.apache.jena.assembler.Assembler;
    +import org.apache.jena.assembler.Mode;
    +import org.apache.jena.assembler.assemblers.AssemblerBase;
    +import org.apache.jena.query.text.*;
    +import org.apache.jena.rdf.model.RDFNode;
    +import org.apache.jena.rdf.model.Resource;
    +import org.apache.jena.rdf.model.Statement;
    +import org.apache.jena.sparql.util.graph.GraphUtils;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import static org.apache.jena.query.text.assembler.TextVocab.*;
    +
    +public class TextIndexESAssembler extends AssemblerBase {
    +
    +    private static Logger LOGGER      = LoggerFactory.getLogger(TextIndexESAssembler.class) ;
    +
    +    protected static final String COMMA = ",";
    +    protected static final String COLON = ":";
    +    /*
    +    <#index> a :TextIndexES ;
    +        text:serverList "127.0.0.1:9300,127.0.0.2:9400,127.0.0.3:9500" ; #Comma separated list of hosts:ports
    +        text:clusterName "elasticsearch"
    +        text:shards "1"
    +        text:replicas "1"
    +        text:entityMap <#endMap> ;
    +        .
    +    */
    +    
    +    @SuppressWarnings("resource")
    +    @Override
    +    public TextIndex open(Assembler a, Resource root, Mode mode) {
    +        try {
    +            String listOfHostsAndPorts = GraphUtils.getAsStringValue(root, pServerList) ;
    +            if(listOfHostsAndPorts == null || listOfHostsAndPorts.isEmpty()) {
    +                throw new TextIndexException("Mandatory property text:serverList (containing the comma-separated list of host:port) property is not specified. " +
    +                        "An example value for the property: 127.0.0.1:9300");
    +            }
    +            String[] hosts = listOfHostsAndPorts.split(COMMA);
    +            Map<String,Integer> hostAndPortMapping = new HashMap<>();
    +            for(String host : hosts) {
    +                String[] hostAndPort = host.split(COLON);
    +                if(hostAndPort.length < 2) {
    +                    LOGGER.error("Either the host or the port value is missing.Please specify the property in host:port format. " +
    +                            "Both parts are mandatory. Ignoring this value. Moving to the next one.");
    +                    continue;
    +                }
    +                hostAndPortMapping.put(hostAndPort[0], Integer.valueOf(hostAndPort[1]));
    +            }
    +
    +            String clusterName = GraphUtils.getAsStringValue(root, pClusterName);
    +            if(clusterName == null || clusterName.isEmpty()) {
    +                LOGGER.warn("ClusterName property is not specified. Defaulting to 'elasticsearch'");
    +                clusterName = "elasticsearch";
    +            }
    +
    +            String numberOfShards = GraphUtils.getAsStringValue(root, pShards);
    +            if(numberOfShards == null || numberOfShards.isEmpty()) {
    +                LOGGER.warn("shards property is not specified. Defaulting to '1'");
    +                numberOfShards = "1";
    +            }
    +
    +            String replicationFactor = GraphUtils.getAsStringValue(root, pReplicas);
    +            if(replicationFactor == null || replicationFactor.isEmpty()) {
    +                LOGGER.warn("replicas property is not specified. Defaulting to '1'");
    +                replicationFactor = "1";
    +            }
    +
    +            String indexName = GraphUtils.getAsStringValue(root, pIndexName);
    +            if(indexName == null || indexName.isEmpty()) {
    +                LOGGER.warn("index Name property is not specified. Defaulting to 'test'");
    --- End diff --
    
    Done



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @anujgandharv I didn't mean that we should introduce a Spring dependency. I just found the project and thought it could serve as an example for integrating the ES Maven plugin.
    
    Does this article help? http://www.esentri.com/blog/2017/02/02/integration-tests-elasticsearch/
    
    What errors do you get using the Maven ES plugin?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106140071
  
    --- Diff: jena-parent/pom.xml ---
    @@ -258,37 +257,92 @@
             <version>${ver.lucene}</version>
           </dependency>
     
    -      <!-- Solr client -->
    -      <!-- Exclusion of slf4j: Necessary so as to pick the version we want. 
    -           solrj->zookeeper has a dependency on slf4j -->
    +      <!-- For jena-spatial -->
    +      <dependency>
    +        <groupId>org.apache.lucene</groupId>
    +        <artifactId>lucene-spatial</artifactId>
    +        <version>${ver.lucene}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.apache.lucene</groupId>
    +        <artifactId>lucene-spatial-extras</artifactId>
    +        <version>${ver.lucene}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.locationtech.spatial4j</groupId>
    +        <artifactId>spatial4j</artifactId>
    +        <version>${ver.spatial4j}</version>
    +      </dependency>
     
    +      <!-- ES dependencies-->
           <dependency>
    -        <artifactId>solr-solrj</artifactId>
    -        <groupId>org.apache.solr</groupId>
    -        <version>${ver.solr}</version>
    +        <groupId>org.elasticsearch</groupId>
    +        <artifactId>elasticsearch</artifactId>
    +        <version>${ver.elasticsearch}</version>
             <exclusions>
               <exclusion>
    -            <groupId>org.slf4j</groupId>
    -            <artifactId>slf4j-api</artifactId>
    +            <groupId>commons-logging</groupId>
    --- End diff --
    
    I suppose there are good reasons for the exclusions in this file?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107680962
  
    --- Diff: jena-parent/pom.xml ---
    @@ -275,6 +276,27 @@
             <version>${ver.spatial4j}</version>
           </dependency>
     
    +      <!-- ES dependencies-->
    +      <dependency>
    +        <groupId>org.elasticsearch</groupId>
    +        <artifactId>elasticsearch</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.elasticsearch.client</groupId>
    +        <artifactId>transport</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +      </dependency>
    +
    +
    +      <dependency>
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106405506
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestTextIndexES.java ---
    @@ -0,0 +1,184 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.test.ESIntegTestCase;
    +import org.junit.Assert;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + *
    + * Integration test for {@link TextIndexES} class
    + * ES Integration test depends on security policies that may sometime not be loaded properly.
    + * If you find any issues regarding security set the following VM argument to resolve the issue:
    + * -Dtests.security.manager=false
    + *
    + */
    +@ESIntegTestCase.ClusterScope()
    +public class TestTextIndexES extends ESIntegTestCase {
    --- End diff --
    
    These are good unit tests, but they don't really exercise all the difficult cases. Could you use the Germany/Berlin example I gave earlier to construct some more unit tests?
    
    For example, start with a triple `:de rdfs:label "Germany"@en` and test that the index finds the entity using `Germany`. Then add the label `"Deutschland"@de` to the same entity and test that it can be found using both labels. Then delete the `"Germany"@en` label and test that the entity can be found using `Deutschland` but not `Germany`.
    
    Similarly for the Berlin examples: start with `:berlin rdfs:label "Berlin"@en` and test that the entity can be found using `Berlin`. Then add `"Berlin"@de` and test that it can still be found. Remove `"Berlin"@en` and test that the entity can still be found using `Berlin`. Finally, remove `"Berlin"@de` and test that the entity can no longer be found using `Berlin`.
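
The two scenarios above can be sketched as a small self-contained Java program. This is only an illustration under stated assumptions: `LabelIndexSketch` is a hypothetical in-memory stand-in, not the `TextIndexES` class under review, and it does not talk to an ElasticSearch instance. It stores each label together with its language tag, so that `"Berlin"@en` and `"Berlin"@de` count as two separate values, which is the behaviour the suggested tests are meant to pin down.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a text index; not part of Jena or ElasticSearch. */
public class LabelIndexSketch {

    // entity id -> list of {label, language tag} pairs
    private final Map<String, List<String[]>> docs = new HashMap<>();

    /** Record one language-tagged label for an entity. */
    public void add(String entityId, String label, String lang) {
        docs.computeIfAbsent(entityId, k -> new ArrayList<>()).add(new String[] {label, lang});
    }

    /** Remove exactly one matching label/language pair, leaving all other values in place. */
    public void remove(String entityId, String label, String lang) {
        List<String[]> values = docs.get(entityId);
        if (values == null) {
            return;
        }
        for (int i = 0; i < values.size(); i++) {
            String[] v = values.get(i);
            if (v[0].equals(label) && v[1].equals(lang)) {
                values.remove(i);
                return;
            }
        }
    }

    /** Return the ids of all entities that still carry the label in any language. */
    public Set<String> search(String label) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, List<String[]>> e : docs.entrySet()) {
            for (String[] v : e.getValue()) {
                if (v[0].equals(label)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    // Fails loudly if a step of the scenario does not hold.
    private static void check(boolean condition, String description) {
        if (!condition) {
            throw new AssertionError("Failed: " + description);
        }
        System.out.println("ok: " + description);
    }

    public static void main(String[] args) {
        LabelIndexSketch index = new LabelIndexSketch();

        // Germany scenario: two different labels on the same entity.
        index.add(":de", "Germany", "en");
        check(index.search("Germany").contains(":de"), "found by Germany");
        index.add(":de", "Deutschland", "de");
        check(index.search("Germany").contains(":de"), "still found by Germany");
        check(index.search("Deutschland").contains(":de"), "found by Deutschland");
        index.remove(":de", "Germany", "en");
        check(index.search("Deutschland").contains(":de"), "found by Deutschland after delete");
        check(!index.search("Germany").contains(":de"), "no longer found by Germany");

        // Berlin scenario: the same label text in two languages.
        index.add(":berlin", "Berlin", "en");
        check(index.search("Berlin").contains(":berlin"), "found by Berlin (en)");
        index.add(":berlin", "Berlin", "de");
        check(index.search("Berlin").contains(":berlin"), "found by Berlin (en + de)");
        index.remove(":berlin", "Berlin", "en");
        check(index.search("Berlin").contains(":berlin"), "still found via the de label");
        index.remove(":berlin", "Berlin", "de");
        check(!index.search("Berlin").contains(":berlin"), "no longer found by Berlin");
    }
}
```

An index that keeps only the bare string `Berlin` once per entity would fail the third Berlin step, because deleting the `@en` label would also drop the value that the `@de` label still needs. Real tests would make the same sequence of calls against the index implementation under review rather than this stand-in.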



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106428753
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    --- End diff --
    
    Actually, I have a workaround. I will test it by running a local instance of the cluster, using JenaESTextExample to try the script out.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma @ajs6f I have tried the ES Maven plugin, but it is throwing lots of errors. I suspect that it is just a wrapper around the embedded Elasticsearch. I asked the creator to provide a more comprehensive example of using the plugin, and I am still waiting for his answer.
    
    In the meantime, in the interest of time and of getting this functionality in, I would "please" suggest that we keep the tests we currently have, which serve as basic tests of the functionality, and use the JenaESTextExample.java class for more comprehensive testing.
    
    In any case, we require a running instance of ElasticSearch. Until we have a mechanism to start and stop an ElasticSearch instance automatically, we can start and stop it manually. I have documented how to do that in JenaESTextExample.java.
    
    The rest of the review comments have been incorporated (unless I missed something again).
    
    
    Do you guys agree? And can we merge it in?
    
    Thanks,
    Anuj Kumar



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107864529
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,94 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.atlas.lib.StrUtils;
    +import org.apache.jena.query.*;
    +import org.apache.jena.sparql.util.QueryExecUtils;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition we cant have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    +
    +    public static void main(String[] args) {
    +
    +        queryData(loadData(createAssembler()));
    +    }
    +
    +
    +    private static Dataset createAssembler() {
    +        String assemblerFile = "text-config-es.ttl";
    +        Dataset ds = DatasetFactory.assemble(assemblerFile,
    +                "http://localhost/jena_example/#text_dataset") ;
    +        return ds;
    +    }
    +
    +    private static Dataset loadData(Dataset ds) {
    +        JenaTextExample1.loadData(ds, "data-es.ttl");
    +        return ds;
    +    }
    +
    +    /**
    +     * Query Data
    +     * @param ds
    +     */
    +    private static void queryData(Dataset ds) {
    +//        JenaTextExample1.queryData(ds);
    --- End diff --
    
    Ok



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106869330
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    --- End diff --
    
    Removed Multilingual checks from the latest commit



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @anujgandharv Ah, right, sorry, I didn't remember to check that case. Good thing you got it working! I will proceed with merging #226 first and then let's get on with merging this one.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106143179
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexESAssembler.java ---
    @@ -0,0 +1,129 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text.assembler;
    +
    +import org.apache.jena.assembler.Assembler;
    +import org.apache.jena.assembler.Mode;
    +import org.apache.jena.assembler.assemblers.AssemblerBase;
    +import org.apache.jena.query.text.*;
    +import org.apache.jena.rdf.model.RDFNode;
    +import org.apache.jena.rdf.model.Resource;
    +import org.apache.jena.rdf.model.Statement;
    +import org.apache.jena.sparql.util.graph.GraphUtils;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import static org.apache.jena.query.text.assembler.TextVocab.*;
    +
    +public class TextIndexESAssembler extends AssemblerBase {
    +
    +    private static Logger LOGGER      = LoggerFactory.getLogger(TextIndexESAssembler.class) ;
    +
    +    protected static final String COMMA = ",";
    +    protected static final String COLON = ":";
    +    /*
    +    <#index> a :TextIndexES ;
    +        text:serverList "127.0.0.1:9300,127.0.0.2:9400,127.0.0.3:9500" ; #Comma separated list of hosts:ports
    +        text:clusterName "elasticsearch"
    +        text:shards "1"
    +        text:replicas "1"
    +        text:entityMap <#endMap> ;
    +        .
    +    */
    +    
    +    @SuppressWarnings("resource")
    +    @Override
    +    public TextIndex open(Assembler a, Resource root, Mode mode) {
    +        try {
    +            String listOfHostsAndPorts = GraphUtils.getAsStringValue(root, pServerList) ;
    +            if(listOfHostsAndPorts == null || listOfHostsAndPorts.isEmpty()) {
    +                throw new TextIndexException("Mandatory property text:serverList (containing the comma-separated list of host:port) property is not specified. " +
    +                        "An example value for the property: 127.0.0.1:9300");
    +            }
    +            String[] hosts = listOfHostsAndPorts.split(COMMA);
    +            Map<String,Integer> hostAndPortMapping = new HashMap<>();
    +            for(String host : hosts) {
    +                String[] hostAndPort = host.split(COLON);
    +                if(hostAndPort.length < 2) {
    +                    LOGGER.error("Either the host or the port value is missing.Please specify the property in host:port format. " +
    +                            "Both parts are mandatory. Ignoring this value. Moving to the next one.");
    +                    continue;
    +                }
    +                hostAndPortMapping.put(hostAndPort[0], Integer.valueOf(hostAndPort[1]));
    +            }
    +
    +            String clusterName = GraphUtils.getAsStringValue(root, pClusterName);
    +            if(clusterName == null || clusterName.isEmpty()) {
    +                LOGGER.warn("ClusterName property is not specified. Defaulting to 'elasticsearch'");
    +                clusterName = "elasticsearch";
    +            }
    +
    +            String numberOfShards = GraphUtils.getAsStringValue(root, pShards);
    +            if(numberOfShards == null || numberOfShards.isEmpty()) {
    +                LOGGER.warn("shards property is not specified. Defaulting to '1'");
    +                numberOfShards = "1";
    +            }
    +
    +            String replicationFactor = GraphUtils.getAsStringValue(root, pReplicas);
    +            if(replicationFactor == null || replicationFactor.isEmpty()) {
    +                LOGGER.warn("replicas property is not specified. Defaulting to '1'");
    +                replicationFactor = "1";
    +            }
    +
    +            String indexName = GraphUtils.getAsStringValue(root, pIndexName);
    +            if(indexName == null || indexName.isEmpty()) {
    +                LOGGER.warn("index Name property is not specified. Defaulting to 'test'");
    --- End diff --
    
    The logged message is wrong, index name defaults to "jena-text"
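
One way to keep a default and its log message from drifting apart is to define the default once and build the message from it. The following is a minimal sketch, not the PR's code: the class and method names are hypothetical, and the value "jena-text" is taken from the review comment above.

```java
/** Hypothetical helper: a single source of truth for the default index name. */
final class IndexNameDefault {

    // Assumed default, as stated in the review comment ("jena-text").
    static final String DEFAULT_INDEX_NAME = "jena-text";

    /** Returns the configured name, or the default when none was given. */
    static String resolve(String configured) {
        return (configured == null || configured.isEmpty()) ? DEFAULT_INDEX_NAME : configured;
    }

    /** Builds the warning from the same constant, so it cannot name a different default. */
    static String missingNameWarning() {
        return "index name property is not specified. Defaulting to '" + DEFAULT_INDEX_NAME + "'";
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("my-index"));
        System.out.println(missingNameWarning());
    }
}
```

The same constant could also be referenced from the Javadoc of the settings class with `{@value}`, which would cover the "Defaults to 'test'" comments as well.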



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    #226 has been merged. I suggest the following plan for merging this PR:
    
    - @anujgandharv it would be great if you could rebase on top of current apache master, which includes #226. That would reduce clutter when looking at the diffs in this PR.
    - I'll wait until Monday (or possibly Tuesday) to give others (@ajs6f, @afs, @ehedgehog, @rvesse ...) a chance to look at the code. I'll merge it then if there are no objections.
    
    Personally I'm satisfied with the current implementation and tests.
    
    Things I'm unsure about:
    1. How the Maven integration tests have been wired up. There seems to be quite a lot of configuration related to this in the new pom.xml, and it works AFAICT. But I've never done a custom Maven test configuration so I can't tell whether it's done in the right way. People who know more about the Jena testing framework could take a closer look.
    2. Logging setup, dependencies etc. Same thing, I can't tell whether the current setup is right for the project.




[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106142391
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/ESSettings.java ---
    @@ -0,0 +1,177 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +/**
    + * Settings for ElasticSearch based indexing
    + */
    +public class ESSettings {
    +
    +    /**
    +     * Map of hosts and ports. The host could also be an IP Address
    +     */
    +    private Map<String,Integer> hostToPortMapping;
    +
    +    /**
    +     * Name of the Cluster. Defaults to 'elasticsearch'
    +     */
    +    private String clusterName;
    +
    +    /**
    +     * Number of shards. Defaults to '1'
    +     */
    +    private Integer shards;
    +
    +    /**
    +     * Number of replicas. Defaults to '1'
    +     */
    +    private Integer replicas;
    +
    +    /**
    +     * Name of the index. Defaults to 'test'
    --- End diff --
    
    Should default to "jena-text"



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Hi @anujgandharv, thanks for the update, I will do a new review soon.
    
    Regarding `TextIndex.get`, excellent question! I was actually wondering the same when I did the review yesterday, but didn't take a closer look then. It seems like this method is never called from within Jena! The only exception is your TestTextIndexES class that specifically tests for this method.
    
    I don't know what the reason for this method is. Maybe it was useful some time ago, or maybe it was created for some purpose that never really materialized. I think the `get` method could simply be removed from the TextIndex interface and all implementations. I'm sorry that you had to spend time implementing it.
    
    The situation is similar for the `updateEntity` method. It is also not called from Jena code and could be dropped.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106403820
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    --- End diff --
    
    The atomic update looks really good now!
    
    However, I suggest that you pass at least the field value and maybe also the field name (not sure whether that is possible) as named parameters to the ES script, as [explained in the ES documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html#prefer-params). That should perform better and also avoid the somewhat awkward `replaceAll` string operations here.
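
The suggestion above can be sketched as follows. This is an illustration only, under stated assumptions: `ParameterizedScript` is a hypothetical holder class rather than an Elasticsearch type, the script body is an untested adaptation of the one in the diff, and whether the field name can be looked up through `params` inside `ctx._source[...]` would need to be checked against the ES version in use. The point is that the script text stays constant while the field name and value travel in a separate map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for a script body plus its named parameters (not an ES class). */
final class ParameterizedScript {

    // Constant script text: nothing is spliced into it, so values containing
    // quotes or '$' need no escaping and the text is identical for every call.
    static final String SOURCE =
        "if (ctx._source[params.fieldName] == null || ctx._source[params.fieldName].empty) "
      + "{ ctx._source[params.fieldName] = [params.fieldValue] } "
      + "else { ctx._source[params.fieldName].add(params.fieldValue) }";

    final String source;
    final Map<String, Object> params;

    private ParameterizedScript(String source, Map<String, Object> params) {
        this.source = source;
        this.params = params;
    }

    /** Pairs the constant script text with the per-call field name and value. */
    static ParameterizedScript forField(String fieldName, String fieldValue) {
        Map<String, Object> params = new HashMap<>();
        params.put("fieldName", fieldName);
        params.put("fieldValue", fieldValue);
        return new ParameterizedScript(SOURCE, Collections.unmodifiableMap(params));
    }

    public static void main(String[] args) {
        // A value like this is awkward for the replaceAll approach:
        // the quote ends the script string literal early, and "$5" is
        // read as a group reference by String.replaceAll.
        ParameterizedScript script = forField("label_en", "O'Reilly costs $5");
        System.out.println(script.source);
        System.out.println(script.params);
    }
}
```

In the PR code, the script text and the map would then be handed to the Elasticsearch `Script` class, which accepts a parameter map. The constructor's argument order differs between ES 5.x releases, so it should be checked against the Javadoc of the version being built against.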



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    For some reason GitHub complains about merge conflicts on this PR, but I don't see why. The files in question were modified by JENA-1301 (PR #220) but this branch already contains those commits. I was able to perform the merge locally with no problems.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106429631
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,64 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.query.Dataset;
    +import org.apache.jena.query.DatasetFactory;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition we cant have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    --- End diff --
    
    Oh, I didn't realise that. I haven't renamed it. I think it got deleted by mistake.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

https://github.com/apache/jena/pull/227

I tested the ES backend with some non-toy SKOS data, namely [YSO](http://finto.fi/en/yso/). I configured the entity definition to index the predicates `skos:prefLabel`, `skos:altLabel` and `skos:hiddenLabel`. The dataset has 520k triples and 29k entities. There are in total 150k triples with these label properties.

I'm using a rather old laptop (i3-2330M with SSD) for the test. Ubuntu 16.04, ES 5.2.1.

Using the ES backend, indexing this dataset took about 25 minutes:
```
16:42:45 INFO [1] PUT http://localhost:3030/ds/data?default
17:08:06 INFO [1] 204 No Content (1 521,465 s)
```

Looking at process stats, most of the time was spent by ES. It used about 38 minutes of CPU time.

I also indexed the same dataset using the Lucene backend. It took less than 30 seconds:
```
17:11:26 INFO [1] PUT http://localhost:3030/ds/data?default
17:11:55 INFO [1] 204 No Content (28,237 s)
```

Query performance seems to be pretty much the same. In fact, the ES backend seems slightly faster than the Lucene backend, but there was a lot of variance so I can't tell for sure.

I have my doubts about whether the indexing performance is acceptable for real world use cases like what @anujgandharv is targeting, but I don't think this should stop us from merging this contribution. Since there have been no objections, I will proceed with the merge.
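
To put the two timings side by side, here is a rough back-of-the-envelope calculation. It assumes the logged durations use a decimal comma (about 1521.465 s for ES and 28.237 s for Lucene) and takes the "150k" label triples as exactly 150,000, so the throughput figures are only approximate. With those assumptions, ES indexing comes out roughly 54 times slower, at roughly 99 label triples per second against roughly 5,300 for Lucene.

```java
/** Rough comparison of the two indexing runs reported above. */
final class IndexingTimings {

    // Durations read from the Fuseki log lines, interpreted with a decimal comma.
    static final double ES_SECONDS = 1521.465;
    static final double LUCENE_SECONDS = 28.237;

    // "150k" label triples from the message; treated as an approximation.
    static final int LABEL_TRIPLES = 150_000;

    /** How many times longer the ES run took than the Lucene run. */
    static double slowdown() {
        return ES_SECONDS / LUCENE_SECONDS;
    }

    /** Indexed label triples per second for a run of the given duration. */
    static double triplesPerSecond(double seconds) {
        return LABEL_TRIPLES / seconds;
    }

    public static void main(String[] args) {
        System.out.printf("ES / Lucene wall-clock ratio: %.1f%n", slowdown());
        System.out.printf("ES label triples per second: %.0f%n", triplesPerSecond(ES_SECONDS));
        System.out.printf("Lucene label triples per second: %.0f%n", triplesPerSecond(LUCENE_SECONDS));
    }
}
```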


[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/227
  
    :+1:



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Great. Thanks @Osma. I misunderstood your previous comment. I will implement the integration tests for the above scenarios.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106151951
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
    @@ -27,7 +27,7 @@
     import org.apache.jena.system.JenaSystem ;
     import org.apache.lucene.analysis.Analyzer;
     import org.apache.lucene.store.Directory ;
    -import org.apache.solr.client.solrj.SolrServer ;
    +import org.elasticsearch.indices.IndexCreationException;
    --- End diff --
    
    Agreed. Made the changes.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106144452
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    --- End diff --
    
    Should default to "jena-text"



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106638262
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    CASE 1: I believe you mean that the text query has the syntax `X3`. Otherwise, yes, that's correct.
    
    CASE 2: Yes, correct.
    
    CASE 3: Yes, correct.
    
    I can see that this is beginning to look difficult to implement. You have stored the values as arrays in ES, which makes sense for keeping the index synchronized with the actual triples, but it makes it difficult to know which of the values actually matched the original query. I don't think you should perform e.g. string matching operations in Java code to find out which of the values in the array actually contained the query string. For one thing, that would cause problems with specialized Analyzers if support for them is added later. Also, I think it is the responsibility of the text index to know which value was matched.
    
    For now I think it is better to leave `literal` as `null` instead of assigning it a potentially incorrect value. That's also what the Lucene backend does when the `storeValues` setting has not been enabled and thus original values are not available in the index.
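
The ambiguity described above can be illustrated with a toy model. This is a minimal sketch that uses plain strings in place of Jena `Node` objects and the ES client; the class and method names are hypothetical. It only shows why returning the first stored value (as the quoted `value.get(0)` does) can report a label that the query never matched, and what the suggested `null` behavior looks like.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of one indexed field that holds several values for the same entity.
// Names are illustrative only; nothing here is part of the Jena or ES APIs.
class FirstValueAmbiguity {

    /** Mirrors the quoted value.get(0): always reports the first stored value. */
    static String firstStoredValue(List<String> storedValues) {
        return storedValues.isEmpty() ? null : storedValues.get(0);
    }

    /** Behavior suggested in the review: report no literal instead of guessing. */
    static String literalOrNull(List<String> storedValues) {
        return null;
    }

    public static void main(String[] args) {
        // Two labels stored in the same array field of one document.
        List<String> labels = Arrays.asList("cat", "feline");
        String queryTerm = "feline";

        // The document is a hit for "feline", but the first stored value is "cat".
        System.out.println("query term:      " + queryTerm);
        System.out.println("first value:     " + firstStoredValue(labels));
        System.out.println("null suggestion: " + literalOrNull(labels));
    }
}
```

Returning `null` loses information, but it never claims that a particular label matched when the index cannot say which one did.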



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106152321
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextAssembler.java ---
    @@ -29,14 +29,15 @@ public static void init()
             AssemblerUtils.registerDataset(TextVocab.textDataset,      new TextDatasetAssembler()) ;
             
             Assembler.general.implementWith(TextVocab.entityMap,        new EntityDefinitionAssembler()) ;
    -        Assembler.general.implementWith(TextVocab.textIndexSolr,    new TextIndexSolrAssembler()) ; 
             Assembler.general.implementWith(TextVocab.textIndexLucene,  new TextIndexLuceneAssembler()) ;
             Assembler.general.implementWith(TextVocab.standardAnalyzer, new StandardAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.simpleAnalyzer,   new SimpleAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.keywordAnalyzer,  new KeywordAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.lowerCaseKeywordAnalyzer, new LowerCaseKeywordAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.localizedAnalyzer, new LocalizedAnalyzerAssembler()) ;
             Assembler.general.implementWith(TextVocab.configurableAnalyzer, new ConfigurableAnalyzerAssembler()) ;
    +        Assembler.general.implementWith(TextVocab.textIndexES,  new TextIndexESAssembler()) ;
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106404329
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    Would it be possible to return the original, language-tagged literal value here?
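
    One possible approach, sketched in plain Java with no Jena or ElasticSearch dependency: `addEntity` in this diff stores language-tagged values under `<field>_<lang>`, so the tag could be read back from the key in the hit's source map. `LangLiteral` and `fromSource` are hypothetical names used only for illustration, and the key layout is assumed from the diff. In the real code the result would presumably be turned into a language-tagged literal `Node` instead of a holder object.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical holder for a lexical form plus its language tag ("" when untagged). */
final class LangLiteral {
    final String lexicalForm;
    final String lang;

    LangLiteral(String lexicalForm, String lang) {
        this.lexicalForm = lexicalForm;
        this.lang = lang;
    }

    /**
     * Returns the first non-empty value stored under `field` or `field_<lang>`
     * together with its language tag, or null when no value is present.
     * Assumption: no other field name starts with `field + "_"`.
     */
    static LangLiteral fromSource(Map<String, Object> source, String field) {
        // Iterate in sorted key order so the result is deterministic.
        for (Map.Entry<String, Object> e : new TreeMap<String, Object>(source).entrySet()) {
            String key = e.getKey();
            boolean untagged = key.equals(field);
            boolean tagged = key.startsWith(field + "_");
            if (!untagged && !tagged) continue;
            if (!(e.getValue() instanceof List)) continue;
            List<?> values = (List<?>) e.getValue();
            if (values.isEmpty()) continue;
            String lang = untagged ? "" : key.substring(field.length() + 1);
            return new LangLiteral(String.valueOf(values.get(0)), lang);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new HashMap<>();
        source.put("label", Collections.emptyList());   // placeholder field, no values
        source.put("label_en", Arrays.asList("color")); // value tagged with "en"
        LangLiteral lit = fromSource(source, "label");
        System.out.println(lit.lexicalForm + "@" + lit.lang); // prints "color@en"
    }
}
```

    The sketch returns only the first value it finds, which mirrors the `value.get(0)` in the diff; returning the value that actually matched the query would need the hit's highlighting or matched-field information.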



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106406514
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestTextIndexES.java ---
    @@ -0,0 +1,184 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.test.ESIntegTestCase;
    +import org.junit.Assert;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + *
    + * Integration test for {@link TextIndexES} class
    + * ES Integration test depends on security policies that may sometimes not be loaded properly.
    + * If you find any issues regarding security set the following VM argument to resolve the issue:
    + * -Dtests.security.manager=false
    + *
    + */
    +@ESIntegTestCase.ClusterScope()
    +public class TestTextIndexES extends ESIntegTestCase {
    --- End diff --
    
    Also, it would be good to have a unit test that tests for language tag subcodes, e.g. add an entity like
    `:col rdfs:label "color"@en-US, "colour"@en-GB`
    then test that 
    - it can be found using either `color` or `colour`, without lang parameter
    - it can be found using either `color` or `colour` using `lang:en*` as parameter
    - it can be found using `color` `lang:en-US`
    - it cannot be found using `color` `lang:en-GB`
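
    The expectations in this list can be written down as a small in-memory model. This is only a sketch of the intended semantics, not a test against `TextIndexES`: the class and method names are hypothetical, and treating `en*` as a plain, case-insensitive prefix match on the language tag is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory model of the language-subcode expectations. */
final class LangSubcodeExpectations {

    /** True when `lang` satisfies `filter`: null means any, "en*" means prefix, otherwise exact. */
    static boolean langMatches(String lang, String filter) {
        if (filter == null) return true;
        String l = lang.toLowerCase(Locale.ROOT);
        String f = filter.toLowerCase(Locale.ROOT);
        if (f.endsWith("*")) {
            return l.startsWith(f.substring(0, f.length() - 1));
        }
        return l.equals(f);
    }

    /** True when some label equal to `term` carries a language accepted by `filter`. */
    static boolean found(Map<String, String> labelsByLang, String term, String filter) {
        for (Map.Entry<String, String> e : labelsByLang.entrySet()) {
            if (e.getValue().equals(term) && langMatches(e.getKey(), filter)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // :col rdfs:label "color"@en-US, "colour"@en-GB
        Map<String, String> col = new LinkedHashMap<>();
        col.put("en-US", "color");
        col.put("en-GB", "colour");

        System.out.println(found(col, "color", null));     // true
        System.out.println(found(col, "colour", null));    // true
        System.out.println(found(col, "color", "en*"));    // true
        System.out.println(found(col, "colour", "en*"));   // true
        System.out.println(found(col, "color", "en-US"));  // true
        System.out.println(found(col, "color", "en-GB"));  // false
    }
}
```

    A real test would add the entity through the index and run the same six checks via `query`, asserting on whether the hit list is empty.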




[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the issue:

https://github.com/apache/jena/pull/227

@anujgandharv, the first place to look is the actual `jena-integration-tests` module. You will see there that the tests follow the same form as in other Jena modules: they are explicitly linked together in suites and run by `maven-surefire`. I would have preferred the more usual practices that I suppose you are referring to (tests selected by name, `maven-failsafe` for integration tests), but at least this way is consistent across the Jena code base.

So in order to put your tests into that module as it stands you would have to use the Maven plugin to start an ES node before the `test` phase and stop it afterwards. There aren't any such phases that seem particularly suitable to me, but you could use `process-test-classes` and `prepare-package`, I suppose.

On the other hand, it might be better (I think it would be better) to put your tests into that module and execute them with `maven-failsafe`. Then you can use the `pre-` and `post-integration-test` phases to start and stop the ES node, which is (to my understanding) the right and normal timing for that kind of operation.

A last alternative would be to break that integration test module down into submodules, one for each module being tested. That's a little more hierarchy and structure than Jena usually uses with maven, but this might be a good place for it.

@afs, what do you think?

[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Regarding merging, I would like to merge PR #226 first, which changes the `TextIndex.query` method, breaking out the `lang` and `graph` information into separate parameters. The effect on this code should be that the `parse` method becomes unnecessary, but all the calls to the `query` method in the integration tests need to be changed slightly. I can do that myself, as I don't want to burden @anujgandharv with any more extra work, but if you like you can already move your code on top of that branch.
    
    Do people have opinions about whether to squash these (currently) 29 commits into a single commit for `master`, or leave them as they are? It's a trade-off between historical accuracy and clarity. There have been quite a few back-and-forth changes in this branch. I would be inclined to squash and rebase before merging, making it clear what has actually changed in the jena-text code. The original commits should still be available in this PR in case anyone is interested in those. Opinions? @ajs6f? @afs?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106151436
  
    --- Diff: jena-text/src/main/java/examples/JenaTextExample1.java ---
    @@ -41,9 +41,9 @@
         
         public static void main(String ... argv)
         {
    -        Dataset ds = createCode() ;
    -        //Dataset ds = createAssembler() ;
    -        loadData(ds , "data.ttl") ;
    +//        Dataset ds = createCode() ;
    +        Dataset ds = createAssembler() ;
    --- End diff --
    
    Reverted the changes to the Lucene example. It was an accidental check-in.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106141932
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,65 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.query.Dataset;
    +import org.apache.jena.query.DatasetFactory;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition, we cannot have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    +
    +    public static void main(String[] args) {
    +
    +        queryData(loadData(createAssembler()));
    +    }
    +
    +
    +    private static Dataset createAssembler() {
    +        String assemblerFile = "text-config-es.ttl";
    +        Dataset ds = DatasetFactory.assemble(assemblerFile,
    +                "http://localhost/jena_example/#text_dataset") ;
    +        return ds;
    +    }
    +
    +    private static Dataset loadData(Dataset ds) {
    +        JenaTextExample1.loadData(ds, "data-es.ttl");
    +        return ds;
    +    }
    +
    +    /**
    +     * The data being queried from ElasticSearch is proper but what is getting printed is wrong.
    --- End diff --
    
    This comment is a bit puzzling. Could you fix the underlying issue easily?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106428124
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    --- End diff --
    
    Outdated comment. Will update.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106869941
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    --- End diff --
    
    Reworked the script to pass the field value as a parameter, so the script is NOT recompiled for every new value.
    The field name cannot be passed as a parameter, because ES apparently only applies params to values (the right-hand side of the = sign).
    In any case this should not cause any extra performance degradation, because the script can be compiled once per field.
    Also, a single quote in the value is now preserved. Tested with JenaESTextExample and the latest scripting update.
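
A minimal sketch of the idea described above, in plain Java with no Elasticsearch classes so that it runs on its own. The class name `ParamScriptSketch`, the parameter name `fieldValue`, and the example field `label_en` are hypothetical, and the `params.fieldValue` syntax assumes the Painless scripting language; the reworked code in the pull request may differ. The field name is still spliced into the script text, while the value is kept in a separate parameter map. This also sidesteps a pitfall of the `replaceAll` version in the diff: `String.replaceAll` treats `$` and `\` in the replacement string specially, so a literal containing them can be mangled or rejected.

```java
import java.util.Collections;
import java.util.Map;

// Sketch only: builds the script text and its parameter map, nothing more.
final class ParamScriptSketch {

    // Hypothetical parameter name, chosen for this sketch.
    static final String VALUE_PARAM = "fieldValue";

    // The field name is part of the script text, so there is one distinct
    // script source per field name; the value never appears in it.
    static String scriptSource(String fieldName) {
        String f = "ctx._source." + fieldName;
        return "if(" + f + " == null || " + f + ".empty) "
                + "{" + f + "=[params." + VALUE_PARAM + "] } "
                + "else {" + f + ".add(params." + VALUE_PARAM + ")}";
    }

    // The value travels separately as a script parameter, so quotes, '$'
    // and backslashes in it need no escaping.
    static Map<String, Object> scriptParams(String fieldValue) {
        return Collections.<String, Object>singletonMap(VALUE_PARAM, fieldValue);
    }

    public static void main(String[] args) {
        System.out.println(scriptSource("label_en"));
        System.out.println(scriptParams("O'Reilly $1"));
    }
}
```

The source string and the map would then be handed to the ES `Script` object together, in place of the single-argument `new Script(addUpdateScript)` call shown in the diff; the exact constructor to use depends on the Elasticsearch version.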



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106156794
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First Search of the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exists. so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First Search of the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +            //This means field already exists and therefore we should remove from it
    +            script = "ctx._source." + fieldToRemove+".remove('"+ valueToRemove + "')";
    +        }
    +
    +        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +//        client.prepareDelete(INDEX_NAME, docDef.getEntityField(), entity.getId()).get();
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(INDEX_NAME, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                for (String field: docDef.fields()) {
    +
    +                    GetField fieldResponse = response.getField(field);
    +
    +                    if(fieldResponse == null || fieldResponse.getValue() == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    if(fieldResponse instanceof List<?>) {
    +                        //We are only interested in literal values
    +                        continue;
    +                    }
    +                    //We assume it will always be a String value.
    +                    String fieldValue = (String)fieldResponse.getValue();
    +                    Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    --- End diff --
    
    Should be. Let me try to rework this.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Cool. 



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106145817
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    --- End diff --
    
    This works a bit differently than multilingual mode in jena-text/Lucene. Over there, in multilingual mode, values are stored both in a language specific field (e.g. `label_en`) and in a language-agnostic field (e.g. `lang`). The benefit is that language-specific fields may also use language-specific Analyzers which can do things like stemming and stopword filtering. At query time, the language-specific field is targeted when the language parameter is used, but if not, then the language-agnostic field is used instead. The downside is a slightly larger index size.
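
The dual-field scheme described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `MultilingualFieldSketch` and its methods are hypothetical and are not part of jena-text or of this pull request, and the language-agnostic field name is passed in as a parameter rather than taken from any real configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not jena-text code: shows which index fields a literal
// would be written to, and which field a query would target.
class MultilingualFieldSketch {

    // Index time: always write the language-agnostic field; in multilingual
    // mode also write a language-specific field such as label_en.
    static Map<String, String> fieldsFor(String field, String value, String lang, boolean multilingual) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put(field, value);
        if (multilingual && lang != null && !lang.isEmpty()) {
            doc.put(field + "_" + lang, value);
        }
        return doc;
    }

    // Query time: target the language-specific field only when a language is given.
    static String queryField(String field, String lang) {
        return (lang == null || lang.isEmpty()) ? field : field + "_" + lang;
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor("label", "colour", "en", true));
        System.out.println(queryField("label", "en"));
        System.out.println(queryField("label", null));
    }
}
```

Writing the value twice is what makes the index slightly larger, and it is also what lets a query without a language parameter still match every literal.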



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Thanks @anujgandharv for the very quick responses to my comments! Please let me know when you're done with them all and want a second review.
    
    It would be helpful if your commit messages were a bit more informative. Current ones mention that they're related to JENA-1305 and ES but that doesn't distinguish them from each other. It's good to keep the JENA-1305 tag but the rest of the message should be about what was changed in that specific commit and why. That way it's easier to see immediately what issues you have worked on in your follow-up commits.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I don't know enough to have a useful opinion about how thorough the tests need to be-- I am very happy to defer to @osma about that.
    
    As for the test framework-- my (small) knowledge of ES encourages me to agree that we want to avoid relying on the embedded mode if at _all_ possible. If we have to use it, we should immediately open an additional Jena ticket to get rid of it! :disappointed: It seems just as well to me to avoid relying on the ES test framework, not because I think it is inherently problematic, but because it introduces another piece of software that has to be understood to maintain this new functionality.
    
    In case it's not obvious @anujgandharv (although I am guessing it is), @osma and I are very concerned both to get this new code merged and to do it in such a way that we can reasonably expect to maintain it. With a fairly small all-volunteer body of maintainers, that is a very real and concrete concern.
    




[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I'm definitely cool with squashing a bit. I think that helps posterity a lot. Two possible (plausible?) principles:
    
    1. Each post-squash commit should move the codebase from a does-build state to a does-build state.
    2. Each post-squash commit should have a very clear log message, so that if you just read the log messages without seeing the code, you would get a rough idea of what was being attempted. (No cut-and-pasting log messages over and over, even though I do that all the time! :) )



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106158114
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First search whether the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exist, so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First search whether the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +            //This means field already exists and therefore we should remove from it
    +            script = "ctx._source." + fieldToRemove+".remove('"+ valueToRemove + "')";
    +        }
    +
    +        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +//        client.prepareDelete(INDEX_NAME, docDef.getEntityField(), entity.getId()).get();
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(INDEX_NAME, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                for (String field: docDef.fields()) {
    +
    +                    GetField fieldResponse = response.getField(field);
    +
    +                    if(fieldResponse == null || fieldResponse.getValue() == null) {
    +                        //We won't return it.
    +                        continue;
    +                    }
    +                    if(fieldResponse instanceof List<?>) {
    +                        //We are only interested in literal values
    +                        continue;
    +                    }
    +                    //We assume it will always be a String value.
    +                    String fieldValue = (String)fieldResponse.getValue();
    +                    Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                    result.put(field, fieldNode);
    +
    +                }
    +
    +
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, 0);
    --- End diff --
    
    Agree. ES has the same although it is a configurable option. I have defaulted it to 10000 for now.
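
A minimal sketch of that defaulting, assuming a limit of 0 (or any non-positive value) means "no explicit limit". The class and constant names are hypothetical and not taken from the pull request; only the figure 10000 comes from the comment above.

```java
// Hypothetical sketch, not code from the pull request.
class QueryLimitSketch {

    // Assumed default cap, using the figure mentioned in the comment above.
    static final int DEFAULT_MAX_RESULTS = 10000;

    // A non-positive limit is treated as "not specified" and replaced by the default.
    static int effectiveLimit(int requested) {
        return requested > 0 ? requested : DEFAULT_MAX_RESULTS;
    }

    public static void main(String[] args) {
        System.out.println(effectiveLimit(0));
        System.out.println(effectiveLimit(25));
    }
}
```

Keeping the default in a single constant would make it straightforward to turn into a configurable setting later.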



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107681104
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexESAssembler.java ---
    @@ -0,0 +1,129 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text.assembler;
    +
    +import org.apache.jena.assembler.Assembler;
    +import org.apache.jena.assembler.Mode;
    +import org.apache.jena.assembler.assemblers.AssemblerBase;
    +import org.apache.jena.query.text.*;
    +import org.apache.jena.rdf.model.RDFNode;
    +import org.apache.jena.rdf.model.Resource;
    +import org.apache.jena.rdf.model.Statement;
    +import org.apache.jena.sparql.util.graph.GraphUtils;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import static org.apache.jena.query.text.assembler.TextVocab.*;
    +
    +public class TextIndexESAssembler extends AssemblerBase {
    +
    +    private static Logger LOGGER      = LoggerFactory.getLogger(TextIndexESAssembler.class) ;
    +
    +    protected static final String COMMA = ",";
    +    protected static final String COLON = ":";
    +    /*
    +    <#index> a :TextIndexES ;
    +        text:serverList "127.0.0.1:9300,127.0.0.2:9400,127.0.0.3:9500" ; #Comma separated list of hosts:ports
    +        text:clusterName "elasticsearch" ;
    +        text:shards "1" ;
    +        text:replicas "1" ;
    +        text:entityMap <#endMap> ;
    +        .
    +    */
    +    
    +    @SuppressWarnings("resource")
    +    @Override
    +    public TextIndex open(Assembler a, Resource root, Mode mode) {
    +        try {
    +            String listOfHostsAndPorts = GraphUtils.getAsStringValue(root, pServerList) ;
    +            if(listOfHostsAndPorts == null || listOfHostsAndPorts.isEmpty()) {
    +                throw new TextIndexException("Mandatory property text:serverList (containing the comma-separated list of host:port) is not specified. " +
    +                        "An example value for the property: 127.0.0.1:9300");
    +            }
    +            String[] hosts = listOfHostsAndPorts.split(COMMA);
    +            Map<String,Integer> hostAndPortMapping = new HashMap<>();
    +            for(String host : hosts) {
    +                String[] hostAndPort = host.split(COLON);
    +                if(hostAndPort.length < 2) {
    +                    LOGGER.error("Either the host or the port value is missing. Please specify the property in host:port format. " +
    +                            "Both parts are mandatory. Ignoring this value. Moving to the next one.");
    +                    continue;
    +                }
    +                hostAndPortMapping.put(hostAndPort[0], Integer.valueOf(hostAndPort[1]));
    +            }
    +
    +            String clusterName = GraphUtils.getAsStringValue(root, pClusterName);
    +            if(clusterName == null || clusterName.isEmpty()) {
    +                LOGGER.warn("ClusterName property is not specified. Defaulting to 'elasticsearch'");
    +                clusterName = "elasticsearch";
    +            }
    +
    +            String numberOfShards = GraphUtils.getAsStringValue(root, pShards);
    +            if(numberOfShards == null || numberOfShards.isEmpty()) {
    +                LOGGER.warn("shards property is not specified. Defaulting to '1'");
    +                numberOfShards = "1";
    +            }
    +
    +            String replicationFactor = GraphUtils.getAsStringValue(root, pReplicas);
    +            if(replicationFactor == null || replicationFactor.isEmpty()) {
    +                LOGGER.warn("replicas property is not specified. Defaulting to '1'");
    +                replicationFactor = "1";
    +            }
    +
    +            String indexName = GraphUtils.getAsStringValue(root, pIndexName);
    +            if(indexName == null || indexName.isEmpty()) {
    +                LOGGER.warn("index Name property is not specified. Defaulting to 'jena-text'");
    +                indexName = "jena-text";
    +            }
    +
    +            boolean isMultilingualSupport = false;
    --- End diff --
    
    Done
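
The quoted assembler code splits the `text:serverList` value on commas and colons, then calls `Integer.valueOf` on the port. As written, whitespace after a comma is kept in the host name, and a non-numeric port raises a `NumberFormatException` instead of being skipped. The following is a minimal sketch of a more defensive parse, not the code from the pull request; the class name `HostPortParser` and the method name `parse` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HostPortParser {

    /**
     * Parses a comma-separated list of host:port pairs.
     * Malformed entries (missing host, missing port, non-numeric port) are skipped.
     */
    public static Map<String, Integer> parse(String serverList) {
        Map<String, Integer> result = new LinkedHashMap<>();
        if (serverList == null) {
            return result;
        }
        for (String entry : serverList.split(",")) {
            // Trim so that "a:1, b:2" does not yield a host named " b".
            String[] parts = entry.trim().split(":");
            if (parts.length != 2 || parts[0].isEmpty()) {
                continue; // host or port is missing
            }
            try {
                result.put(parts[0], Integer.valueOf(parts[1].trim()));
            } catch (NumberFormatException e) {
                // Port is not a number: ignore this entry and move on.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("127.0.0.1:9300, es2:9301, broken, es3:abc"));
    }
}
```

This sketch requires exactly one colon per entry, so it would also reject IPv6 literals; whether that matters depends on how the property is meant to be used.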



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106404607
  
    --- Diff: jena-text/text-config.ttl ---
    @@ -50,6 +50,7 @@ text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
     <#indexLucene> a text:TextIndexLucene ;
         #text:directory <file:Lucene> ;
         text:directory "mem" ;
    +    text:multilingualSupport true ;
    --- End diff --
    
    Is this change necessary?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106145236
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    --- End diff --
    
    This is OK. However, I noticed that TextIndex.updateEntity is never actually called from within Jena - there are only addition and deletion events but no updates as such. It is possible that this method will be removed from jena-text.
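
The reason delegating `updateEntity` to `addEntity` is acceptable can be shown with a small in-memory model of upsert semantics: an add creates the document if it is missing and otherwise appends to the field. This is only an illustrative sketch under that assumption; `UpsertSketch` and its methods are hypothetical names and do not use the Elasticsearch client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UpsertSketch {

    // entity id -> (field name -> list of values)
    private final Map<String, Map<String, List<String>>> docs = new HashMap<>();

    /** Creates the document if it is missing, otherwise appends the value to the field. */
    public void add(String id, String field, String value) {
        docs.computeIfAbsent(id, k -> new HashMap<>())
            .computeIfAbsent(field, k -> new ArrayList<>())
            .add(value);
    }

    /** Under upsert semantics an update is the same operation as an add. */
    public void update(String id, String field, String value) {
        add(id, field, value);
    }

    /** Returns the values stored for a field, or an empty list if there are none. */
    public List<String> values(String id, String field) {
        return docs.getOrDefault(id, new HashMap<>())
                   .getOrDefault(field, new ArrayList<>());
    }

    public static void main(String[] args) {
        UpsertSketch index = new UpsertSketch();
        index.add("http://example/s1", "label", "first");
        index.update("http://example/s1", "label", "second");
        System.out.println(index.values("http://example/s1", "label"));
    }
}
```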



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/jena/pull/227



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106427299
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    --- End diff --
    
    Cool
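
One point about the quoted `addEntity` and `deleteEntity` code: both build their update script with `String.replaceAll`, which treats `$` and `\` in the replacement text specially, and the value is placed inside a single-quoted literal without escaping. A literal containing `'`, `$` or `\` can therefore break the script or be altered, and a `null` field name (no field set on the entity) would fail inside `replaceAll`. The sketch below shows one way to fill such a template more safely, using literal replacement and escaping. It is an assumption-laden illustration, not the code from the pull request; `ScriptTemplate`, `escapeSingleQuoted` and `fill` are hypothetical names.

```java
public class ScriptTemplate {

    /** Escapes backslashes and single quotes so the value can sit inside a '...' literal. */
    public static String escapeSingleQuoted(String value) {
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }

    /**
     * Fills the <fieldName> and <fieldValue> placeholders.
     * String.replace does literal replacement, so '$' and '\' in the value are not
     * interpreted the way String.replaceAll would interpret them.
     */
    public static String fill(String template, String fieldName, String fieldValue) {
        if (fieldName == null || fieldValue == null) {
            throw new IllegalArgumentException("entity has no indexed field set");
        }
        return template
                .replace("<fieldName>", fieldName)
                .replace("<fieldValue>", escapeSingleQuoted(fieldValue));
    }

    public static void main(String[] args) {
        String template = "ctx._source.<fieldName>.add('<fieldValue>')";
        System.out.println(fill(template, "label", "O'Neil costs $5"));
    }
}
```

Passing the value as a script parameter rather than embedding it in the script text would avoid escaping altogether; that variant is not shown here because it depends on the Elasticsearch client version in use.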



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106429946
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,6 +81,51 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <dependency>
    --- End diff --
    
    Will test and remove if unnecessary



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107435245
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +141,77 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
    +
    +            <!-- Required to bypass Embedded ES security checks, especially JAR Hell-->
    +            <argLine>-Dtests.security.manager=false</argLine>
    --- End diff --
    
    Is this still needed if we are not using the embedded ES?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106431068
  
    --- Diff: jena-text/text-config.ttl ---
    @@ -50,6 +50,7 @@ text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
     <#indexLucene> a text:TextIndexLucene ;
         #text:directory <file:Lucene> ;
         text:directory "mem" ;
    +    text:multilingualSupport true ;
    --- End diff --
    
    Same as above



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Thanks for taking a look @ajs6f. I will wait for your review before proceeding



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107672301
  
    --- Diff: jena-text/src/main/resources/text-config-es.ttl ---
    @@ -0,0 +1,65 @@
    +    # Licensed to the Apache Software Foundation (ASF) under one
    +    # or more contributor license agreements.  See the NOTICE file
    +    # distributed with this work for additional information
    +    # regarding copyright ownership.  The ASF licenses this file
    +    # to you under the Apache License, Version 2.0 (the
    +    # "License"); you may not use this file except in compliance
    +    # with the License.  You may obtain a copy of the License at
    +    #
    +    #     http://www.apache.org/licenses/LICENSE-2.0
    +    #
    +    # Unless required by applicable law or agreed to in writing, software
    +    # distributed under the License is distributed on an "AS IS" BASIS,
    +    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    # See the License for the specific language governing permissions and
    +    # limitations under the License.
    +
    + ## Example of a TDB dataset and text index for ElasticSearch
    +
    +@prefix :        <http://localhost/jena_example/#> .
    +@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    +@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    +@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    +@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
    +@prefix text:    <http://jena.apache.org/text#> .
    +
    +# TDB
    +[] ja:loadClass "org.apache.jena.tdb.TDB" .
    +tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    +tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    +
    +# Text
    +[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
    +text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    +text:TextIndexES      rdfs:subClassOf   text:TextIndex .
    +
    +## ---------------------------------------------------------------
    +## This URI must be fixed - it's used to assemble the text dataset.
    +
    +:text_dataset rdf:type     text:TextDataset ;
    +    text:dataset   <#dataset> ;
    +    text:index     <#indexES> ;
    +    .
    +
    +<#dataset> rdf:type      tdb:DatasetTDB ;
    +    tdb:location "--mem--" ;
    +    .
    +
    +<#indexES> a text:TextIndexES ;
    +    text:serverList "127.0.0.1:9300" ; # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
    +    text:clusterName "elasticsearch" ; # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
    +    text:shards "1" ;                  # The number of shards for the index. Defaults to 1
    +    text:replicas "1" ;                # The number of replicas for the index. Defaults to 1
    +    text:indexName "jena-text" ;       # Name of the Index. defaults to jena-text
    +    text:multilingualSupport true ;
    --- End diff --
    
    This is no longer necessary and should be removed.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @anujgandharv I tested it and got the same error.
    I found some hints in this issue report: https://github.com/elastic/elasticsearch/issues/22494
    The problem seems to be that the ES test framework is incompatible with the ES Maven plugin. So you cannot use both kinds of tests in the same project, at least not without some extra effort.
    
    What I did was
    - remove (actually rename) TestTextIndexES.java
    - remove the reference to TestTextIndexES.java from TS_Text.java
    - remove the dependency on `org.elasticsearch.test` as well as the following block dealing with junit and hamcrest from pom.xml
    
    After these changes I was able to run the integration test using `mvn clean verify`:
    ```
    -------------------------------------------------------
    Running org.apache.jena.query.text.it.TextIndexESIT
    ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
    Testing simple
    Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.841 sec - in org.apache.jena.query.text.it.TextIndexESIT
    
    Results :
    
    Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
    ```


[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    I think it would help the review if you rebased your branch on top of current apache master. Right now it's hard to see which changes are your ES changes and which came from the included commits. Dropping Solr in particular caused large diffs, and now all these commits are mixed up in this PR.
    
    I haven't yet run the included tests. They don't seem to be run from `mvn test`. I would prefer tests that are automatically run from `mvn test`, since then they get tested in the CI builds, if that's possible. 


[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106636373
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    So, just to be on the same page, here is my current Add and Query algorithm. Can you please clarify my assumptions?
    
    _ASSUMPTION_: `rdfs:label` is mapped to a field with name: `label` in my Index
    ## _ADD Functionality_
    * If I get a triple like this: `:x1 rdfs:label "X2 word"@en .`, I store it in ES under the field `label_en` ONLY, with value `["X2 word"]`. Note that it is an Array.
    * Next, if I get the triple `:x1 rdfs:label "X1 another word" .`, I store it in ES under the field `label` ONLY, with value `["X1 another word"]`. Note that it is an Array.
    Thus at this point my ES Document with id `http://example/x1` has two fields: `label` and `label_en`.
    * Next, if I get another triple, `:x1 rdfs:label "X3 yet another word"@en .`, I store it in ES under the field `label_en` ONLY. Now the `label_en` field has values `["X2 word", "X3 yet another word"]`.
    * Finally, if I get another triple, `:x1 rdfs:label "X3 Simple" .`, I store it in ES under the field `label` ONLY. Now the `label` field has values `["X1 another word", "X3 Simple"]`.
    
    Let's assume the above is the state of the Index at Query Time. i.e. I have ONE Document in ES with id: `http://example/x1` and that this document has two fields:
    `label`:`["X1 another word", "X3 Simple"]`
    `label_en`:`["X2 word", "X3 yet another word"]`
    
    ## _QUERY Functionality_
    ### CASE 1: When Query is `('X3')`
    In this case the ES implementation will see that the user has not requested a language-specific search, and thus it will create a text query that searches all the related fields, i.e. language-specific AS WELL AS language-agnostic. The text query will have the syntax: `X3`
    Thus the results returned will contain both the `label` and the `label_en` field.
    Next, since the Node Property is NULL for the above query, I will take `docDef.getPrimaryField()` (which in my case has the value `label`) as the field whose value I should return. Thus I take the value of the `label` field, which is `["X1 another word", "X3 Simple"]`, and currently return the first value: `X1 another word`. Clearly this is NOT the right value to return. I should instead return ALL the literals that have `X3` as a string in them. Is that the correct assumption?
    
    ### CASE 2: When Query is `rdfs:label 'X3'`
    As above, I am doing both a language-specific and a language-agnostic query, and the actual query string will look like: `label*:X3`. Thus, as above, I will return the wrong value `X1 another word`. I should instead return ALL the literals that have `X3` as a string in them. Is that the correct assumption?
    
    ### CASE 3: When Query is `rdfs:label 'X3' lang:en`
    In this case, the query string will look like: `label_en:X3`. Thus I am searching ONLY the language-specific field and NOT all the fields. Once I get the result, I then return the first value ONLY. I should instead return ALL the literals that have `X3` as a string in them. Is that the correct assumption?
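
    To make the three cases concrete, here is a minimal in-memory sketch of the behaviour I am asking about: every matching literal is returned rather than only the first one. It does not touch ES at all. The class `LabelIndexSketch` and its methods are hypothetical names used only for illustration, and the plain substring match is an assumption standing in for whatever analysis ES actually applies.

    ```java
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical in-memory stand-in for ONE ES document; not part of the PR.
    public class LabelIndexSketch {
        // field name -> stored literal values (always a list, as described above)
        private final Map<String, List<String>> doc = new LinkedHashMap<>();

        // "label" + "en" -> "label_en"; no language tag -> "label"
        static String fieldName(String field, String lang) {
            return (lang == null || lang.isEmpty()) ? field : field + "_" + lang;
        }

        // ADD: append the literal to the list stored under the (possibly language-specific) field
        public void add(String field, String lang, String value) {
            doc.computeIfAbsent(fieldName(field, lang), k -> new ArrayList<>()).add(value);
        }

        // QUERY:
        //   field == null -> search every field                          (CASE 1)
        //   lang  == null -> search the field and its language variants  (CASE 2)
        //   otherwise     -> search only the language-specific field     (CASE 3)
        public List<String> query(String field, String lang, String term) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : doc.entrySet()) {
                String name = e.getKey();
                boolean selected;
                if (field == null) {
                    selected = true;
                } else if (lang == null) {
                    selected = name.equals(field) || name.startsWith(field + "_");
                } else {
                    selected = name.equals(fieldName(field, lang));
                }
                if (!selected) {
                    continue;
                }
                for (String value : e.getValue()) {
                    // Substring match is an assumption standing in for ES text analysis.
                    if (value.contains(term)) {
                        hits.add(value);
                    }
                }
            }
            return hits;
        }
    }
    ```

    With the four triples from the ADD section loaded, a CASE 1 or CASE 2 query for `X3` would yield both `X3 Simple` and `X3 yet another word`, while CASE 3 would yield only `X3 yet another word`.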
    



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106144221
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    --- End diff --
    
    Although this is final and only changed from the constructor, I think it should be changed to "indexName" to reflect that it's a settable parameter and not really a constant


[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Squashed into a single commit and merged into apache master:
    https://github.com/apache/jena/commit/1c1325c5646f3fd908bf56db0480759a22dcd68c
    
    I decided to squash because there were almost 30 commits, some of them touching and then reverting unrelated files etc. The squashed version that I merged to master is clean, but doesn't incorporate all the history. For the history, GitHub will still retain this PR.


[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106420753
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/TestTextIndexES.java ---
    @@ -0,0 +1,184 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text;
    +
    +
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.test.ESIntegTestCase;
    +import org.junit.Assert;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + *
    + * Integration test for {@link TextIndexES} class
    + * ES Integration test depends on security policies that may sometime not be loaded properly.
    + * If you find any issues regarding security set the following VM argument to resolve the issue:
    + * -Dtests.security.manager=false
    + *
    + */
    +@ESIntegTestCase.ClusterScope()
    +public class TestTextIndexES extends ESIntegTestCase {
    --- End diff --
    
    The main issue is that embedded ElasticSearch does not come with the "painless" plugin, and ElasticSearch has stopped releasing the painless plugin as Maven artifacts. Therefore I cannot have tests that rely on the script portion being executed. In order to do extensive testing, I need this plugin. That is also the reason why the delete tests are currently ignored.
    Let me see if I can add some more unit tests, although IMO they would still be simple ones.
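
    As a rough idea of what such simple tests could cover without the plugin, the list handling that the two update scripts in `TextIndexES` perform (append a value on add, remove the first matching value on delete) can be mirrored in plain Java and checked on its own. This is only a sketch: `ScriptSemanticsSketch` and its methods are hypothetical, and it exercises neither painless nor ES itself.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical plain-Java mirror of the add/delete script logic; not part of the PR.
    public class ScriptSemanticsSketch {

        // Mirrors the add script: a missing or empty field becomes [value], otherwise value is appended.
        static void append(Map<String, List<String>> source, String field, String value) {
            List<String> values = source.get(field);
            if (values == null || values.isEmpty()) {
                values = new ArrayList<>();
                source.put(field, values);
            }
            values.add(value);
        }

        // Mirrors the delete script: remove the first occurrence of value, if the field holds it.
        static void removeFirst(Map<String, List<String>> source, String field, String value) {
            List<String> values = source.get(field);
            if (values == null || values.isEmpty()) {
                return;
            }
            int idx = values.indexOf(value);
            if (idx >= 0) {
                values.remove(idx); // int argument: removes by position
            }
        }
    }
    ```

    Tests against a helper like this would only confirm the intended list semantics; the scripts themselves would still need an ES instance with painless available.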


[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106404182
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    --- End diff --
    
    As discussed already, this method may go away. For now, just don't waste any more time on it.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107673859
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/it/TextIndexESIT.java ---
    @@ -0,0 +1,282 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text.it;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.query.text.Entity;
    +import org.apache.jena.query.text.TextHit;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.TimeUnit;
    +
    +/**
    + * Integration test class for {@link org.apache.jena.query.text.TextIndexES}
    + */
    +public class TextIndexESIT extends BaseESTest {
    +
    +    @Test
    +    public void testAddEntity() {
    +        String labelKey = "label";
    +        String labelValue = "this is a sample Label";
    +        Assert.assertNotNull(classToTest);
    +        Entity entityToAdd = entity("http://example/x3", labelKey, labelValue);
    +        GetResponse response = addEntity(entityToAdd);
    +        Assert.assertTrue(response.getSource().containsKey(labelKey));
    +        Assert.assertEquals(labelValue, ((List)response.getSource().get(labelKey)).get(0));
    +    }
    +
    +    @Test
    +    public void testDeleteEntity() {
    +        testAddEntity();
    +        String labelKey = "label";
    +        String labelValue = "this is a sample Label";
    +        //Now Delete the entity
    +        classToTest.deleteEntity(entity("http://example/x3", labelKey, labelValue));
    +
    +        //Try to find it
    +        GetResponse response = transportClient.prepareGet(INDEX_NAME, DOC_TYPE, "http://example/x3").get();
    +        //It Should Exist
    +        Assert.assertTrue(response.isExists());
    +        //But the field value should now be empty
    +        Assert.assertEquals("http://example/x3", response.getId());
    +        Assert.assertTrue(response.getSource().containsKey(labelKey));
    +        Assert.assertEquals(0, ((List)response.getSource().get(labelKey)).size());
    +    }
    +
    +    @Test
    +    public void testDeleteWhenNoneExists() {
    +
    +        GetResponse response = transportClient.prepareGet(INDEX_NAME, DOC_TYPE, "http://example/x3").get();
    +        Assert.assertFalse(response.isExists());
    +        Assert.assertNotNull(classToTest);
    +        classToTest.deleteEntity(entity("http://example/x3", "label", "doesnt matter"));
    +        response = transportClient.prepareGet(INDEX_NAME, DOC_TYPE, "http://example/x3").get();
    +        Assert.assertFalse(response.isExists());
    +
    +    }
    +
    +    @Test
    +    public void testQuery() {
    +        testAddEntity();
    +        // This will search for value "this" across all the fields in all the documents
    +        List<TextHit> result =  classToTest.query(RDFS.label.asNode(), "this", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(1, result.size());
    +
    +        //This will search for value "this" only in the label field
    +        result =  classToTest.query(RDFS.label.asNode(), "label:this", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(1, result.size());
    +
    +        //This will search for value "this" in the label_en field, if it exists. In this case it doesnt so we should get zero results
    +        result =  classToTest.query(RDFS.label.asNode(), "label:this AND lang:en", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(0, result.size());
    +
    +    }
    +
    +    @Test
    +    public void testQueryWhenNoneExists() {
    +        List<TextHit> result =  classToTest.query(RDFS.label.asNode(), "this", 1);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(0, result.size());
    +    }
    +
    +    @Test
    +    public void testGet() {
    +        testAddEntity();
    +        //Now Get the same entity
    +        Map<String, Node> response = classToTest.get("http://example/x3");
    +        Assert.assertNotNull(response);
    +        Assert.assertEquals(2, response.size());
    +    }
    +
    +    @Test
    +    public void testGetWhenNoneExists() {
    +        Map<String, Node> response = classToTest.get("http://example/x3");
    +        Assert.assertNotNull(response);
    +        Assert.assertEquals(0, response.size());
    +    }
    +
    +    /**
    +     * This is an elaborate test that does the following:
    +     * 1. Create a Document with ID: "http://example/x3" , label: Germany and lang:en
    +     * 2. Makes sure the document is created successfully and is searchable based on the label
    +     * 3. Next add another label to the same Entity with ID: "http://example/x3", label:Deutschland and lang:de
    +     * 4. Makes sure that the document is searchable both with old (Germany) and new (Deutschland) values.
    +     * 5. Next, it deletes the value: Germany created in step 1.
    +     * 6. Makes sure that document is searchable with value: Deutschland but NOT with value: Germany
    +     * 7. Finally, delete the value: Deutschland
    +     * 8. The document should not be searchable with value: Deutschland
    +     * 9. The document should still exist
    +     */
    +    @Test
    +    public void testMultipleValuesinMultipleLanguages() throws InterruptedException{
    +        addEntity(entity("http://example/x3", "label", "Germany", "en"));
    +        List<TextHit> result =  classToTest.query(RDFS.label.asNode(), "Germany", 10);
    +        Assert.assertNotNull(result);
    +        Assert.assertEquals(1, result.size());
    +        Assert.assertEquals("http://example/x3", result.get(0).getNode().getURI());
    +        //Next add another label to the same entity
    +        addEntity(entity("http://example/x3", "label", "Deutschland", "de"));
    +        //Query with old value
    +        result =  classToTest.query(RDFS.label.asNode(), "Germany", 10);
    +        Assert.assertEquals(1, result.size());
    +        Assert.assertEquals("http://example/x3", result.get(0).getNode().getURI());
    +
    +        //Query with new value
    +        result =  classToTest.query(RDFS.label.asNode(), "Deutschland", 10);
    +        Assert.assertEquals(1, result.size());
    +        Assert.assertEquals("http://example/x3", result.get(0).getNode().getURI());
    +
    +        //Now lets delete the German label
    --- End diff --
    
    minor nitpick, but this is the "Germany" label, not the German label :)



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106144910
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    --- End diff --
    
    I wonder whether this class works properly when multilingual mode is disabled. Particularly for the `rdfs:label "Berlin"@de, "Berlin"@en` case where one of them is removed, does the index still understand that one "Berlin" value should be retained? I suggest that what is currently the multilingual mode should be the only possible mode.
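
To make the concern above concrete, here is a minimal in-memory sketch of the two field-naming modes. It is not code from the pull request: `LabelFieldSketch` and its methods are hypothetical stand-ins for a single indexed document, and it assumes that removal uses the same `field_lang` naming as `addEntity` and, like the update scripts in the diff, drops only the first matching value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one indexed document; not part of the PR. */
class LabelFieldSketch {
    // field name -> list of values, mirroring the array-valued fields in the diff
    private final Map<String, List<String>> source = new HashMap<>();
    private final boolean multilingual;

    LabelFieldSketch(boolean multilingual) {
        this.multilingual = multilingual;
    }

    // Multilingual mode stores values under "<field>_<lang>", as addEntity does.
    private String fieldName(String field, String lang) {
        boolean hasLang = lang != null && !lang.isEmpty();
        return (multilingual && hasLang) ? field + "_" + lang : field;
    }

    void add(String field, String value, String lang) {
        source.computeIfAbsent(fieldName(field, lang), k -> new ArrayList<>()).add(value);
    }

    // Assumption: removal resolves the field name the same way as add.
    void remove(String field, String value, String lang) {
        List<String> values = source.get(fieldName(field, lang));
        if (values != null) {
            values.remove(value); // removes the first occurrence only
        }
    }

    List<String> values(String fieldName) {
        return source.getOrDefault(fieldName, new ArrayList<>());
    }
}
```

With multilingual mode on, adding "Berlin" for `de` and `en` and then removing the `de` value leaves `label_de` empty and `label_en` intact. With it off, both values land in `label`, and one "Berlin" survives a removal only because duplicates are kept; the index can no longer tell which language the survivor belongs to.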



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107670675
  
    --- Diff: jena-parent/pom.xml ---
    @@ -275,6 +276,27 @@
             <version>${ver.spatial4j}</version>
           </dependency>
     
    +      <!-- ES dependencies-->
    +      <dependency>
    +        <groupId>org.elasticsearch</groupId>
    +        <artifactId>elasticsearch</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.elasticsearch.client</groupId>
    +        <artifactId>transport</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +      </dependency>
    +
    +
    +      <dependency>
    --- End diff --
    
    I don't think the dependency on lucene-test-framework is needed.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106870357
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We won't return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    This change was made in the latest commit.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106199137
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First, search whether the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    +            } else {
    +                //The field does not exist, so we create one
    +                script = "ctx._source." + fieldToAdd+" =['"+ fieldValueToAdd + "']";
    +            }
    +
    +
    +
    +            UpdateRequest upReq = new UpdateRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(script))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +
    +
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +        //First, search whether the field exists or not
    +        SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.existsQuery(fieldToRemove))
    +                .get();
    +
    +        String script = null;
    +        if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    --- End diff --
    
    Agree completely. Will remove unwanted calls and do everything in a single script.
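
    A minimal sketch of what "everything in a single script" could look like,
    assuming the script text is built as a plain string before being wrapped in
    a Script. The class name SingleScriptSketch and its helpers (escape,
    addScript, removeScript) are hypothetical and not part of the pull request;
    the sketch only builds the script text and makes no Elasticsearch calls.

```java
public class SingleScriptSketch {

    // Hypothetical helper: escape backslashes and single quotes so the value
    // can sit inside a single-quoted string literal in the script.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }

    // One script that creates the list when the field is missing or empty,
    // and appends to it otherwise, so no separate exists query is needed.
    static String addScript(String field, String value) {
        String f = "ctx._source." + field;
        String v = escape(value);
        return "if(" + f + " == null || " + f + ".empty) "
                + "{" + f + "=['" + v + "']} else {" + f + ".add('" + v + "')}";
    }

    // One script that removes the value only when the field holds it.
    static String removeScript(String field, String value) {
        String f = "ctx._source." + field;
        String v = escape(value);
        return "if(" + f + " != null && (" + f + ".empty != true) "
                + "&& (" + f + ".indexOf('" + v + "') >= 0)) "
                + "{" + f + ".remove(" + f + ".indexOf('" + v + "'))}";
    }

    public static void main(String[] args) {
        System.out.println(addScript("label", "O'Brien"));
        System.out.println(removeScript("label", "O'Brien"));
    }
}
```

    Plain concatenation is used here instead of String.replaceAll because
    replaceAll treats "$" and "\" in the replacement text specially, which
    could corrupt literal values containing those characters. Passing the value
    as a script parameter, where the client version supports it, would avoid
    the quoting problem altogether.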



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106199017
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First search whether the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    --- End diff --
    
    I will try to make it more performant.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @ajs6f Thanks a lot!
    
    Since it seems that remaining issues can be sorted out after merging, I will continue with that.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r108439061
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/it/BaseESTest.java ---
    @@ -0,0 +1,111 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.jena.query.text.it;
    +
    +import org.apache.jena.query.text.EntityDefinition;
    +import org.apache.jena.query.text.TextIndexConfig;
    +import org.apache.jena.query.text.TextIndexES;
    +import org.apache.jena.vocabulary.RDFS;
    +import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.junit.After;
    +import org.junit.Assert;
    +import org.junit.Before;
    +import org.junit.BeforeClass;
    +
    +import java.net.InetAddress;
    +import java.net.UnknownHostException;
    +
    +/**
    + * Base Class for ElasticSearch based Integration tests.
    + */
    +public abstract class BaseESTest {
    +
    +    protected static TransportClient transportClient;
    +
    +    private final static String ADDRESS = "127.0.0.1";
    +    private final static int PORT = 9500;
    +    private final static String CLUSTER_NAME = "elasticsearch";
    +    protected final static String INDEX_NAME = "jena-text";
    +
    +    protected static TextIndexES classToTest;
    +
    +    static final String DOC_TYPE = "text";
    +
    +    /**
    +     * Make sure that we have connectivity to the locally running ES node.
    +     * The ES is started during the pre-integration-test phase
    +     */
    +    @BeforeClass
    +    public static void setupTransportClient() {
    +
    +        Settings settings = Settings.builder().put("cluster.name", CLUSTER_NAME).build();
    +        transportClient = new PreBuiltTransportClient(settings);
    +        try {
    +            transportClient.addTransportAddress(
    +                    new InetSocketTransportAddress(InetAddress.getByName(ADDRESS), PORT)
    +            );
    +        } catch (UnknownHostException ex) {
    +            Assert.fail("Failed to create transport client" + ex.getMessage());
    +        }
    +        classToTest = new TextIndexES(config(), transportClient, INDEX_NAME);
    +        Assert.assertNotNull("Transport client was not created successfully", transportClient);
    +
    +
    +    }
    +
    +    /**
    +     * Make sure that we always start with a clean index.
    +     * This will help keep the tests isolated
    +     * @throws Exception
    +     */
    +    @Before
    --- End diff --
    
    As a future note, rather than manually setting up and clearing out resources like this, we could use a JUnit `@Rule` `ExternalResource` for concision, clarity, and easier reuse.
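
For illustration, JUnit 4's `org.junit.rules.ExternalResource` exposes a `before()` hook and an `after()` hook and runs the test between them, with `after()` guaranteed by a `finally` block. The sketch below mirrors that lifecycle with a stand-in base class so that it compiles without JUnit or a running ES node. All names in it (`ResourceSketch`, `CleanIndexResource`, `around`) are hypothetical and not part of the PR; in a real test the resource would extend `ExternalResource` and be declared as a field annotated with `@Rule` or `@ClassRule`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.junit.rules.ExternalResource:
// the same before()/after() hooks, run around a test body.
abstract class ResourceSketch {
    protected void before() {}
    protected void after() {}

    // Mirrors what a JUnit rule does: set up, run the test, always tear down.
    final void around(Runnable testBody) {
        before();
        try {
            testBody.run();
        } finally {
            after();
        }
    }
}

// Hypothetical resource that records lifecycle events instead of calling ES.
// A real version would create the index in before() and delete it in after().
class CleanIndexResource extends ResourceSketch {
    final List<String> events = new ArrayList<>();
    private final String indexName;

    CleanIndexResource(String indexName) {
        this.indexName = indexName;
    }

    @Override
    protected void before() {
        events.add("create " + indexName);
    }

    @Override
    protected void after() {
        events.add("delete " + indexName);
    }
}

class ExternalResourceDemo {
    public static void main(String[] args) {
        CleanIndexResource index = new CleanIndexResource("jena-text");
        index.around(() -> index.events.add("test body"));
        // prints [create jena-text, test body, delete jena-text]
        System.out.println(index.events);
    }
}
```

The point of the pattern is that setup and cleanup live in one reusable class, so each test class declares the resource once instead of repeating `@Before` and `@After` methods.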



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106238255
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    --- End diff --
    
    Right. I suppose that's fine if you don't intend to use language-specific analyzers like the Lucene index does in multilingual mode (see `org.apache.jena.query.text.analyzer.MultilingualAnalyzer` and `Util` for how these are hooked up).
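
For reference, the multilingual branch quoted in this thread chooses the indexed field name from the entity's language: the field name gets a `_<lang>` suffix when multilingual support is on and the language is non-empty, and is used unchanged otherwise. A minimal sketch of that naming rule in plain Java follows. The class and method names (`LangFieldSketch`, `indexedFieldName`) are hypothetical and not part of the PR; only the rule itself comes from the quoted `addEntity` code.

```java
// Hypothetical helper isolating the field-naming rule from the quoted addEntity code.
class LangFieldSketch {

    // Returns "<field>_<language>" in multilingual mode when a language is present,
    // otherwise the plain field name.
    static String indexedFieldName(String field, String language, boolean isMultilingual) {
        if (isMultilingual && language != null && !language.isEmpty()) {
            return field + "_" + language;
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(indexedFieldName("label", "en", true));   // label_en
        System.out.println(indexedFieldName("label", "", true));     // label
        System.out.println(indexedFieldName("label", "en", false));  // label
    }
}
```

Under this scheme the language only affects the field name; nothing in the rule itself selects a language-specific analyzer.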



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106429891
  
    --- Diff: jena-parent/pom.xml ---
    @@ -275,6 +276,75 @@
             <version>${ver.spatial4j}</version>
           </dependency>
     
    +      <!-- ES dependencies-->
    +      <dependency>
    +        <groupId>org.elasticsearch</groupId>
    +        <artifactId>elasticsearch</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +        <exclusions>
    +          <exclusion>
    +            <groupId>commons-logging</groupId>
    +            <artifactId>commons-logging</artifactId>
    +          </exclusion>
    +          <exclusion>
    +            <groupId>org.hamcrest</groupId>
    +            <artifactId>hamcrest-core</artifactId>
    +          </exclusion>
    +        </exclusions>
    +
    +      </dependency>
    +
    +      <dependency>
    +        <groupId>org.elasticsearch.client</groupId>
    +        <artifactId>transport</artifactId>
    +        <version>${ver.elasticsearch}</version>
    +        <exclusions>
    +          <exclusion>
    +            <groupId>commons-logging</groupId>
    +            <artifactId>commons-logging</artifactId>
    +          </exclusion>
    +          <exclusion>
    +            <groupId>org.hamcrest</groupId>
    +            <artifactId>hamcrest-core</artifactId>
    +          </exclusion>
    +        </exclusions>
    +      </dependency>
    +
    +
    +      <dependency>
    --- End diff --
    
    I will test this on my end and remove them if they are not required.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @anujgandharv That was what I meant - get rid of all the currently written ES test classes completely by moving all the existing unit tests to the new integration tests.
    
    I've suggested two sets of test scenarios in these review comments above:
    https://github.com/apache/jena/pull/227#discussion_r106405506
    https://github.com/apache/jena/pull/227#discussion_r106406514
    
    If you can implement those as integration tests then at least from my perspective that would be enough to consider this ready for merging :) I still want to take another pass at the code but perhaps I'll wait for your integration tests first as they may affect the code too.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106429255
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    --- End diff --
    
    OK. 



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    This PR can be closed. I don't have sufficient rights to do that and GitHub didn't notice that the code already got in, probably because I squashed the commits.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106143437
  
    --- Diff: jena-text/src/main/resources/text-config-es.ttl ---
    @@ -0,0 +1,65 @@
    +    # Licensed to the Apache Software Foundation (ASF) under one
    +    # or more contributor license agreements.  See the NOTICE file
    +    # distributed with this work for additional information
    +    # regarding copyright ownership.  The ASF licenses this file
    +    # to you under the Apache License, Version 2.0 (the
    +    # "License"); you may not use this file except in compliance
    +    # with the License.  You may obtain a copy of the License at
    +    #
    +    #     http://www.apache.org/licenses/LICENSE-2.0
    +    #
    +    # Unless required by applicable law or agreed to in writing, software
    +    # distributed under the License is distributed on an "AS IS" BASIS,
    +    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    # See the License for the specific language governing permissions and
    +    # limitations under the License.
    +
    + ## Example of a TDB dataset and text index for ElasticSearch
    +
    +@prefix :        <http://localhost/jena_example/#> .
    +@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    +@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    +@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    +@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
    +@prefix text:    <http://jena.apache.org/text#> .
    +
    +# TDB
    +[] ja:loadClass "org.apache.jena.tdb.TDB" .
    +tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    +tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    +
    +# Text
    +[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
    +text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    +text:TextIndexES      rdfs:subClassOf   text:TextIndex .
    +
    +## ---------------------------------------------------------------
    +## This URI must be fixed - it's used to assemble the text dataset.
    +
    +:text_dataset rdf:type     text:TextDataset ;
    +    text:dataset   <#dataset> ;
    +    text:index     <#indexES> ;
    +    .
    +
    +<#dataset> rdf:type      tdb:DatasetTDB ;
    +    tdb:location "--mem--" ;
    +    .
    +
    +<#indexES> a text:TextIndexES ;
    +    text:serverList "127.0.0.1:9300" ; # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
    +    text:clusterName "elasticsearch" ; # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
    +    text:shards "1" ;                  # The number of shards for the index. Defaults to 1
    +    text:replicas "1" ;                # The number of replicas for the index. Defaults to 1
    +    text:indexName "jena-text" ;       # Name of the Index. defaults to jena-text
    +    text:multilingualSupport true ;
    +    text:entityMap <#entMap> ;
    +    .
    +
    +<#entMap> a text:EntityMap ;
    +    text:entityField      "intel" ; # Defines the Document Type in the ES Index
    --- End diff --
    
    The field name is puzzling; usually "uri" is used. The comment is also strange: what does this have to do with the Document Type in ES?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106147202
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            //Currently ignoring Graph field based indexing
    +//            if (docDef.getGraphField() != null) {
    +//                builder = builder.field(docDef.getGraphField(), entity.getGraph());
    +//            }
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(INDEX_NAME, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            /**
    +             * We are creating an upsert request here instead of a simple insert request.
    +             * The reason is we want to add a document if it does not exist with the given Subject Id (URI).
    +             * But if the document exists with the same Subject Id, we want to do an update to it instead of deleting it and
    +             * then creating it with only the latest field values.
    +             * This functionality is called Upsert functionality and more can be learned about it here:
    +             * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
    +             */
    +
    +            //First search whether the field exists or not
    +            SearchResponse existsResponse = client.prepareSearch(INDEX_NAME)
    +                    .setTypes(docDef.getEntityField())
    +                    .setQuery(QueryBuilders.existsQuery(fieldToAdd))
    +                    .get();
    +            String script;
    +            if(existsResponse != null && existsResponse.getHits() != null && existsResponse.getHits().totalHits() > 0) {
    +                //This means field already exists and therefore we should append to it
    +                script = "ctx._source." + fieldToAdd+".add('"+ fieldValueToAdd + "')";
    --- End diff --
    
    The ES documentation recommends that variable parts of a script be expressed as parameters. This is more efficient as the script may be compiled just once instead of every time. But it's up to you whether you want to do it like that.
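
    As a rough illustration of the suggestion above, the following sketch contrasts the concatenated script source from the diff with a variant that keeps the value in a named parameter. The class and method names are hypothetical and not part of the pull request, and `params.fieldValue` assumes the Painless scripting language of Elasticsearch 5.x. How the source and the parameter map are handed to the Elasticsearch `Script` class depends on the client version and is deliberately left out.

```java
import java.util.Collections;
import java.util.Map;

/**
 * Sketch only: the names here are hypothetical and do not appear in the pull request.
 */
public class ParameterizedScriptSketch {

    // Script source with a placeholder for the field name only. The value is
    // referenced as params.fieldValue (Painless syntax, an assumption), so the
    // source text is the same for every value written to a given field.
    static final String APPEND_TEMPLATE = "ctx._source.%s.add(params.fieldValue)";

    /** Script source as built in the diff: the value is spliced into the source text. */
    static String inlinedScript(String field, String value) {
        return "ctx._source." + field + ".add('" + value + "')";
    }

    /** Script source with the value left as a named parameter. */
    static String parameterizedScript(String field) {
        return String.format(APPEND_TEMPLATE, field);
    }

    /** Parameters that would be sent alongside the script source. */
    static Map<String, Object> scriptParams(String value) {
        return Collections.singletonMap("fieldValue", value);
    }

    public static void main(String[] args) {
        String field = "label_en";
        String value = "O'Reilly";
        // The inlined form changes with every value, and the quote in the value
        // ends the string literal early.
        System.out.println(inlinedScript(field, value));
        // The parameterized form stays constant per field; the value travels separately.
        System.out.println(parameterizedScript(field));
        System.out.println(scriptParams(value));
    }
}
```

    Besides allowing the compiled script to be reused, keeping the value out of the source text avoids the quoting problem that appears when a literal contains a single quote.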



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106404092
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occurred while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    --- End diff --
    
    Not sure whether this comment is relevant, since the document itself is never deleted AFAICT.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106154642
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    +
    +    static final String CLUSTER_NAME = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS = "number_of_replicas";
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) throws Exception{
    +
    +        this.INDEX_NAME = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        if(client == null) {
    +
    +            LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +            Settings settings = Settings.builder()
    +                    .put(CLUSTER_NAME, esSettings.getClusterName()).build();
    +            List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +            for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                addresses.add(addr);
    +            }
    +
    +            InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +            client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +            LOGGER.debug("Successfully initialized the client");
    +        }
    +
    +
    +        IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(INDEX_NAME)).get();
    +        if(!exists.isExists()) {
    +            Settings indexSettings = Settings.builder()
    +                    .put(NUM_OF_SHARDS, esSettings.getShards())
    +                    .put(NUM_OF_REPLICAS, esSettings.getReplicas())
    +                    .build();
    +            LOGGER.debug("Index with name " + INDEX_NAME + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +            client.admin().indices().prepareCreate(INDEX_NAME).setSettings(indexSettings).get();
    +        }
    +
    +
    +
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.INDEX_NAME = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * not really sure what we need to roll back.
    +     */
    +    @Override
    +    public void rollback() {
    +       //Not sure what to do here
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    --- End diff --
    
    Doesn't matter to us in our use case.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @anujgandharv See here: https://jena.apache.org/download/maven.html#specifying-dependencies-on-snapshots



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106151289
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,65 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.query.Dataset;
    +import org.apache.jena.query.DatasetFactory;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition we cant have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    +
    +    public static void main(String[] args) {
    +
    +        queryData(loadData(createAssembler()));
    +    }
    +
    +
    +    private static Dataset createAssembler() {
    +        String assemblerFile = "text-config-es.ttl";
    +        Dataset ds = DatasetFactory.assemble(assemblerFile,
    +                "http://localhost/jena_example/#text_dataset") ;
    +        return ds;
    +    }
    +
    +    private static Dataset loadData(Dataset ds) {
    +        JenaTextExample1.loadData(ds, "data-es.ttl");
    +        return ds;
    +    }
    +
    +    /**
    +     * The data being queried from ElasticSearch is proper but what is getting printed is wrong.
    --- End diff --
    
    Yes, fixed.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106153769
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    --- End diff --
    
    done



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @anujgandharv Thanks for merging master, the diffs are now much cleaner!
    
    Regarding releases: Jena doesn't have scheduled releases. Traditionally there have been about two releases per year, but recently the goal has been to have more frequent releases, with around 3 month intervals. 3.1.1 was released in November 2016 and 3.2.0 in January. Judging by that alone, a 3.3.0 release could perhaps be made in a month or so. But this depends a lot on the state of the codebase (no known serious bugs etc.) and of course volunteer effort, so no guarantees.
    
    There are nightly Jena snapshots available from the Maven repositories, so soon after this hits master, a 3.3.0-SNAPSHOT build can be used. You can decide for yourself whether you want to depend on that snapshot (which obviously will change quite frequently) or to maintain a local branch and use that until the 3.3.0 release.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106401876
  
    --- Diff: jena-text/pom.xml ---
    @@ -115,6 +160,7 @@
               <includes>
                 <include>**/TS_*.java</include>
               </includes>
    +            <argLine>-Dtests.security.manager=false</argLine>
    --- End diff --
    
    Could add a comment here explaining why this is necessary.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r108437539
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,6 +81,32 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>junit</groupId>
    +          <artifactId>junit</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.apache.logging.log4j</groupId>
    --- End diff --
    
    Are these dependencies coming in to support ES logging? Can we not use `log4j-over-slf4j` for that?



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma The spring-elasticsearch IT throws exactly the same error that I am getting on my setup. What they do is ignore all the errors and assume there are none, so the tests in the [BaseTest.java](https://github.com/dadoonet/spring-elasticsearch/blob/master/src/test/java/fr/pilato/spring/elasticsearch/it/BaseTest.java#L52) classes are skipped.
    They use the Spring-specific [ESBeanFactory](https://github.com/dadoonet/spring-elasticsearch/blob/master/src/test/java/fr/pilato/spring/elasticsearch/it/annotation/AppConfig.java#L32) to instantiate an Elasticsearch client. Personally, I do not want to introduce a Spring dependency unnecessarily in Apache Jena, because Jena is not based on the Spring Framework. What are your thoughts?



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107681374
  
    --- Diff: jena-text/src/main/java/examples/JenaESTextExample.java ---
    @@ -0,0 +1,94 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package examples;
    +
    +import org.apache.jena.atlas.lib.StrUtils;
    +import org.apache.jena.query.*;
    +import org.apache.jena.sparql.util.QueryExecUtils;
    +
    +/**
    + * Simple example class to test the {@link org.apache.jena.query.text.assembler.TextIndexESAssembler}
    + * For this class to work properly, an elasticsearch node should be up and running, otherwise it will fail.
    + * You can find the details of downloading and running an ElasticSearch version here: https://www.elastic.co/downloads/past-releases/elasticsearch-5-2-1
    + * Unzip the file in your favourite directory and then execute the appropriate file under the bin directory.
    + * It will take less than a minute.
    + * In order to visualize what is written in ElasticSearch, you need to download and run Kibana: https://www.elastic.co/downloads/kibana
    + * To run kibana, just go to the bin directory and execute the appropriate file.
    + * We need to resort to this mechanism as ElasticSearch has stopped supporting embedded ElasticSearch.
    + *
    + * In addition we cant have it in the test package because ElasticSearch
    + * detects the thread origin and stops us from instantiating a client.
    + */
    +public class JenaESTextExample {
    +
    +    public static void main(String[] args) {
    +
    +        queryData(loadData(createAssembler()));
    +    }
    +
    +
    +    private static Dataset createAssembler() {
    +        String assemblerFile = "text-config-es.ttl";
    +        Dataset ds = DatasetFactory.assemble(assemblerFile,
    +                "http://localhost/jena_example/#text_dataset") ;
    +        return ds;
    +    }
    +
    +    private static Dataset loadData(Dataset ds) {
    +        JenaTextExample1.loadData(ds, "data-es.ttl");
    +        return ds;
    +    }
    +
    +    /**
    +     * Query Data
    +     * @param ds
    +     */
    +    private static void queryData(Dataset ds) {
    +//        JenaTextExample1.queryData(ds);
    --- End diff --
    
    It's actually something I comment and uncomment when testing different SPARQL queries, so I would prefer to keep it, if that's OK.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106429678
  
    --- Diff: jena-text/pom.xml ---
    @@ -115,6 +160,7 @@
               <includes>
                 <include>**/TS_*.java</include>
               </includes>
    +            <argLine>-Dtests.security.manager=false</argLine>
    --- End diff --
    
    Will do



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Thanks Osma for incorporating the changes into master. 👍




[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106624906
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    @osma, I need to understand more clearly the importance of the literal value returned here. From what I could debug, this literal value is not actually used; the only thing that matters, IMO, is the subject URI. I am sure I am missing something, but I can't pinpoint exactly what. Could you please shed some more light on why a specific literal value needs to be returned? In my case, when I do a search like (rdfs:label 'X1'), I get back the label-specific field, which is actually an array of values (because I use a single document to index multiple values). I could search through this list to find ALL the matching literal values and create a TextHit instance for each matched literal, but I don't want to do that if it is not necessary. Please let me know your thoughts.
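
A minimal sketch of the "search through the list" option mentioned above, written against plain strings so it runs without Jena or Elasticsearch on the classpath. The class name `MatchedLiterals` and the method `matching` are hypothetical and not part of the PR. The case-insensitive substring test is only a stand-in assumption: the real match is decided by the Elasticsearch analyzer and query parser, so this would at best approximate which stored values caused the hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the PR: picks out the stored values of a
// multi-valued field that appear to match a simple query term.
public class MatchedLiterals {

    /** Returns every stored value that contains the term, ignoring case. */
    static List<String> matching(List<String> storedValues, String term) {
        List<String> matches = new ArrayList<>();
        if (storedValues == null || term == null) {
            return matches;
        }
        String needle = term.toLowerCase(Locale.ROOT);
        for (String value : storedValues) {
            // Assumption: substring containment stands in for the real analyzer.
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(value);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("X1 first label", "unrelated", "second x1 label");
        // prints [X1 first label, second x1 label]
        System.out.println(matching(labels, "X1"));
    }
}
```

Each returned string could then be turned into a literal node and wrapped in its own hit, at the cost of one extra pass over the field values per search hit.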



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106641289
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occurred while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitly
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exist.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    OK. I think that is a good suggestion. It also keeps things simple.
    I think I have a mechanism to return ONLY the matched values in ES (I know it in theory, though I have yet to test it), but it could hurt performance, so I would not want it enabled by default.
    So for now, I will always return NULL for literals instead of wrong values.
    
    Thanks again @osma. I will make the necessary changes.
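
    As a rough sketch of that approach, each hit would carry only the subject and leave the literal unset. The `Hit` class and `toHits` method below are hypothetical stand-ins written for illustration; they are not Jena's `TextHit` or the code in this pull request.

```java
import java.util.ArrayList;
import java.util.List;

public class SubjectOnlyHits {

    /** Hypothetical stand-in for a search hit; not Jena's TextHit. */
    public static final class Hit {
        public final String subjectUri;
        public final String literal; // null when the matched value is not known

        public Hit(String subjectUri, String literal) {
            this.subjectUri = subjectUri;
            this.literal = literal;
        }
    }

    /** Builds one hit per subject and never guesses at a literal value. */
    public static List<Hit> toHits(List<String> subjectUris) {
        List<Hit> hits = new ArrayList<>();
        for (String uri : subjectUris) {
            hits.add(new Hit(uri, null));
        }
        return hits;
    }
}
```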



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107681154
  
    --- Diff: jena-text/src/main/resources/text-config-es.ttl ---
    @@ -0,0 +1,65 @@
    +    # Licensed to the Apache Software Foundation (ASF) under one
    +    # or more contributor license agreements.  See the NOTICE file
    +    # distributed with this work for additional information
    +    # regarding copyright ownership.  The ASF licenses this file
    +    # to you under the Apache License, Version 2.0 (the
    +    # "License"); you may not use this file except in compliance
    +    # with the License.  You may obtain a copy of the License at
    +    #
    +    #     http://www.apache.org/licenses/LICENSE-2.0
    +    #
    +    # Unless required by applicable law or agreed to in writing, software
    +    # distributed under the License is distributed on an "AS IS" BASIS,
    +    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    # See the License for the specific language governing permissions and
    +    # limitations under the License.
    +
    + ## Example of a TDB dataset and text index for ElasticSearch
    +
    +@prefix :        <http://localhost/jena_example/#> .
    +@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    +@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    +@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    +@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
    +@prefix text:    <http://jena.apache.org/text#> .
    +
    +# TDB
    +[] ja:loadClass "org.apache.jena.tdb.TDB" .
    +tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    +tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    +
    +# Text
    +[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
    +text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    +text:TextIndexES      rdfs:subClassOf   text:TextIndex .
    +
    +## ---------------------------------------------------------------
    +## This URI must be fixed - it's used to assemble the text dataset.
    +
    +:text_dataset rdf:type     text:TextDataset ;
    +    text:dataset   <#dataset> ;
    +    text:index     <#indexES> ;
    +    .
    +
    +<#dataset> rdf:type      tdb:DatasetTDB ;
    +    tdb:location "--mem--" ;
    +    .
    +
    +<#indexES> a text:TextIndexES ;
    +    text:serverList "127.0.0.1:9300" ; # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
    +    text:clusterName "elasticsearch" ; # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
    +    text:shards "1" ;                  # The number of shards for the index. Defaults to 1
    +    text:replicas "1" ;                # The number of replicas for the index. Defaults to 1
    +    text:indexName "jena-text" ;       # Name of the Index. defaults to jena-text
    +    text:multilingualSupport true ;
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by osma <gi...@git.apache.org>.

Github user osma commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r107672186
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexESAssembler.java ---
    @@ -0,0 +1,129 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text.assembler;
    +
    +import org.apache.jena.assembler.Assembler;
    +import org.apache.jena.assembler.Mode;
    +import org.apache.jena.assembler.assemblers.AssemblerBase;
    +import org.apache.jena.query.text.*;
    +import org.apache.jena.rdf.model.RDFNode;
    +import org.apache.jena.rdf.model.Resource;
    +import org.apache.jena.rdf.model.Statement;
    +import org.apache.jena.sparql.util.graph.GraphUtils;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import static org.apache.jena.query.text.assembler.TextVocab.*;
    +
    +public class TextIndexESAssembler extends AssemblerBase {
    +
    +    private static Logger LOGGER      = LoggerFactory.getLogger(TextIndexESAssembler.class) ;
    +
    +    protected static final String COMMA = ",";
    +    protected static final String COLON = ":";
    +    /*
    +    <#index> a :TextIndexES ;
    +        text:serverList "127.0.0.1:9300,127.0.0.2:9400,127.0.0.3:9500" ; #Comma separated list of hosts:ports
    +        text:clusterName "elasticsearch"
    +        text:shards "1"
    +        text:replicas "1"
    +        text:entityMap <#endMap> ;
    +        .
    +    */
    +    
    +    @SuppressWarnings("resource")
    +    @Override
    +    public TextIndex open(Assembler a, Resource root, Mode mode) {
    +        try {
    +            String listOfHostsAndPorts = GraphUtils.getAsStringValue(root, pServerList) ;
    +            if(listOfHostsAndPorts == null || listOfHostsAndPorts.isEmpty()) {
    +                throw new TextIndexException("Mandatory property text:serverList (containing the comma-separated list of host:port) property is not specified. " +
    +                        "An example value for the property: 127.0.0.1:9300");
    +            }
    +            String[] hosts = listOfHostsAndPorts.split(COMMA);
    +            Map<String,Integer> hostAndPortMapping = new HashMap<>();
    +            for(String host : hosts) {
    +                String[] hostAndPort = host.split(COLON);
    +                if(hostAndPort.length < 2) {
    +                    LOGGER.error("Either the host or the port value is missing. Please specify the property in host:port format. " +
    +                            "Both parts are mandatory. Ignoring this value. Moving to the next one.");
    +                    continue;
    +                }
    +                hostAndPortMapping.put(hostAndPort[0], Integer.valueOf(hostAndPort[1]));
    +            }
    +
    +            String clusterName = GraphUtils.getAsStringValue(root, pClusterName);
    +            if(clusterName == null || clusterName.isEmpty()) {
    +                LOGGER.warn("ClusterName property is not specified. Defaulting to 'elasticsearch'");
    +                clusterName = "elasticsearch";
    +            }
    +
    +            String numberOfShards = GraphUtils.getAsStringValue(root, pShards);
    +            if(numberOfShards == null || numberOfShards.isEmpty()) {
    +                LOGGER.warn("shards property is not specified. Defaulting to '1'");
    +                numberOfShards = "1";
    +            }
    +
    +            String replicationFactor = GraphUtils.getAsStringValue(root, pReplicas);
    +            if(replicationFactor == null || replicationFactor.isEmpty()) {
    +                LOGGER.warn("replicas property is not specified. Defaulting to '1'");
    +                replicationFactor = "1";
    +            }
    +
    +            String indexName = GraphUtils.getAsStringValue(root, pIndexName);
    +            if(indexName == null || indexName.isEmpty()) {
    +                LOGGER.warn("index Name property is not specified. Defaulting to 'jena-text'");
    +                indexName = "jena-text";
    +            }
    +
    +            boolean isMultilingualSupport = false;
    --- End diff --
    
    The multilingual parameter is no longer used in the ES implementation, so this block of code can be removed.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106154421
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.get.GetField;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String INDEX_NAME;
    --- End diff --
    
    Done



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r108437929
  
    --- Diff: jena-text/pom.xml ---
    @@ -112,11 +138,72 @@
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-surefire-plugin</artifactId>
             <configuration>
    -          <includes>
    -            <include>**/TS_*.java</include>
    -          </includes>
    +            <!-- Skip the default running of this plug-in (or everything is run twice...see below) -->
    +            <skip>true</skip>
             </configuration>
    +          <executions>
    +              <execution>
    +                  <id>unit-tests</id>
    +                  <phase>test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/TS_*.java</include>
    +                      </includes>
    +                      <excludes>
    +                          <exclude>**/*IT.java</exclude>
    +                      </excludes>
    +                  </configuration>
    +              </execution>
    +              <execution>
    +                  <id>integration-tests</id>
    +                  <phase>integration-test</phase>
    +                  <goals>
    +                      <goal>test</goal>
    +                  </goals>
    +                  <configuration>
    +                      <skip>false</skip>
    +                      <includes>
    +                          <include>**/*IT.java</include>
    --- End diff --
    
    It would be nice to collect the ITs into test suites in the same way the unit tests are collected, just to maintain a uniform way of working.
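
A minimal sketch of how the two pattern sets in the quoted hunk would classify a suite-style IT class. It approximates the Surefire patterns with plain `java.nio` globs, which is an assumption and not Surefire's own matcher. The class name `TestSelectionSketch` and all three file names are hypothetical, and only the includes/excludes visible in the hunk are modelled.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Approximates the include/exclude patterns from the pom with NIO globs. */
final class TestSelectionSketch {

    private static boolean matches(String pattern, String file) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(file));
    }

    /** "unit-tests" execution: include TS_*.java, exclude *IT.java. */
    static boolean selectedAsUnitTest(String file) {
        return matches("**/TS_*.java", file) && !matches("**/*IT.java", file);
    }

    /** "integration-tests" execution: include *IT.java. */
    static boolean selectedAsIntegrationTest(String file) {
        return matches("**/*IT.java", file);
    }

    public static void main(String[] args) {
        // Hypothetical file names, used only to exercise the patterns.
        String[] files = {
            "org/example/TS_Text.java",
            "org/example/TS_TextES_IT.java",
            "org/example/TextIndexESIT.java"
        };
        for (String f : files) {
            System.out.println(f + " unit=" + selectedAsUnitTest(f)
                    + " integration=" + selectedAsIntegrationTest(f));
        }
    }
}
```

Under these assumptions, a suite named along the lines of `TS_..._IT` would be picked up only by the integration execution. The individual `*IT` classes would still match the include as well, so the include would probably need narrowing to the suite to avoid running them twice.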



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on the issue:

    https://github.com/apache/jena/pull/227
  
    Thanks @osma and @ajs6f for your input. Can I then suggest that, instead of moving TestTextIndexES to the integration tests module, we get rid of it completely and instead have the same tests, as well as more complex ones, built with the Maven ES plugin?
    
    Also, can you provide some test scenarios that I can work on? I will make sure to include the `Berlin` removal example. Any others?
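
One possible shape for such a scenario, as a sketch only. It assumes the `Berlin` example means removing one indexed value for an entity and checking that it no longer matches while the entity's other values remain. The class `MultiValueDocSketch` is a hypothetical in-memory stand-in for one indexed document (field name mapped to a list of values) and does not talk to ElasticSearch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for one indexed document. */
final class MultiValueDocSketch {

    // field name -> values indexed for that field
    private final Map<String, List<String>> fields = new HashMap<>();

    /** Appends a value to the field, creating the field on first use. */
    void add(String field, String value) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    /** Removes the first occurrence of the value; returns true if something was removed. */
    boolean remove(String field, String value) {
        List<String> values = fields.get(field);
        return values != null && values.remove(value);
    }

    /** True if the field currently holds exactly this value. */
    boolean matches(String field, String value) {
        return fields.getOrDefault(field, Collections.emptyList()).contains(value);
    }

    public static void main(String[] args) {
        MultiValueDocSketch doc = new MultiValueDocSketch();
        doc.add("label", "Berlin");
        doc.add("label", "Bonn");
        doc.remove("label", "Berlin");
        System.out.println("Berlin still matches: " + doc.matches("label", "Berlin"));
        System.out.println("Bonn still matches: " + doc.matches("label", "Bonn"));
    }
}
```

A real integration test would run the same add, remove, and query steps against the index instead of this map. Removing a value that was never added would be a natural companion case.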




[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106153611
  
    --- Diff: jena-text/pom.xml ---
    @@ -81,39 +81,50 @@
           <artifactId>lucene-queryparser</artifactId>
         </dependency>
     
    -    <!-- Solr client -->
    -    <dependency>
    -      <artifactId>solr-solrj</artifactId>
    -      <groupId>org.apache.solr</groupId>
    -    </dependency>
    -
    -    <!-- Embedded server if used for testing
    -    <dependency>
    -      <artifactId>solr-core</artifactId>
    -      <groupId>org.apache.solr</groupId>
    -      <version>${ver.solr}</version>
    -      <type>jar</type>
    -      <scope>test</scope>
    -      <optional>true</optional>
    -      <exclusions>
    -        <exclusion>
    -          <groupId>org.slf4j</groupId>
    -          <artifactId>slf4j-api</artifactId>
    -        </exclusion>
    -        <exclusion>
    -          <groupId>org.slf4j</groupId>
    -          <artifactId>slf4j-jdk14</artifactId>
    -        </exclusion>
    -      </exclusions>
    -    </dependency>
    -
    -    <dependency>
    -      <groupId>javax.servlet</groupId>
    -      <artifactId>servlet-api</artifactId>
    -      <version>2.5</version>
    -      <scope>test</scope>
    -    </dependency>
    -    -->
    +      <dependency>
    +          <groupId>org.elasticsearch</groupId>
    +          <artifactId>elasticsearch</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.client</groupId>
    +          <artifactId>transport</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.apache.lucene</groupId>
    +          <artifactId>lucene-test-framework</artifactId>
    +      </dependency>
    +
    +      <dependency>
    +          <groupId>org.elasticsearch.test</groupId>
    +          <artifactId>framework</artifactId>
    +      </dependency>
    +
    +      <!-- This is required to by pass ES JAR Hell in test environment-->
    +      <dependency>
    +          <groupId>junit</groupId>
    +          <artifactId>junit</artifactId>
    +          <exclusions>
    +              <exclusion>
    +                  <groupId>org.hamcrest</groupId>
    +                  <artifactId>hamcrest-core</artifactId>
    +              </exclusion>
    +          </exclusions>
    +      </dependency>
    +
    +      <dependency>
    --- End diff --
    
    One is the core dependency and the other is the API dependency.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106152940
  
    --- Diff: jena-text/src/main/resources/text-config-es.ttl ---
    @@ -0,0 +1,65 @@
    +    # Licensed to the Apache Software Foundation (ASF) under one
    +    # or more contributor license agreements.  See the NOTICE file
    +    # distributed with this work for additional information
    +    # regarding copyright ownership.  The ASF licenses this file
    +    # to you under the Apache License, Version 2.0 (the
    +    # "License"); you may not use this file except in compliance
    +    # with the License.  You may obtain a copy of the License at
    +    #
    +    #     http://www.apache.org/licenses/LICENSE-2.0
    +    #
    +    # Unless required by applicable law or agreed to in writing, software
    +    # distributed under the License is distributed on an "AS IS" BASIS,
    +    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    # See the License for the specific language governing permissions and
    +    # limitations under the License.
    +
    + ## Example of a TDB dataset and text index for ElasticSearch
    +
    +@prefix :        <http://localhost/jena_example/#> .
    +@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    +@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    +@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    +@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
    +@prefix text:    <http://jena.apache.org/text#> .
    +
    +# TDB
    +[] ja:loadClass "org.apache.jena.tdb.TDB" .
    +tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    +tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    +
    +# Text
    +[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
    +text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    +text:TextIndexES      rdfs:subClassOf   text:TextIndex .
    +
    +## ---------------------------------------------------------------
    +## This URI must be fixed - it's used to assemble the text dataset.
    +
    +:text_dataset rdf:type     text:TextDataset ;
    +    text:dataset   <#dataset> ;
    +    text:index     <#indexES> ;
    +    .
    +
    +<#dataset> rdf:type      tdb:DatasetTDB ;
    +    tdb:location "--mem--" ;
    +    .
    +
    +<#indexES> a text:TextIndexES ;
    +    text:serverList "127.0.0.1:9300" ; # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
    +    text:clusterName "elasticsearch" ; # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
    +    text:shards "1" ;                  # The number of shards for the index. Defaults to 1
    +    text:replicas "1" ;                # The number of replicas for the index. Defaults to 1
    +    text:indexName "jena-text" ;       # Name of the Index. defaults to jena-text
    +    text:multilingualSupport true ;
    +    text:entityMap <#entMap> ;
    +    .
    +
    +<#entMap> a text:EntityMap ;
    +    text:entityField      "intel" ; # Defines the Document Type in the ES Index
    --- End diff --
    
    I will change the value to "uri". In TextIndexES, I use the entity field name as the name of the Document Type, hence the comment.



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106430285
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexES.java ---
    @@ -0,0 +1,394 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.query.text;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.util.NodeFactoryExtra;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
    +import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
    +import org.elasticsearch.action.get.GetResponse;
    +import org.elasticsearch.action.index.IndexRequest;
    +import org.elasticsearch.action.search.SearchResponse;
    +import org.elasticsearch.action.update.UpdateRequest;
    +import org.elasticsearch.action.update.UpdateResponse;
    +import org.elasticsearch.client.Client;
    +import org.elasticsearch.client.transport.TransportClient;
    +import org.elasticsearch.common.settings.Settings;
    +import org.elasticsearch.common.transport.InetSocketTransportAddress;
    +import org.elasticsearch.common.xcontent.XContentBuilder;
    +import org.elasticsearch.index.query.QueryBuilders;
    +import org.elasticsearch.script.Script;
    +import org.elasticsearch.search.SearchHit;
    +import org.elasticsearch.transport.client.PreBuiltTransportClient;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.InetAddress;
    +import java.util.*;
    +
    +import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    +
    +/**
    + * Elastic Search Implementation of {@link TextIndex}
    + *
    + */
    +public class TextIndexES implements TextIndex {
    +
    +    /**
    +     * The definition of the Entity we are trying to Index
    +     */
    +    private final EntityDefinition docDef ;
    +
    +    /**
    +     * Thread safe ElasticSearch Java Client to perform Index operations
    +     */
    +    private static Client client;
    +
    +    /**
    +     * The name of the index. Defaults to 'test'
    +     */
    +    private final String indexName;
    +
    +    static final String CLUSTER_NAME_PARAM = "cluster.name";
    +
    +    static final String NUM_OF_SHARDS_PARAM = "number_of_shards";
    +
    +    static final String NUM_OF_REPLICAS_PARAM = "number_of_replicas";
    +
    +    /**
    +     * Number of maximum results to return in case no limit is specified on the search operation
    +     */
    +    static final Integer MAX_RESULTS = 10000;
    +
    +    private boolean isMultilingual ;
    +
    +    private static final Logger LOGGER      = LoggerFactory.getLogger(TextIndexES.class) ;
    +
    +    public TextIndexES(TextIndexConfig config, ESSettings esSettings) {
    +
    +        this.indexName = esSettings.getIndexName();
    +        this.docDef = config.getEntDef();
    +
    +        this.isMultilingual = config.isMultilingualSupport();
    +        if (this.isMultilingual &&  config.getEntDef().getLangField() == null) {
    +            //multilingual index cannot work without lang field
    +            docDef.setLangField("lang");
    +        }
    +        try {
    +            if(client == null) {
    +
    +                LOGGER.debug("Initializing the Elastic Search Java Client with settings: " + esSettings);
    +                Settings settings = Settings.builder()
    +                        .put(CLUSTER_NAME_PARAM, esSettings.getClusterName()).build();
    +                List<InetSocketTransportAddress> addresses = new ArrayList<>();
    +                for(String host: esSettings.getHostToPortMapping().keySet()) {
    +                    InetSocketTransportAddress addr = new InetSocketTransportAddress(InetAddress.getByName(host), esSettings.getHostToPortMapping().get(host));
    +                    addresses.add(addr);
    +                }
    +
    +                InetSocketTransportAddress socketAddresses[] = new InetSocketTransportAddress[addresses.size()];
    +                client = new PreBuiltTransportClient(settings).addTransportAddresses(addresses.toArray(socketAddresses));
    +                LOGGER.debug("Successfully initialized the client");
    +            }
    +
    +            IndicesExistsResponse exists = client.admin().indices().exists(new IndicesExistsRequest(indexName)).get();
    +            if(!exists.isExists()) {
    +                Settings indexSettings = Settings.builder()
    +                        .put(NUM_OF_SHARDS_PARAM, esSettings.getShards())
    +                        .put(NUM_OF_REPLICAS_PARAM, esSettings.getReplicas())
    +                        .build();
    +                LOGGER.debug("Index with name " + indexName + " does not exist yet. Creating one with settings: " + indexSettings.toString());
    +                client.admin().indices().prepareCreate(indexName).setSettings(indexSettings).get();
    +            }
    +        }catch (Exception e) {
    +            throw new TextIndexException("Exception occured while instantiating ElasticSearch Text Index", e);
    +        }
    +    }
    +
    +
    +    /**
    +     * Constructor used mainly for performing Integration tests
    +     * @param config an instance of {@link TextIndexConfig}
    +     * @param client an instance of {@link TransportClient}. The client should already have been initialized with an index
    +     */
    +    public TextIndexES(TextIndexConfig config, Client client, String indexName) {
    +        this.docDef = config.getEntDef();
    +        this.isMultilingual = true;
    +        this.client = client;
    +        this.indexName = indexName;
    +    }
    +
    +    /**
    +     * We do not have any specific logic to perform before committing
    +     */
    +    @Override
    +    public void prepareCommit() {
    +        //Do Nothing
    +
    +    }
    +
    +    /**
    +     * Commit happens in the individual get/add/delete operations
    +     */
    +    @Override
    +    public void commit() {
    +        // Do Nothing
    +    }
    +
    +    /**
    +     * We do not do rollback
    +     */
    +    @Override
    +    public void rollback() {
    +       //Do Nothing
    +
    +    }
    +
    +    /**
    +     * We don't have resources that need to be closed explicitely
    +     */
    +    @Override
    +    public void close() {
    +        // Do Nothing
    +
    +    }
    +
    +    /**
    +     * Update an Entity. Since we are doing Upserts in add entity anyways, we simply call {@link #addEntity(Entity)}
    +     * method that takes care of updating the Entity as well.
    +     * @param entity the entity to update.
    +     */
    +    @Override
    +    public void updateEntity(Entity entity) {
    +        //Since Add entity also updates the indexed document in case it already exists,
    +        // we can simply call the addEntity from here.
    +        addEntity(entity);
    +    }
    +
    +
    +    /**
    +     * Add an Entity to the ElasticSearch Index.
    +     * The entity will be added as a new document in ES, if it does not already exists.
    +     * If the Entity exists, then the entity will simply be updated.
    +     * The entity will never be replaced.
    +     * @param entity the entity to add
    +     */
    +    @Override
    +    public void addEntity(Entity entity) {
    +        LOGGER.debug("Adding/Updating the entity in ES");
    +
    +        //The field that has a not null value in the current Entity instance.
    +        //Required, mainly for building a script for the update command.
    +        String fieldToAdd = null;
    +        String fieldValueToAdd = "";
    +        try {
    +            XContentBuilder builder = jsonBuilder()
    +                    .startObject();
    +
    +            for(String field: docDef.fields()) {
    +                if(entity.get(field) != null) {
    +                    if(entity.getLanguage() != null && !entity.getLanguage().isEmpty() && isMultilingual) {
    +                        fieldToAdd = field + "_" + entity.getLanguage();
    +                    } else {
    +                        fieldToAdd = field;
    +                    }
    +
    +                    fieldValueToAdd = (String) entity.get(field);
    +                    builder = builder.field(fieldToAdd, Arrays.asList(fieldValueToAdd));
    +                    break;
    +                } else {
    +                    //We are making sure that the field is at-least added to the index.
    +                    //This will help us tremendously when we are appending the data later in an already indexed document.
    +                    builder = builder.field(field, Collections.emptyList());
    +                }
    +
    +            }
    +
    +            builder = builder.endObject();
    +            IndexRequest indexRequest = new IndexRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .source(builder);
    +
    +            String addUpdateScript = "if(ctx._source.<fieldName> == null || ctx._source.<fieldName>.empty) " +
    +                    "{ctx._source.<fieldName>=['<fieldValue>'] } else {ctx._source.<fieldName>.add('<fieldValue>')}";
    +            addUpdateScript = addUpdateScript.replaceAll("<fieldName>", fieldToAdd).replaceAll("<fieldValue>", fieldValueToAdd);
    +
    +            UpdateRequest upReq = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                    .script(new Script(addUpdateScript))
    +                    .upsert(indexRequest);
    +
    +            UpdateResponse response = client.update(upReq).get();
    +
    +            LOGGER.debug("Received the following Update response : " + response + " for the following entity: " + entity);
    +
    +        } catch(Exception e) {
    +            throw new TextIndexException("Unable to Index the Entity in ElasticSearch.", e);
    +        }
    +    }
    +
    +    /**
    +     * Delete an entity.
    +     * Since we are storing different predicate values within the same indexed document,
    +     * deleting the document using entity Id is sufficient to delete all the related contents for a given entity.
    +     * @param entity entity to delete
    +     */
    +    @Override
    +    public void deleteEntity(Entity entity) {
    +
    +        String fieldToRemove = null;
    +        String valueToRemove = null;
    +        for(String field : docDef.fields()) {
    +            if(entity.get(field) != null) {
    +                fieldToRemove = field;
    +                valueToRemove = (String)entity.get(field);
    +                break;
    +            }
    +        }
    +
    +        String script = "if(ctx._source.<fieldToRemove> != null && (ctx._source.<fieldToRemove>.empty != true) " +
    +                "&& (ctx._source.<fieldToRemove>.indexOf('<valueToRemove>') >= 0)) " +
    +                "{ctx._source.<fieldToRemove>.remove(ctx._source.<fieldToRemove>.indexOf('<valueToRemove>'))}";
    +        script = script.replaceAll("<fieldToRemove>", fieldToRemove).replaceAll("<valueToRemove>", valueToRemove);
    +
    +        UpdateRequest updateRequest = new UpdateRequest(indexName, docDef.getEntityField(), entity.getId())
    +                .script(new Script(script));
    +
    +        try {
    +            client.update(updateRequest).get();
    +        }catch(Exception e) {
    +            throw new TextIndexException("Unable to delete entity.", e);
    +        }
    +
    +        LOGGER.debug("deleting content related to entity: " + entity.getId());
    +
    +    }
    +
    +    /**
    +     * Get an Entity given the subject Id
    +     * @param uri the subject Id of the entity
    +     * @return a map of field name and field values;
    +     */
    +    @Override
    +    public Map<String, Node> get(String uri) {
    +
    +        GetResponse response;
    +        Map<String, Node> result = new HashMap<>();
    +
    +        if(uri != null) {
    +            response = client.prepareGet(indexName, docDef.getEntityField(), uri).get();
    +            if(response != null && !response.isSourceEmpty()) {
    +                String entityField = response.getId();
    +                Node entity = NodeFactory.createURI(entityField) ;
    +                result.put(docDef.getEntityField(), entity);
    +                Map<String, Object> source = response.getSource();
    +                for (String field: docDef.fields()) {
    +                    Object fieldResponse = source.get(field);
    +
    +                    if(fieldResponse == null) {
    +                        //We wont return it.
    +                        continue;
    +                    }
    +                    else if(fieldResponse instanceof List<?>) {
    +                        //We are storing the values of fields as a List always.
    +                        //If there are values stored in the list, then we return the first value,
    +                        // else we do not include the field in the returned Map of Field -> Node Mapping
    +                        List<?> responseList = (List<?>)fieldResponse;
    +                        if(responseList != null && responseList.size() > 0) {
    +                            String fieldValue = (String)responseList.get(0);
    +                            Node fieldNode = NodeFactoryExtra.createLiteralNode(fieldValue, null, null);
    +                            result.put(field, fieldNode);
    +                        }
    +                    }
    +                }
    +            }
    +        }
    +
    +        return result;
    +    }
    +
    +    @Override
    +    public List<TextHit> query(Node property, String qs) {
    +
    +        return query(property, qs, MAX_RESULTS);
    +    }
    +
    +    /**
    +     * Query the ElasticSearch for the given Node, with the given query String and limit.
    +     * @param property the node property to make a search for
    +     * @param qs the query string
    +     * @param limit limit on the number of records to return
    +     * @return List of {@link TextHit}s containing the documents that have been found
    +     */
    +    @Override
    +    public List<TextHit> query(Node property, String qs, int limit) {
    +
    +        qs = parse(qs);
    +        LOGGER.debug("Querying ElasticSearch for QueryString: " + qs);
    +        SearchResponse response = client.prepareSearch(indexName)
    +                .setTypes(docDef.getEntityField())
    +                .setQuery(QueryBuilders.queryStringQuery(qs))
    +                .setFrom(0).setSize(limit)
    +                .get();
    +
    +        List<TextHit> results = new ArrayList<>() ;
    +        for (SearchHit hit : response.getHits()) {
    +
    +            Node literal;
    +            String field = (property != null) ? docDef.getField(property) : docDef.getPrimaryField();
    +            List<String> value = (List<String>)hit.getSource().get(field);
    +            if(value != null) {
    +                literal = NodeFactory.createLiteral(value.get(0));
    --- End diff --
    
    Let me try to return the language-tagged value. I don't know how it will work, but I will give it a try.
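
A rough sketch of one way this could go, assuming values stay stored under `field_lang` keys as in the quoted `addEntity` code. The class `LangTaggedValueSketch` and its method are hypothetical. The source map is a plain `Map` standing in for the hit's source, and results are rendered as `value@lang` strings rather than real Jena nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: recovers language tags from "field_lang" keys. */
final class LangTaggedValueSketch {

    /**
     * Returns the values for the given field: plain values stored under the
     * bare field name, and "value@lang" for values stored under "field_lang".
     */
    static List<String> taggedValues(Map<String, List<String>> source, String field) {
        List<String> out = new ArrayList<>();
        String prefix = field + "_";
        for (Map.Entry<String, List<String>> e : source.entrySet()) {
            String key = e.getKey();
            if (key.equals(field)) {
                out.addAll(e.getValue());
            } else if (key.startsWith(prefix)) {
                String lang = key.substring(prefix.length());
                for (String value : e.getValue()) {
                    out.add(value + "@" + lang);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> source = new LinkedHashMap<>();
        source.put("label", new ArrayList<>());
        source.put("label_en", Arrays.asList("Berlin"));
        source.put("label_fr", Arrays.asList("Berlin (ville)"));
        System.out.println(taggedValues(source, "label"));
    }
}
```

One limitation of this approach: a separate field whose name happens to start with `label_` would be misread as a language variant, so a real implementation would need a less ambiguous separator or a check against the configured field names.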



[GitHub] jena pull request #227: JENA-1305 | Elastic search support for Jena Text

Posted by anujgandharv <gi...@git.apache.org>.

Github user anujgandharv commented on a diff in the pull request:

    https://github.com/apache/jena/pull/227#discussion_r106431036
  
    --- Diff: jena-text/testing/TextQuery/text-config.ttl ---
    @@ -31,6 +31,7 @@ text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
     
     <#indexLucene> a text:TextIndexLucene ;
         text:directory "mem" ;
    +    text:multilingualSupport true ;
    --- End diff --
    
    Well, I was testing how multilingual support works, and since making this change did not break anything, I left it there. I can revert it.



[GitHub] jena issue #227: JENA-1305 | Elastic search support for Jena Text

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/227
  
    @osma I'm done mumbling over this PR. I think it looks okay. I did leave some questions about the Maven stuff, but nothing in there makes me freak out. The biggest question I have is why `log4j` is showing up. 

