Posted to dev@stanbol.apache.org by Ali Anil Sinaci <a....@gmail.com> on 2011/10/04 10:55:34 UTC

Re: Contenthub structure

Dear all,

We (SRDC team) have completed an initial version of the implementation 
of contenthub. I am going to upload a patch reflecting the changes by 
creating an issue on the Jira server.

In general, we did the following:

    * Contenthub has a new store implementation, which stores the
      documents in a Solr server.
    * Our search component is now integrated into contenthub. In
      addition to the text based search on Solr documents, it tries to
      search over user supplied ontologies as well as the enhancements
      extracted from all documents.
    * Faceted search

Let me go into the details.

Regarding the Solr backend of contenthub,

    * We have used the SolrServerProviderManager and
      SolrDirectoryManager from stanbol.commons.solr to initialize an
      EmbeddedSolrServer.
    * We have created our own core files (indexes) for Solr.
    * Currently, only text files can be submitted to contenthub. The
      whole text content and the supplied constraints are indexed.
    * Content items can be saved and removed. The update function is
      not implemented yet.

Regarding the search component,

    * Our old implementation consisted of five different search
      engines. The approach was to run each of them and merge the
      results of each engine. However, this leads to efficiency problems
      as the amount of data increases. Currently, three search engines
      run and results are merged as they arrive. We are trying to come
      up with a unified approach to overcome the efficiency issues. In
      the end, we are planning to have Solr index most of the search
      resources.
    * In addition to the search on the Solr index, we search over all
      enhancements. All enhancements are stored in a single graph on
      TCManager. This graph is indexed and searched through LARQ.
      EnhancementListener mainly handles this job. In the near future,
      we are planning to get rid of this listener approach by storing
      the enhancements in Solr aligned with the content items.
    * If there is any user ontology in the system, and if the user wants
      to include that ontology in the search operation, we index that
      ontology and perform a search through LARQ. Matching ontology
      resources give us new keywords. This approach will also be
      improved as we unify our solution.
    * In a search using multiple keywords, currently we do not take the
      relation between the keywords into account.
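
As a rough illustration of "results are merged as they arrive", here is a
minimal Python sketch. It is not the Contenthub code: the three engine
functions, their integer scores and the "sum the scores" merge rule are all
assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for the three search engines. Each returns
# {document_id: score}; the ids and scores are illustrative only.
def solr_engine(keyword):
    return {"doc1": 3, "doc2": 1}

def enhancement_engine(keyword):
    return {"doc2": 1, "doc3": 1}

def ontology_engine(keyword):
    return {"doc1": 1}

def merged_search(keyword, engines):
    """Run all engines concurrently and merge scores as results arrive."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(engine, keyword) for engine in engines]
        for future in as_completed(futures):
            for doc_id, score in future.result().items():
                # Assumption: scores from different engines are summed.
                merged[doc_id] = merged.get(doc_id, 0) + score
    # Highest combined score first; ties are broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))

results = merged_search(
    "workshop", [solr_engine, enhancement_engine, ontology_engine]
)
```

Every engine still runs for every query in this sketch, so the cost grows
with the number of engines, which is the efficiency concern described above.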

Regarding the faceted search,

    * Contenthub enables storage of field:[value,] pairs through Solr's
      faceted search mechanism. The user may save any constraint
      (field:[value,] pair) along with the content item.
    * Our Solr index makes use of dynamic fields to index any value
      carried with the content item.
    * In the first search, facets are constructed from the fields of
      the resulting documents. After that, the user can refine the
      results with the faceted search features.
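
To make the field:[value,] idea concrete, here is a small Python sketch under
stated assumptions. The "_t" suffix is a hypothetical dynamic-field pattern
(not necessarily the one in our Solr schema), and the documents are plain
dictionaries rather than real Solr documents.

```python
from collections import Counter, defaultdict

def to_solr_document(content_id, text, constraints):
    """Map user-supplied constraints onto dynamic field names.

    The "_t" suffix is a hypothetical dynamic-field pattern used only
    for this illustration.
    """
    doc = {"id": content_id, "content": text}
    for field, values in constraints.items():
        doc[field + "_t"] = list(values)
    return doc

def build_facets(documents):
    """Count the values of every dynamic field across result documents."""
    facets = defaultdict(Counter)
    for doc in documents:
        for field, values in doc.items():
            if field.endswith("_t"):
                facets[field[:-2]].update(values)
    return {field: dict(counts) for field, counts in facets.items()}

docs = [
    to_solr_document("c1", "CMS workshop in Paris",
                     {"author": ["anil"], "tag": ["cms", "event"]}),
    to_solr_document("c2", "Solr tutorial",
                     {"author": ["suat"], "tag": ["cms"]}),
]
facets = build_facets(docs)
```

The facet counts computed from the first result set are what the user would
then click on to narrow the search.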

We are planning to continue with implementing new search engines through 
a unified approach, to increase the semantic capabilities of search. For 
example, we plan to analyze city-country, person-organization and 
person-birthplace relations. Apart from that, we also plan to integrate 
the latest version of WordNet to extend the search facilities with 
external resources.

Regarding LMF, up to now, we have not considered any collaboration. 
However, from now on we will try not to duplicate efforts and focus on 
divergent parts of contenthub.

Kind regards,

Anil.

-------- Original Message --------
Subject: 	Fwd: Re: Contenthub structure
Date: 	Thu, 18 Aug 2011 15:01:01 +0300
From: 	Suat Gonul <su...@gmail.com>
To: 	anil@srdc.com.tr

-------- Original Message --------
Subject: 	Re: Contenthub structure
Date: 	Thu, 2 Jun 2011 10:54:15 +0200
From: 	Rupert Westenthaler <ru...@gmail.com>
Reply-To: 	stanbol-dev@incubator.apache.org
To: 	stanbol-dev@incubator.apache.org

Hi all

I will try to create a small usage scenario here:

A user posts a query for "CMS workshops in France" to the Contenthub:

The semantic Search component of the Contenthub uses several
SearchEngines (like EnhancementEngines in the Enhancer).

1. OntologySearcher: It tries to identify Concepts mentioned in the
Search. For the example it will find the Concept "Workshop"
2. EntitySearcher: It tries to find Entities for words used in the
Query. For the example it will find "France"
3. Faceted Search engine: It will compose a Lucene type search for
Documents with
  * a reference to Workshop
  * a reference to France
  * the text "CMS"

If there were another Search engine that could understand the internal
structure of the query, one could even search for things
* with the type Workshop
* located within Paris
* the text "CMS"
and because Workshops are events one could activate Facets for
* Location
* Time
* Participants
* facets explicitly requested with the query (e.g. Tags, Creator ...)
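
A minimal Python sketch of the three steps above (concept lookup, entity
lookup, composing a Lucene-type query) could look like this. The lookup
tables stand in for the ontology and the Entityhub, and the field names
`concept`, `entity` and `text` are hypothetical.

```python
# Hypothetical lookup tables standing in for the ontology and the Entityhub.
CONCEPTS = {"workshop": "Workshop", "workshops": "Workshop"}
ENTITIES = {"france": "France", "paris": "Paris"}
STOPWORDS = {"in", "the", "of"}

def compose_query(user_query):
    """Turn a keyword query into a Lucene-style query string."""
    clauses = []
    for token in user_query.split():
        key = token.lower()
        if key in STOPWORDS:
            continue
        if key in CONCEPTS:
            # Step 1: the token names a known concept.
            clauses.append('concept:"%s"' % CONCEPTS[key])
        elif key in ENTITIES:
            # Step 2: the token names a known entity.
            clauses.append('entity:"%s"' % ENTITIES[key])
        else:
            # Step 3: everything else becomes a plain text clause.
            clauses.append('text:"%s"' % token)
    return " AND ".join(clauses)

query = compose_query("CMS workshops in France")
# query == 'text:"CMS" AND concept:"Workshop" AND entity:"France"'
```

The sketch treats each word independently, so it does not capture the
"located within" relation that a structure-aware engine would need.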

So the Idea is to use

* Ontologies (CMS-Adapter & Kres)
* Entityhub
* maybe neural networks with learned query patterns??
* other stuff??

for query preprocessing and

* full text indices over Documents
* full text indices over Facts (like the Workshop)
* SPARQL endpoints over Enhancements
* other things??

for the execution of the enhanced query.

Joining results from the different sources (Documents, Facts,
Enhancements) would be challenging. However, I think this feature would
not be necessary for a first version.

I would also like to consider this
[Screencast](http://www.srdc.com.tr/iks/2ndyear/DemoVideo.htm) in the
context of this Usage Scenario.

WDYT
Rupert

On Wed, Jun 1, 2011 at 10:26 AM, Olivier Grisel
<ol...@ensta.org>  wrote:
>  2011/6/1 Suat Gonul<su...@gmail.com>:
>>  Hi everybody,
>>
>>  After discussing with Rupert yesterday, we have come up with a basic design
>>  for the Contenthub component.
>>
>>  It will provide two main RESTful interfaces to:
>>
>>  1) Upload (register) content and metadata (Available in current
>>  implementation)
>>  2) Search for registered content
>>
>>  There would be Indexing Engines for (1) and Search Engines for (2). The
>>  Contenthub implementation would then implement Indexing Engines to store the
>>  enhancements in a triple store and Search Engines to search enhancements and
>>  content items in triple store.
>>
>>  There is also an already started implementation for the search part in
>>  the Google Code base of the IKS project at [1]. It will be integrated
>>  into the Contenthub component.
>>
>>  What do you think?
>
>  I think the default search implementation for content should be based
>  on fulltext indexing using the EntityHub's SolrYard extended with
>  faceted search.
>
>  I find the combination of fulltext search and facet-based structured
>  refinements much more intuitive than the traditional multi-field,
>  form-based search interface.
>
>  --
>  Olivier
>  http://twitter.com/ogrisel - http://github.com/ogrisel
>

-- 
| Rupert Westenthaler                            rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Contenthub structure

Posted by Ali Anil Sinaci <a....@gmail.com>.

Hi all,

I have submitted the patch which reflects our latest work to Stanbol 
Contenthub. In addition to the major improvements which I describe in 
the issues (https://issues.apache.org/jira/browse/STANBOL-360, 
https://issues.apache.org/jira/browse/STANBOL-361 and 
https://issues.apache.org/jira/browse/STANBOL-362), the patch includes 
several minor improvements. Let me describe the improvements as follows:

    * EnhancementListener has been removed. Contenthub was keeping a
      graph of all enhancements by registering a listener to the
      corresponding graph of TCManager. This logic has been distributed
      among methods of SolrStore.
    * EnhancementSearchEngine is removed. Semantic relations of a
      content item are kept within the Solr document of the content
      item. These semantic relations are extracted from the enhancements
      of the document at the time of submission. The following semantic
      fields are indexed along with the content item (if any of them exist):
          o countries of cities
          o imagecaptions
          o regions (of cities, provinces etc...)
          o governors
          o capitals of countries
          o largest cities of countries
          o leader names of countries
          o given names of persons
          o knownfor fields of persons (e.g. Dennis Ritchie is knownfor
            "C Programming Language")
          o birthplaces of persons
          o work institutions of persons
          o short descriptions of persons
          o captions of persons
          o fields of persons (e.g. Alan Turing has fields
            "cryptanalysis", "computer science" and "mathematics")
    * Three default facet fields (places, people, organizations) are
      constructed from the content item's enhancements at the time of
      submission. If an entity is found with the type of "dbpedia/Place"
      or "dbpedia/Person" or "dbpedia/Organisation", the values are
      indexed accordingly to be presented to the user as a facet in the
      search results.
    * Search results are presented in a simpler form. No "ontology
      terms" are shown to the user.
    * Submitted content items can be edited/updated now. Enhancements
      are managed accordingly.
    * LARQ index on global enhancements graph is removed.
    * Recently submitted documents are retrieved from Solr by means of
      the provided page and offset mechanism.
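
The construction of the three default facet fields can be sketched in Python
as follows. This is only an illustration: the enhancements are simplified to
hypothetical (label, [types]) pairs, and the type strings are the short forms
used in this message rather than full DBpedia URIs.

```python
# Short type labels as written in this message; the full DBpedia
# ontology URIs are deliberately left out.
TYPE_TO_FACET = {
    "dbpedia/Place": "places",
    "dbpedia/Person": "people",
    "dbpedia/Organisation": "organizations",
}

def default_facets(entities):
    """Group entity labels into the three default facet fields.

    `entities` is a hypothetical simplification of the enhancements:
    a list of (label, [types]) pairs.
    """
    facets = {name: [] for name in TYPE_TO_FACET.values()}
    for label, types in entities:
        for entity_type in types:
            facet = TYPE_TO_FACET.get(entity_type)
            # Skip unknown types and avoid duplicate labels per facet.
            if facet and label not in facets[facet]:
                facets[facet].append(label)
    return facets

found = default_facets([
    ("Paris", ["dbpedia/Place"]),
    ("Alan Turing", ["dbpedia/Person"]),
    ("Apache Software Foundation", ["dbpedia/Organisation"]),
    ("Paris", ["dbpedia/Place"]),
])
```

Entities whose types match none of the three are simply ignored here, so a
content item without such entities ends up with three empty facet fields.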

Currently, we only consider text based content. Multimedia support is 
one important thing that we will work on.

Regarding the semantic search, the default cache configuration of Entityhub 
does not include the semantic relations that we process for indexing. To get 
the full functionality of semantic search, we currently need to remove 
the default index and run the system without any cache.

Regards,

Anil.
