Posted to dev@stanbol.apache.org by Ali Anil Sinaci <a....@gmail.com> on 2012/01/26 16:46:07 UTC

New features for Contenthub

Dear Stanbolers,

I have committed major changes related to Contenthub. Below, you can 
find some explanations about the changes. I have grouped them under two 
major issues in Jira (STANBOL-469 and STANBOL-470) although there are 
several sub-issues. Later improvements will be issued under their 
specific topics.

Contenthub includes two main parts: store and search. Solr is the 
back-end for all store and retrieve operations of content items 
(SolrContentItem extends ContentItem). Major improvements are as follows:

- Store maintains a default Solr core (called "contenthub") through the 
EmbeddedSolrServer. This default core indexes several semantic 
properties of entities when they are retrieved from the referenced 
sites. (The current DBpedia index does not include most of these 
properties; we have a larger index for this.)

- LDPath has been integrated into Contenthub.
     * Several Solr cores can be managed through LDProgramManager of 
Contenthub.
     * Each LDPath program corresponds to a unique Solr core. LDPath 
programs (hence Solr cores) are uniquely identified by their names. 
LDProgramManager and SolrCoreManager provide the required 
synchronization between Solr cores and LDPath programs.
     * Submitted LDPath programs are saved into separate files and 
accessed via a simple cache mechanism.
     * CRD operations for LDPath programs are provided through 
LDProgramManager.
     * ClerezzaBackend is implemented as an LDPath backend.
     * LDProgramManager has a special method (executeProgram) to execute 
the LDPath programs on Clerezza MGraphs.
     * REST services are ready for LDProgramManager functionalities.
     * Contenthub Store and Search parts (all interfaces and REST APIs) 
are adjusted so that they can operate with LDPath programs.
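
The one-to-one relation between LDPath programs and Solr cores can be 
illustrated with a minimal Java sketch. This is not the LDProgramManager 
API: the class ProgramRegistrySketch and all of its method names are 
hypothetical, the Solr cores are represented only by their names, and 
the in-memory map merely stands in for the file-based cache.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT the real LDProgramManager of Contenthub.
// It only shows "one LDPath program <-> one Solr core, keyed by name".
class ProgramRegistrySketch {

    // program name -> LDPath program text (stands in for the cached files)
    private final Map<String, String> programs = new HashMap<>();
    // names of existing Solr cores (stands in for SolrCoreManager)
    private final Set<String> cores = new HashSet<>();

    // Create: a name can be used only once; the core gets the same name.
    synchronized void submitProgram(String name, String ldPathProgram) {
        if (programs.containsKey(name)) {
            throw new IllegalArgumentException("Program already exists: " + name);
        }
        programs.put(name, ldPathProgram);
        cores.add(name);
    }

    // Retrieve: returns null when no program is registered under the name.
    synchronized String getProgram(String name) {
        return programs.get(name);
    }

    // Delete: the program and its core are removed together.
    synchronized boolean deleteProgram(String name) {
        boolean existed = programs.remove(name) != null;
        cores.remove(name);
        return existed;
    }

    synchronized boolean hasCore(String name) {
        return cores.contains(name);
    }
}
```

Updating both collections inside the same synchronized method is what 
keeps programs and cores in step in this sketch; how the actual 
managers achieve the synchronization is not shown here.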

- The Web GUI of Contenthub currently operates only on the default Solr 
index ("contenthub"). Enabling other cores (generated through LDPath 
programs) is on the TODO list.

- Search logic has been implemented from scratch.
     * The search engine pattern has been removed for document search.
     * Content items are indexed through Solr cores. Therefore, all 
searches on content items are performed through Solr indexes.
     * The search interface has been split into three different 
interfaces: SolrSearch, RelatedKeywordSearch and FeaturedSearch.
     * SolrSearch is compatible with SolrJ. That is, clients who have 
already been using SolrJ can easily switch to the SolrSearch API of 
Contenthub. As a result of LDPath integration, additional methods exist 
in this interface to accept LDPath program names (Solr core names). 
There is a single implementation of this interface in Contenthub.
     * RelatedKeywordSearch exposes a "search engine" pattern, but only 
to search for related keywords. RelatedKeywordSearchManager is the 
manager to handle several implementations of this interface (engines).
     * In addition to the search results retrieved from SolrSearch, 
users can now send their search keywords (query terms) to 
RelatedKeywordSearchManager to retrieve related keywords from different 
sources. This can be performed as a separate process from SolrSearch.
     * RelatedKeywordSearch has been implemented by WordnetSearch, 
OntologyResourceSearch and ReferencedSiteSearch. As their names 
indicate, they look for related keywords within their resources. 
(WordnetSearch can be excluded until the license issue is resolved or a 
new client library is used)
     * FeaturedSearch combines the capabilities of SolrSearch and 
RelatedKeywordSearch in case a client wants to retrieve all results 
(content items and related keywords) from Contenthub search.
     * FeaturedSearch provides a similar interface to SolrSearch with 
additional methods. However, the behaviour is different: it is 
"featured" in this implementation.
     * FeaturedSearch provides a special method: tokenizeEntities. This 
method takes a query string and finds out whether the query contains 
any entities. Based on the discovered entities, FeaturedSearch prepares 
Solr queries in special formats to boost the results related to the 
entities. However, this method needs to be improved to cover the large 
number of cases that can occur during keyword searches.
     * FeaturedSearch provides special methods to ease faceted search. 
The Web GUI of Contenthub makes use of this interface to enable faceted 
search.
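
The split described above can be pictured with a short Java sketch. All 
type and method names in it (KeywordEngineSketch, KeywordManagerSketch, 
DocumentSearchSketch, FeaturedSearchSketch) are hypothetical and are 
not the Contenthub interfaces; document hits are reduced to plain 
strings and no Solr or SolrJ call is made. It only shows the shape: 
several keyword engines behind one manager, a separate document search 
that takes a core name, and a combined search on top of both.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one related-keyword engine (e.g. one source).
interface KeywordEngineSketch {
    String sourceName();
    List<String> relatedKeywords(String keyword);
}

// Hypothetical manager: asks every registered engine, groups by source.
class KeywordManagerSketch {
    private final List<KeywordEngineSketch> engines = new ArrayList<>();

    void register(KeywordEngineSketch engine) {
        engines.add(engine);
    }

    Map<String, List<String>> search(String keyword) {
        Map<String, List<String>> bySource = new LinkedHashMap<>();
        for (KeywordEngineSketch engine : engines) {
            bySource.put(engine.sourceName(), engine.relatedKeywords(keyword));
        }
        return bySource;
    }
}

// Hypothetical document search; coreName plays the LDPath program name.
interface DocumentSearchSketch {
    List<String> search(String query, String coreName);
}

// Hypothetical combined search: documents plus related keywords.
class FeaturedSearchSketch {

    static final class Result {
        final List<String> documentIds;
        final Map<String, List<String>> relatedKeywords;

        Result(List<String> documentIds, Map<String, List<String>> relatedKeywords) {
            this.documentIds = documentIds;
            this.relatedKeywords = relatedKeywords;
        }
    }

    private final DocumentSearchSketch documents;
    private final KeywordManagerSketch keywords;

    FeaturedSearchSketch(DocumentSearchSketch documents, KeywordManagerSketch keywords) {
        this.documents = documents;
        this.keywords = keywords;
    }

    Result search(String query, String coreName) {
        return new Result(documents.search(query, coreName), keywords.search(query));
    }
}
```

Because the keyword manager and the document search are separate 
objects in this sketch, a client can call either one alone or use the 
combined class to get both kinds of results in one call.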

Some minor improvements are as follows:

- Web resources of Contenthub have been adjusted according to the latest 
improvements.

- Contenthub/core bundle has been removed. Refactoring Contenthub has 
led to a more efficient use of several classes, hence there is 
currently no need for a separate core bundle.

- Contenthub parent pom has been adjusted. All dependencies have been 
moved into Stanbol parent.

- helper/cnn-importer has been repackaged under crawler/cnn.

- api has been repackaged under servicesapi.

- Sling-based unit and integration tests are on the way.

All the best,
Anil.

Re: New features for Contenthub

Posted by Ali Anil Sinaci <a....@gmail.com>.

Hi Fabian,

Web GUI and documentation are the next steps in our plan. Afterwards we 
will prepare a demo for these features.

Best,
Anil.

On 01/26/2012 08:35 PM, Fabian Christ wrote:
> Hi Ali and Suat,
>
> sorry for the mistake. My mail was meant to be addressed in reply to Ali ;)
>
> On 26 January 2012 19:33, Fabian Christ<ch...@googlemail.com> wrote:
>> Hi Suat,
>>
>> this is a really impressive list of changes and features. Do you have
>> plans regarding documentation, demos, tutorials?
>>
>> Best,
>>   - Fabian

Re: New features for Contenthub

Posted by Fabian Christ <ch...@googlemail.com>.

Hi Ali and Suat,

sorry for the mistake. My mail was meant to be addressed in reply to Ali ;)

On 26 January 2012 19:33, Fabian Christ <ch...@googlemail.com> wrote:
> Hi Suat,
>
> this is a really impressive list of changes and features. Do you have
> plans regarding documentation, demos, tutorials?
>
> Best,
>  - Fabian



-- 
Fabian
http://twitter.com/fctwitt

Re: New features for Contenthub

Posted by Fabian Christ <ch...@googlemail.com>.

Hi Suat,

this is a really impressive list of changes and features. Do you have
plans regarding documentation, demos, tutorials?

Best,
 - Fabian

-- 
Fabian
http://twitter.com/fctwitt

Re: New features for Contenthub

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi Anil

That looks great. I will do a much more detailed review of all of it.

On Thu, Jan 26, 2012 at 4:46 PM, Ali Anil Sinaci <a....@gmail.com> wrote:
> (Current
> dbpedia index does not include most of these properties. We have a larger
> index for this)
>

Can you please upload this index to

    http://dev.iks-project.eu/downloads/stanbol-indices/

Give me a short ping on IRC if you do not have an account yet.

best
Rupert


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen