Posted to dev@stanbol.apache.org by Ali Anil Sinaci <a....@gmail.com> on 2012/01/26 16:46:07 UTC
New features for Contenthub
Dear Stanbolers,
I have committed major changes related to Contenthub. Below, you can
find some explanations about the changes. I have grouped them under two
major issues in Jira (STANBOL-469 and STANBOL-470) although there are
several sub-issues. Later improvements will be issued under their
specific topics.
Contenthub includes two main parts: store and search. Solr is the
back-end for all store and retrieve operations of content items
(SolrContentItem extends ContentItem). Major improvements are as follows:
- The Store maintains a default Solr core (called "contenthub") through
the EmbeddedSolrServer. This default core indexes several semantic
properties of entities if they are retrieved from the referenced
sites. (The current DBpedia index does not include most of these
properties; we have a larger index for this.)
- LDPath has been integrated into Contenthub.
* Several Solr cores can be managed through LDProgramManager of
Contenthub.
* Each LDPath program corresponds to a unique Solr core. LDPath
programs (hence Solr cores) are uniquely identified through their names.
LDProgramManager and SolrCoreManager provide the required
synchronization between Solr cores and LDPath programs.
* Submitted LDPath programs are saved into separate files and
accessed via a simple cache mechanism.
* CRD (create, retrieve, delete) operations for LDPath programs are
provided through LDProgramManager.
* ClerezzaBackend is implemented as an LDPath backend.
* LDProgramManager has a special method (executeProgram) to execute
the LDPath programs on Clerezza MGraphs.
* REST services are ready for LDProgramManager functionalities.
* The Store and Search parts of Contenthub (all interfaces and REST
APIs) have been adjusted so that they can operate with LDPath programs.
- Web GUI of Contenthub only operates on the default Solr index
("contenthub"). Enabling other cores (generated through LDPath programs)
is in the TODO list.
- Search logic has been implemented from scratch.
* Search engine pattern has been removed for document search.
* Content items are indexed through Solr cores. Therefore, all
searches on content items are performed through Solr indexes.
* The search interface has been split into three different
interfaces: SolrSearch, RelatedKeywordSearch and FeaturedSearch.
* SolrSearch is compatible with SolrJ. That is, clients who have
already been using SolrJ can easily switch to SolrSearch API of
Contenthub. As a result of LDPath integration, additional methods exist
in this interface to accept LDPath program names (Solr core names).
There is a single implementation of this interface in Contenthub.
* RelatedKeywordSearch exposes a "search engine" pattern, but only
to search for related keywords. RelatedKeywordSearchManager is the
manager to handle several implementations of this interface (engines).
* In addition to the search results retrieved from SolrSearch,
users can now send their search keywords (query terms) to
RelatedKeywordSearchManager to retrieve related keywords from different
sources. This can be performed as a separate process from SolrSearch.
* RelatedKeywordSearch has been implemented by WordnetSearch,
OntologyResourceSearch and ReferencedSiteSearch. As their names
indicate, they look for related keywords within their resources.
(WordnetSearch can be excluded until the license issue is resolved or a
new client library is used)
* FeaturedSearch combines the capabilities of SolrSearch and
RelatedKeywordSearch in case a client wants to retrieve all results
(content items and related keywords) from Contenthub search.
* FeaturedSearch provides a similar interface to SolrSearch with
additional methods. However, its behaviour is different: it is
"featured" in this implementation.
* FeaturedSearch provides a special method: tokenizeEntities. This
method takes a query string and checks whether the query contains any
entities. Based on the discovered entities, FeaturedSearch prepares
Solr queries in special formats to boost the results related to those
entities. However, this method still needs to be improved to cover the
very large number of cases that can occur during keyword searches.
* FeaturedSearch provides special methods to ease faceted search. The
Web GUI of Contenthub makes use of this interface to enable faceted
search.
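To make the split of the search interfaces described above a bit more
concrete, here is a minimal sketch of how the pieces could fit together.
It is not the Contenthub code: the real interfaces are Java and their
signatures are not shown in this mail, so every class, method and
parameter name below is a hypothetical stand-in, and the toy in-memory
"index" only takes the place of a Solr core. The only detail taken from
this mail is the name of the default core, "contenthub".

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: an "engine" maps a keyword to related keywords.
KeywordEngine = Callable[[str], List[str]]


class RelatedKeywordSearchManager:
    """Fans a keyword out to every registered engine ("search engine" pattern)."""

    def __init__(self) -> None:
        self._engines: Dict[str, KeywordEngine] = {}

    def register(self, name: str, engine: KeywordEngine) -> None:
        self._engines[name] = engine

    def related_keywords(self, keyword: str) -> Dict[str, List[str]]:
        # Results stay grouped by source so a client can tell them apart.
        return {name: engine(keyword) for name, engine in self._engines.items()}


class FeaturedSearch:
    """Combines document search with related-keyword lookup."""

    def __init__(self, document_search, keyword_manager) -> None:
        self._document_search = document_search
        self._keyword_manager = keyword_manager

    def search(self, keyword: str, core_name: str = "contenthub") -> dict:
        return {
            "documents": self._document_search(keyword, core_name),
            "related_keywords": self._keyword_manager.related_keywords(keyword),
        }


def toy_document_search(keyword: str, core_name: str) -> List[str]:
    # Toy per-core "index"; a real implementation would query Solr instead.
    index = {"contenthub": ["doc-1: Paris travel guide", "doc-2: Berlin news"]}
    return [doc for doc in index.get(core_name, []) if keyword.lower() in doc.lower()]


manager = RelatedKeywordSearchManager()
# Two made-up engines standing in for ontology and referenced-site lookups.
manager.register("ontology", lambda kw: ["France"] if kw == "Paris" else [])
manager.register("referenced-site", lambda kw: ["Eiffel Tower"] if kw == "Paris" else [])

featured = FeaturedSearch(toy_document_search, manager)
result = featured.search("Paris")
```

In this sketch the related-keyword lookup can also be called on its own
through the manager, which mirrors the point above that it can run as a
separate process from the document search.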
Some minor improvements are as follows:
- Web resources of Contenthub have been adjusted according to the latest
improvements.
- The Contenthub/core bundle has been removed. Refactoring Contenthub has
led to a more efficient use of several classes, hence currently there
is no need for a separate core bundle.
- The Contenthub parent pom has been adjusted. All dependencies have been
moved into the Stanbol parent.
- helper/cnn-importer has been repackaged under crawler/cnn.
- api has been repackaged under servicesapi.
- Sling-based unit and integration tests are on the way.
All the best,
Anil.
Re: New features for Contenthub
Posted by Ali Anil Sinaci <a....@gmail.com>.
Hi Fabian,
Web GUI and documentation are the next steps in our plan. Afterwards we
will prepare a demo for these features.
Best,
Anil.
On 01/26/2012 08:35 PM, Fabian Christ wrote:
> Hi Ali and Suat,
>
> sorry for the mistake. My mail was meant to be addressed in reply to Ali ;)
>
> On 26 January 2012 19:33, Fabian Christ <ch...@googlemail.com> wrote:
>> Hi Suat,
>>
>> this is a really impressive list of changes and features. Do you have
>> plans regarding documentation, demos, tutorials?
>>
>> Best,
>> - Fabian
>> --
>> Fabian
>> http://twitter.com/fctwitt
>
>
Re: New features for Contenthub
Posted by Fabian Christ <ch...@googlemail.com>.
Hi Ali and Suat,
Sorry for the mistake. My mail was meant to be addressed to Ali ;)
On 26 January 2012 19:33, Fabian Christ <ch...@googlemail.com> wrote:
> Hi Suat,
>
> this is a really impressive list of changes and features. Do you have
> plans regarding documentation, demos, tutorials?
>
> Best,
> - Fabian
> --
> Fabian
> http://twitter.com/fctwitt
--
Fabian
http://twitter.com/fctwitt
Re: New features for Contenthub
Posted by Fabian Christ <ch...@googlemail.com>.
Hi Suat,
this is a really impressive list of changes and features. Do you have
plans regarding documentation, demos, tutorials?
Best,
- Fabian
--
Fabian
http://twitter.com/fctwitt
Re: New features for Contenthub
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Anil
That looks great. I will do a much more detailed review of all that.
On Thu, Jan 26, 2012 at 4:46 PM, Ali Anil Sinaci <a....@gmail.com> wrote:
> (Current
> dbpedia index does not include most of these properties. We have a larger
> index for this)
>
Can you please upload this index to
http://dev.iks-project.eu/downloads/stanbol-indices/
Give me a short ping on IRC if you do not have an account yet.
best
Rupert
--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen