Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2011/04/29 11:04:03 UTC

[jira] [Created] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Extendable indexing infrastructure for the Entityhub
----------------------------------------------------

                 Key: STANBOL-187
                 URL: https://issues.apache.org/jira/browse/STANBOL-187
             Project: Stanbol
          Issue Type: Improvement
          Components: Entity Hub
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


Currently the Entityhub includes some utilities to create indexes for DBpedia, geonames and DBLP. There is also a generic RDF indexer that is used by the DBpedia and DBLP utilities; however, this implementation is not extendable either, and it is not really suitable for adding the features requested by issues like STANBOL-92, STANBOL-93 and STANBOL-163.

The goal is to create an infrastructure that provides an implementation of
 - the indexing workflow
 - configuration and initialization
and defines interfaces that allow plugging in
 - different Data Sources
 - entity ranking implementations
 - entity data mapper (e.g. filtering some fields, schema translations ...)
 - indexing targets (the Yard that stores the indexed entities)

The existing indexing utilities need to be moved to the new infrastructure.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026933#comment-13026933 ] 

Rupert Westenthaler commented on STANBOL-187:
---------------------------------------------

Some initial Documentation:

- - -

Indexing API:

(0) IndexingComponent:
The parent interface of most of the following interfaces. This is used to set the configuration, start the initialization and close the component as soon as the indexing has finished.

(1) Indexing Source

Source information is divided into two categories:
 - Entity Data: provides the data for the entity (Representation)
 - Entity ID/Score: provides the id and score (e.g. pageRank) for the entity

There are two modes for indexing:
(a) Iterate over the data and look up/calculate the score
(b) Iterate over the entity ids/scores and look up the data

For (a) the following interfaces are used:
 - EntityDataIterable: to iterate over the entity data
 - EntityScoreProvider: to provide or calculate the score based on the entity data
This mode is optimal when the data are provided by a source that does not allow ID-based retrieval (e.g. a file). It is often also the preferred mode when one needs to index all entities.

For (b) the following interfaces are used:
 - EntityIterator: iterator over entity id and score
 - EntityDataProvider: used to look up the data for the entity based on the id
This mode is intended for indexing only a part of the entities provided by the source. The EntityIterator can be used to specify the entities to be indexed (e.g. based on a file providing the IDs of the entities to be indexed). This feature will be needed to resolve STANBOL-92, STANBOL-93 and STANBOL-163.
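
To make the difference between (a) and (b) concrete, here is a minimal sketch. The interface names are taken from the description above, but every method signature, the EntityData and EntityScore holder classes and the IndexingModes helper are assumptions made for illustration (a plain Map stands in for a Representation, and a plain Iterator stands in for the EntityIterator); this is not the Entityhub API.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins; the method signatures are assumptions, not the Entityhub API.
final class EntityData {
    final String id;
    final Map<String, Object> fields; // stands in for a Representation

    EntityData(String id, Map<String, Object> fields) {
        this.id = id;
        this.fields = fields;
    }
}

final class EntityScore {
    final String id;
    final float score;

    EntityScore(String id, float score) {
        this.id = id;
        this.score = score;
    }
}

interface EntityDataIterable { Iterator<EntityData> entityData(); }
interface EntityScoreProvider { float score(EntityData data); }
interface EntityDataProvider { EntityData lookup(String id); }

final class IndexingModes {
    /** Mode (a): iterate over the data and look up or calculate each score. */
    static Map<String, Float> indexByData(EntityDataIterable source, EntityScoreProvider scores) {
        Map<String, Float> indexed = new LinkedHashMap<>();
        Iterator<EntityData> it = source.entityData();
        while (it.hasNext()) {
            EntityData data = it.next();
            indexed.put(data.id, scores.score(data));
        }
        return indexed;
    }

    /** Mode (b): iterate over ids/scores and look up the data for each id. */
    static Map<String, Float> indexByIds(Iterator<EntityScore> entities, EntityDataProvider provider) {
        Map<String, Float> indexed = new LinkedHashMap<>();
        while (entities.hasNext()) {
            EntityScore entity = entities.next();
            EntityData data = provider.lookup(entity.id);
            if (data != null) { // ids unknown to the source are skipped (an assumption)
                indexed.put(data.id, entity.score);
            }
        }
        return indexed;
    }
}
```

In this sketch, mode (b) only touches the entities named by the iterator, which is what makes partial indexing possible.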


(2) Score Normaliser

This interface makes it possible to process the score values provided for entities (e.g. to calculate the pageRank based on the number of incoming links).
The Score Normaliser is an optional component. If one is present, it is applied to the score provided by the Indexing Source.
The Score Normaliser interface supports chaining of different instances (e.g. first calculating the natural log of the incoming links and then normalizing the returned values to the range [0..1]).
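
The chaining idea could look roughly like the following sketch. All type names here (SketchScoreNormaliser, SketchLogNormaliser, SketchRangeNormaliser) and the single-method signature are hypothetical and are not the interface from the indexing core bundle. Two further assumptions: ln(1 + x) is used instead of ln(x) so that zero incoming links map to 0, and null or negative scores are passed through unchanged.

```java
// Hypothetical, simplified interface; the real Score Normaliser API may differ.
interface SketchScoreNormaliser {
    Float normalise(Float score);
}

/** Applies ln(1 + score) after the (optional) previous normaliser in the chain. */
final class SketchLogNormaliser implements SketchScoreNormaliser {
    private final SketchScoreNormaliser previous; // applied first, may be null

    SketchLogNormaliser(SketchScoreNormaliser previous) {
        this.previous = previous;
    }

    @Override
    public Float normalise(Float score) {
        if (previous != null) {
            score = previous.normalise(score);
        }
        if (score == null || score < 0f) {
            return score; // assumption: missing or negative scores pass through
        }
        return (float) Math.log1p(score);
    }
}

/** Scales scores from [0..max] to [0..1] after the (optional) previous normaliser. */
final class SketchRangeNormaliser implements SketchScoreNormaliser {
    private final float max;
    private final SketchScoreNormaliser previous; // applied first, may be null

    SketchRangeNormaliser(float max, SketchScoreNormaliser previous) {
        if (!(max > 0f)) {
            throw new IllegalArgumentException("max must be > 0");
        }
        this.max = max;
        this.previous = previous;
    }

    @Override
    public Float normalise(Float score) {
        if (previous != null) {
            score = previous.normalise(score);
        }
        if (score == null || score < 0f) {
            return score; // assumption: missing or negative scores pass through
        }
        return Math.min(1f, score / max); // clamp values above max to 1
    }
}
```

A chain for "log first, then range" would then be built as new SketchRangeNormaliser((float) Math.log1p(maxLinks), new SketchLogNormaliser(null)), where maxLinks is the largest link count expected in the dataset.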

(3) EntityProcessor

This interface takes a Representation (data of the entity) as input and returns a modified version. This is an optional component.
The intention is to provide an extension point for services like schema translation and filters (for fields, languages, ...). An EntityProcessor that uses the FieldMapping functionality of the Entityhub is included.
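
As an illustration of such an extension point, the sketch below filters the fields of an entity. The SketchEntityProcessor interface, its process method and the FieldFilterProcessor class are hypothetical, and a plain Map is used in place of a Representation; the included FieldMapping-based processor is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the EntityProcessor interface; a Map replaces Representation.
interface SketchEntityProcessor {
    Map<String, Object> process(Map<String, Object> representation);
}

/** Keeps only the configured fields and leaves the input untouched. */
final class FieldFilterProcessor implements SketchEntityProcessor {
    private final Set<String> allowedFields;

    FieldFilterProcessor(Set<String> allowedFields) {
        this.allowedFields = allowedFields;
    }

    @Override
    public Map<String, Object> process(Map<String, Object> representation) {
        Map<String, Object> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : representation.entrySet()) {
            if (allowedFields.contains(field.getKey())) {
                filtered.put(field.getKey(), field.getValue());
            }
        }
        return filtered;
    }
}
```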

(4) IndexingDestination

This interface is used to get the Yard (storage component of the Entityhub) to store the processed entities. In addition it defines a method that is used by the indexer to tell the destination that the indexing has finished. Implementations need to support the creation of distribution files used to load the indexed data into the Entityhub.


Indexing Process:

The indexing process is defined by the Indexer interface and implemented by the IndexerImpl. Indexer instances are created by using the IndexerFactory.

The process defines the following states:
 - UNINITIALISED: All components are present and configured but not yet initialized
 - INITIALISING: During the initialization
 - INITIALISED: The initialization of the components has finished. Ready to start the indexing
 - INDEXING: During the indexing process
 - INDEXED: The indexing of the entities has finished
 - FINALISING: During the finalization phase (e.g. creating the distribution files)
 - FINISHED: The indexing has finished.

The Indexer interface provides the index() method, which performs the whole process with a single method call. It also defines methods to perform the individual steps of the indexing process:
 - initialiseIndexingSources(): UNINITIALISED > INITIALISED
 - indexAllEntities(): INITIALISED > INDEXED
 - finaliseIndexingTarget(): INDEXED > FINISHED

All these methods block until the target state is reached. The index() method can be called in any of the states UNINITIALISED, INITIALISED and INDEXED and blocks until the FINISHED state is reached.
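
The state handling described above can be sketched as a small state machine. The state and method names follow the description; the IndexerState enum, the IndexerSketch class, the history list and the choice to throw an IllegalStateException when a step is called in the wrong state are assumptions of this sketch, not the behaviour of IndexerImpl.

```java
import java.util.ArrayList;
import java.util.List;

enum IndexerState {
    UNINITIALISED, INITIALISING, INITIALISED, INDEXING, INDEXED, FINALISING, FINISHED
}

// Illustrative only: the actual work of each step is left out.
final class IndexerSketch {
    private IndexerState state = IndexerState.UNINITIALISED;
    final List<IndexerState> history = new ArrayList<>(); // states entered so far

    IndexerState getState() {
        return state;
    }

    void initialiseIndexingSources() {
        step(IndexerState.UNINITIALISED, IndexerState.INITIALISING, IndexerState.INITIALISED);
    }

    void indexAllEntities() {
        step(IndexerState.INITIALISED, IndexerState.INDEXING, IndexerState.INDEXED);
    }

    void finaliseIndexingTarget() {
        step(IndexerState.INDEXED, IndexerState.FINALISING, IndexerState.FINISHED);
    }

    /** Runs all remaining steps; callable in UNINITIALISED, INITIALISED and INDEXED. */
    void index() {
        if (state != IndexerState.UNINITIALISED
                && state != IndexerState.INITIALISED
                && state != IndexerState.INDEXED) {
            throw new IllegalStateException("index() not allowed in state " + state);
        }
        if (state == IndexerState.UNINITIALISED) initialiseIndexingSources();
        if (state == IndexerState.INITIALISED) indexAllEntities();
        if (state == IndexerState.INDEXED) finaliseIndexingTarget();
    }

    private void step(IndexerState from, IndexerState during, IndexerState to) {
        if (state != from) {
            throw new IllegalStateException("expected " + from + " but was " + state);
        }
        state = during;
        history.add(during);
        // ... the blocking work of this step would run here ...
        state = to;
        history.add(to);
    }
}
```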

The indexing process uses the producer/consumer pattern, where
 - Indexing Source produces Indexed Entities
 - Entity Processor consumes Indexed Entities and produces Processed Entities
 - Indexing Destination consumes Processed Entities and produces Finished Entities
 - an internal component consumes Finished Entities and provides a status update every 10000 indexed entities
In addition, every component can produce errors, which are processed (currently only logged) by an Error Processor.
An interface for registering a custom error-handling component will be added later.

Currently a single thread is used for each component, but the implementation would already support the use of multiple threads (e.g. to process entities). Note, however, that the different steps run concurrently. BlockingQueues are used to buffer some entities between the steps.
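
A minimal sketch of this producer/consumer set-up, with one thread per step and a bounded BlockingQueue between steps, is shown below. PipelineSketch, the queue capacity of 2, the use of plain strings as entities, upper-casing as the "processing" step and the end-of-stream marker are all assumptions made for illustration; how IndexerImpl signals the end of the stream is not described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class PipelineSketch {
    // End-of-stream marker, compared by identity (an assumption of this sketch).
    private static final String END = new String("END");

    static List<String> run(List<String> entityIds) {
        BlockingQueue<String> indexed = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> processed = new ArrayBlockingQueue<>(2);
        List<String> stored = new ArrayList<>();

        // Indexing Source: produces entities.
        Thread source = new Thread(() -> {
            for (String id : entityIds) put(indexed, id);
            put(indexed, END);
        });
        // Entity Processor: consumes entities and produces processed entities.
        Thread processor = new Thread(() -> {
            for (String e = take(indexed); e != END; e = take(indexed)) {
                put(processed, e.toUpperCase(Locale.ROOT));
            }
            put(processed, END);
        });
        // Indexing Destination: consumes processed entities and stores them.
        Thread destination = new Thread(() -> {
            for (String e = take(processed); e != END; e = take(processed)) {
                stored.add(e);
            }
        });

        Thread[] steps = {source, processor, destination};
        for (Thread t : steps) t.start();
        for (Thread t : steps) join(t); // all three steps run concurrently until here
        return stored;
    }

    private static void put(BlockingQueue<String> queue, String item) {
        try {
            queue.put(item); // blocks while the buffer is full
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
    }

    private static String take(BlockingQueue<String> queue) {
        try {
            return queue.take(); // blocks while the buffer is empty
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
    }

    private static void join(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
    }
}
```

The bounded queues keep a fast source from running far ahead of a slow destination, since put blocks once the buffer is full.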

Configuration of the Indexing Process:

The configuration of the IndexingProcess is based on the following file structure

/indexing -> the root folder
/indexing/config -> the folder holding all the configuration
/indexing/config/indexing.properties -> the main configuration file
/indexing/resources/ -> provides the resources for the indexing process (e.g. the Files with the entity data, scores, schema definitions …)
/indexing/destination/ -> stores data created by the indexing process (e.g. the Solr Index with the indexed entities)
/indexing/dist/ -> contains the files needed to load the indexed data into the Entityhub

Some details on the "indexing.properties" file:

It uses the following syntax:
{key}={value1},{param1}:{paramValue1},{param2}:{paramValue2};{value2}…

keys:
 - Supported keys are defined in IndexingConstants
 - Full UTF-8 can be used for keys (java.util.Properties is NOT used for parsing)

value:
 - multiple values are separated by ';'
 - parameters can be added to values. The first parameter starts after the first ','

param:
 - multiple parameters are separated by ','
 - The first ':' is used to separate the parameter name from the parameter value.
 - A parameter is not required to have a value
 - the indexing configuration defines some parameters that can be used with every configuration. Other parameters are not processed but passed to the component associated with the current value. (see setConfiguration method in the IndexingComponent interface)
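
The syntax rules above can be sketched as a small parser for a single configuration line. ConfigLineSketch and its nested Value class are hypothetical names and this is not the parser used by the indexing core; in particular, storing a parameter that has no ':' with a null value, and silently skipping empty values or parameters, are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ConfigLineSketch {
    static final class Value {
        final String value;
        final Map<String, String> params = new LinkedHashMap<>();

        Value(String value) {
            this.value = value;
        }
    }

    final String key;
    final List<Value> values = new ArrayList<>();

    private ConfigLineSketch(String key) {
        this.key = key;
    }

    static ConfigLineSketch parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 1) {
            throw new IllegalArgumentException("expected {key}={value}: " + line);
        }
        ConfigLineSketch parsed = new ConfigLineSketch(line.substring(0, eq).trim());
        // multiple values are separated by ';'
        for (String part : line.substring(eq + 1).split(";")) {
            // the first parameter starts after the first ','
            String[] tokens = part.trim().split(",");
            if (tokens.length == 0 || tokens[0].trim().isEmpty()) {
                continue; // skip empty values
            }
            Value value = new Value(tokens[0].trim());
            for (int i = 1; i < tokens.length; i++) {
                String token = tokens[i].trim();
                if (token.isEmpty()) {
                    continue;
                }
                // the first ':' separates the parameter name from its value
                int colon = token.indexOf(':');
                if (colon < 0) {
                    value.params.put(token, null); // assumption: no ':' means no value
                } else {
                    value.params.put(token.substring(0, colon).trim(),
                            token.substring(colon + 1).trim());
                }
            }
            parsed.values.add(value);
        }
        return parsed;
    }
}
```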

special parameter:

"config": the value of this parameter is used to load additional properties from a config file in the "/indexing/config" directory.
E.g. the configuration

scoreNormalizer=org.apache.stanbol.entityhub.indexing.core.normaliser.RangeNormaliser,config:range

would load the configuration from the file "/indexing/config/range.properties" and pass it to the RangeNormaliser instance.
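
The file lookup implied by this example can be sketched as follows. ConfigFileSketch and its resolve method are hypothetical; the sketch only derives the path from the "config" parameter value and does not show how the file is read.

```java
import java.nio.file.Path;

final class ConfigFileSketch {
    /** Maps a "config" parameter value to {root}/config/{value}.properties. */
    static Path resolve(Path indexingRoot, String configParam) {
        return indexingRoot.resolve("config").resolve(configParam + ".properties");
    }
}
```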

NOTE: the IndexingConfig instance is also passed to the components by using the key "indexingConfig" (IndexingConfig.KEY_INDEXING_CONFIG).

The unit tests within the indexing core bundle are a good starting point for exploring how to use/configure the new indexing infrastructure. As soon as the current indexing utilities are moved to this new infrastructure they will provide even better examples.

best
Rupert Westenthaler

[jira] [Commented] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043294#comment-13043294 ] 

Rupert Westenthaler commented on STANBOL-187:
---------------------------------------------

(2) is working as intended. If I removed empty lines and comments in that view, then one could no longer change them without removing all comments in the original configuration.
(1) looks like a bug in the FieldQuery implementation of the SolrYard. I will check that.

[jira] [Commented] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Posted by "Florent ANDRE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043272#comment-13043272 ] 

Florent ANDRE commented on STANBOL-187:
---------------------------------------

Hi Rupert, 

Thanks for this useful addition; it works well (faster than a D2RQ call) and is very simple (the auto-install of configured bundles is perfect!).

I have 2 remarks though (Last Changed Rev: 1130735):

1) When indexing a SKOS file, only multi-word terms are indexed, not single-word terms. I observed this first on my own thesaurus and then also in the IPTC one. I tried this request
$ curl -X POST -F "query=@fieldQuery.json" http://localhost:8080/entityhub/site/iptc/query 
with these queries:
1.A) @fieldQuery.json = 
{
    "offset": "0",
    "limit": "30",
    "constraints": [
        {
            "type": "value",
            "field": "http:\/\/www.w3.org\/2004\/02\/skos\/core#prefLabel",
            "value": "Africa"
        }
    ]
}

==> outputs no results

1.B) @fieldQuery.json =
{
    "offset": "0",
    "limit": "30",
    "constraints": [
        {
            "type": "value",
            "field": "http:\/\/www.w3.org\/2004\/02\/skos\/core#prefLabel",
            "value": "South America"
        }
    ]
}

==> outputs results.

"Africa" and "South America" are skos:prefLabel values in world-region.rdf in the IPTC dataset.

2) When opening the "Entity hub referenced site configuration" for an imported site in the Felix/Sling console configuration, the "Fields mapping" part contains the whole mapping.txt file, including blank and comment (#) lines, and not only the mappings. This may be expected.

Cheers.
++


[jira] [Commented] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Posted by "Rupert Westenthaler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181401#comment-13181401 ] 

Rupert Westenthaler commented on STANBOL-187:
---------------------------------------------

The only open point is:

Port the geonames.org indexer: It does not use RDF, but directly reads/processes the DB dumps. Therefore a customized Indexing Source has to be implemented based on the current implementation.
                
[jira] [Resolved] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Posted by "Rupert Westenthaler (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-187.
-----------------------------------------

    Resolution: Fixed

Marking as resolved. No intention to port the geonames indexer for now.
                
[jira] [Commented] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043060#comment-13043060 ] 

Rupert Westenthaler commented on STANBOL-187:
---------------------------------------------

The indexing utilities now also create a bundle that, when installed, adds and configures all the components needed to use the indexed data with the Stanbol Entityhub.

The two files needed are in the "/indexing/dist" folder.

To install an index:

1. Copy the "{name}.solrindex.zip" to the "/sling/datafiles" folder within the home directory of your running Apache Stanbol instance.
2. Install the bundle "org.apache.stanbol.data.site.{name}-1.0.0.jar":
 * Go to the OSGi Webconsole (http://{host}:{port}/system/console/bundles)
 * Click on "Install/update…"
 * Add this bundle to the dialog and activate the "Start Bundle" option
 * Reload the page. You should now see a bundle with the name "Apache Stanbol Data: iptc (org.apache.stanbol.data.site.{name})" and the status "Active"
 * The indexed dataset is now available as a ReferencedSite at "http://{host}:{port}/entityhub/site/{name}"
3. If you want, you can now delete the "{name}.solrindex.zip" in the "/sling/datafiles" folder.



[jira] [Commented] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033965#comment-13033965 ] 

Rupert Westenthaler commented on STANBOL-187:
---------------------------------------------

Status Update

The new indexing infrastructure is functional. The indexers for DBLP and DBpedia are already ported.
The old generic RDF indexer was deleted and is no longer used.

Still open:
 
(1) Port the geonames.org indexer: It does not use RDF, but directly reads/processes the DB dumps. Therefore a customized Indexing Source has to be implemented based on the current implementation.
(2) Add support to the Solr Yard destination to create an OSGi configuration file for the SolrYard that loads the index based on the Solr Archive or Solr Archive Reference (the two files already created in the distribution folder).


