Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/04/24 20:47:35 UTC

[jira] [Created] (STANBOL-593) EntityIterator implementation based on Jena TDB that allows to filter Entities based on Triple Filters

Rupert Westenthaler created STANBOL-593:
-------------------------------------------

             Summary: EntityIterator implementation based on Jena TDB that allows to filter Entities based on Triple Filters
                 Key: STANBOL-593
                 URL: https://issues.apache.org/jira/browse/STANBOL-593
             Project: Stanbol
          Issue Type: New Feature
          Components: Entity Hub
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler
             Fix For: 0.10.0-incubating


The FieldValueProcessor (an EntityProcessor) can already filter Entities based on Triple Filters. However, this requires iterating over all Entities, which is very inefficient if one only wants to index a small fraction of them.

To achieve better performance in such cases, one needs a component that uses similar functionality to filter Entities within the Indexing Source. Such a component is easy to implement on top of Jena TDB, as its low-level API natively supports filtered iterators.

Indexing configurations would then use an EntityIterator/EntityDataProvider combination as the source for the indexing. A corresponding configuration would look like this:


    entityIdIterator=org.apache.stanbol.entityhub.indexing.source.jenatdb.ResourceFilterIterator,config:entityTypes.properties
    entityDataProvider=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata

The entityTypes.properties file would require the following properties:

    field=rdf:type
    values=dbp-ont:Person;dbp-ont:Place;dbp-ont:Organisation

With this configuration, the indexing process would iterate only over the Persons, Places and Organisations present within the IndexingSource.
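
To make the intended filter semantics concrete, here is a minimal Java sketch. It is not the Stanbol ResourceFilterIterator and does not use the Jena TDB API: the class TripleFilterSketch, its nested Triple type and its method names are hypothetical, and the triples live in a plain in-memory list. It only assumes what the example above shows, namely that "field" names a property and "values" is a semicolon-separated list of accepted values. Namespace prefixes are compared as literal strings rather than expanded.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

/** Illustrative only: a hypothetical stand-in, not the Stanbol implementation. */
final class TripleFilterSketch {

    /** Hypothetical minimal triple; a real indexing source would use Jena nodes. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Loads a configuration in the format of the entityTypes.properties example. */
    static Properties loadConfig(String text) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    /** Splits the 'values' property on ';' (assumed separator, taken from the example). */
    static Set<String> parseValues(Properties config) {
        Set<String> values = new LinkedHashSet<>();
        for (String value : config.getProperty("values", "").split(";")) {
            String trimmed = value.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    /**
     * Returns the subjects that have at least one triple whose predicate equals
     * 'field' and whose object is one of 'values', in first-seen order.
     */
    static Set<String> filterSubjects(List<Triple> triples, Properties config) {
        String field = config.getProperty("field");
        Set<String> values = parseValues(config);
        Set<String> subjects = new LinkedHashSet<>();
        for (Triple triple : triples) {
            if (triple.predicate.equals(field) && values.contains(triple.object)) {
                subjects.add(triple.subject);
            }
        }
        return subjects;
    }
}
```

Note that this sketch still scans every triple in the list. The performance gain described in this issue would come from letting the triple store evaluate the filter, so that only matching triples are ever visited.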


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (STANBOL-593) EntityIterator implementation based on Jena TDB that allows to filter Entities based on Triple Filters

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/STANBOL-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-593.
-----------------------------------------

    Resolution: Fixed

Implemented with revision #1330107.
                
