Posted to dev@stanbol.apache.org by Michel Benevento <mb...@kanker.nl> on 2012/03/26 16:40:46 UTC

Namespaces accumulate on refresh

Hello,

As I am experimenting with various versions of my import file, I have changed my namespace URLs. But when I refresh the index, the old namespaces keep accumulating in my results, resulting in duplicates. Is this intended behavior? How can I get rid of these (cached?) results and return to a pristine state?

Thanks,
Michel

Re: Namespaces accumulate on refresh

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Michel

On Tue, Mar 27, 2012 at 9:55 AM, Michel Benevento <mb...@kanker.nl> wrote:
> Success!
>
> resources/tdb was the culprit, thank you Rupert.
>

good to hear

> PS Maybe it should be a setting in indexing.properties(?) if you want to override or append to an index?
>

That would be another possibility for solving this issue if the
named-graph approach does not work out. I prefer the named-graph
solution because it would work "magically", without requiring users to
provide any kind of configuration.

However, a property like that would be a good idea for
enabling/disabling the automatic deletion of the destination folder.
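Sketched in indexing/config/indexing.properties, such a switch might
look like the following (the property name is purely hypothetical;
nothing like it is implemented yet):

# hypothetical option, not implemented yet:
# true  -> clear indexing/destination and indexing/resource/tdb before indexing
# false -> append to the existing SolrIndex and TDB store (current behavior)
index.overwrite=true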

best
Rupert

>
> On 27 mrt. 2012, at 09:38, Rupert Westenthaler wrote:
>
>> [...]



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Namespaces accumulate on refresh

Posted by Michel Benevento <mb...@kanker.nl>.
Success!

resources/tdb was the culprit, thank you Rupert.

Michel


PS: Maybe there should be a setting in indexing.properties(?) for choosing whether to overwrite or append to an index?



On 27 mrt. 2012, at 09:38, Rupert Westenthaler wrote:

> Hi Michel
>
> [...]
>
> rm -rf indexing/resource/tdb
>
> [...]


Re: Namespaces accumulate on refresh

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Michel

Can you please try the following:

On Mon, Mar 26, 2012 at 5:51 PM, Michel Benevento <mb...@kanker.nl> wrote:

> rm ../stanbol/sling/datafiles/TZW.solrindex.zip
> sleep 5
> cd TZW
> rm -rf indexing/destination
> rm -rf indexing/dist

rm -rf indexing/resource/tdb

> java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index
> mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
>
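For clarity, the complete sequence with the added line in place:

rm ../stanbol/sling/datafiles/TZW.solrindex.zip
sleep 5
cd TZW
rm -rf indexing/destination
rm -rf indexing/dist
rm -rf indexing/resource/tdb   # also clear the Jena TDB store with the imported RDF data
java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index
mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles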

The "indexing/resource/tdb" folder contains the Jena TDB triplestore
with the imported RDF data. This data are kept in-between indexing
processes mainly because the time needed to import the RDF data is
typically approximately the same as needed for the indexing process.
Because of that it makes a lot of sense to reuse already imported RDF
data if you index RDF dumps (e.g. DBpedia).

In cases where the RDF data change, this default is not optimal,
because the changed dataset is appended to the data already present in
the Jena TDB store. This means that if you change or remove things in
your thesaurus, they will still be present within the triple store and
therefore also appear in the created index.

I must say that it is very confusing that users need to delete
something within the "indexing/resources" folder when they change the
RDF data, so I will create an issue to change this behavior. I think I
will try to create a named graph for each imported RDF file. This
would make it possible to automatically delete already existing data
within the Jena TDB store whenever a file with the same name is
imported again.
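
As a rough sketch of the idea using Jena's TDB command line tools (the
graph naming scheme and file name are illustrative assumptions; the
indexing tool would do the equivalent through the Jena API):

TDB=indexing/resource/tdb
GRAPH="urn:x-stanbol:indexing:myfile.rdf"            # one named graph per imported file
echo "DROP SILENT GRAPH <$GRAPH>" > drop.ru          # discard data from an earlier import
tdbupdate --loc="$TDB" --update=drop.ru
tdbloader --loc="$TDB" --graph="$GRAPH" myfile.rdf   # (re)load the file into its graph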

Can you please check and report back whether this is the cause of your problem?

Thanks in advance

best
Rupert

>
> On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:
>
>> [...]



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Namespaces accumulate on refresh

Posted by Michel Benevento <mb...@kanker.nl>.
Hi Rupert,

I'm sorry, I should have clarified: I already deleted both the dist and destination folders before reindexing; see below for the script I use. That didn't work. I have now resorted to reinitializing the entire indexing setup, reinstalling the jar, and rebuilding the index in /sling/indexes/default/..., and now I am OK. But this is definitely something inside Stanbol, as I have been judiciously deleting those indexing folders.

Thanks,
Michel

# remove the previously deployed index archive
rm ../stanbol/sling/datafiles/TZW.solrindex.zip
sleep 5
cd TZW
# clear the indexing output folders
rm -rf indexing/destination
rm -rf indexing/dist
# rebuild the index and deploy the new archive
java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index
mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles


On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:

> [...]


Re: Namespaces accumulate on refresh

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Michel
On 26.03.2012, at 16:40, Michel Benevento wrote:

> Hello,
> 
> As I am experimenting with various versions of my import file, I have changed my namespace URLs. But when I refresh the index, the old namespaces keep accumulating in my results, resulting in duplicates. Is this intended behavior? How can I get rid of these (cached?) results and return to a pristine state?
> 

I think I have an explanation for what you are seeing. Can you please check the following?

The indexing tool does NOT delete the "{indexing-root}/indexing/destination" folder. So if you index your data twice without deleting this folder, the new data will be appended. This would explain why you still see the data with the old namespaces. So please try to delete the indexing/destination folder and index again.

This behavior is not a bug but a feature, because it allows indexing multiple datasets. I am currently writing some documentation on that, so I will copy the related section to the end of this mail.

best
Rupert

- - -
### Indexing Datasets separately

This demo indexes all four datasets in a single step. However, this is not required: with a simple trick it is possible to index different datasets with different indexing configurations into the same target. This section describes how this can be achieved and why users might want to do it.

This demo uses Solr as the target of the indexing process. Theoretically there might be several possibilities, but currently this is the only available IndexingDestination implementation. The SolrIndex used to store the data is located at "{indexing-root}/indexing/destination/indexes/default/{name}". If this directory does not already exist, it is initialized by the indexing tool based on the SolrCore configuration in "{indexing-root}/indexing/config/{name}", or on the default SolrCore configuration if not present. However, if it already exists, then that core is used and the data of the current indexing process are added to the existing SolrCore.

Because of that it is possible to subsequently add information from different datasets to the same SolrIndex. However, users need to know that if different datasets contain the same entity (a resource with the same URI), the information from the second dataset will replace that of the first. Nonetheless, this would allow, in the given demo, creating separate configurations (e.g. mappings) for all four datasets while still ensuring the indexed data end up in the same SolrIndex.

This might be useful in situations where the same property (e.g. rdfs:label) is used by the different datasets in different ways. One could then create a mapping for dataset1 that maps rdfs:label > skos:prefLabel, and for dataset2 a mapping that ensures rdfs:label > skos:altLabel.
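
For example (a sketch using the mapping notation from the paragraph above; the exact file locations are assumptions):

# dataset1: indexing/config/mappings.txt
rdfs:label > skos:prefLabel

# dataset2: indexing/config/mappings.txt
rdfs:label > skos:altLabel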

Workflows like that can easily be implemented with shell scripts or by setting soft links in the file system, as sketched below.
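
A minimal sketch of such a workflow, assuming two indexing configurations in sibling folders dataset1 and dataset2 that should share one SolrIndex, with the indexer jar copied into each folder (all paths and names are illustrative):

INDEXER=org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar

# index the first dataset; this initializes
# dataset1/indexing/destination/indexes/default/{name}
(cd dataset1 && java -jar "$INDEXER" index)

# let the second configuration reuse the same destination via a soft link,
# then index again: entities with the same URI are replaced, others are added
rm -rf dataset2/indexing/destination
ln -s ../../dataset1/indexing/destination dataset2/indexing/destination
(cd dataset2 && java -jar "$INDEXER" index)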