Posted to dev@stanbol.apache.org by Florent André <fl...@apache.org> on 2011/06/01 10:16:03 UTC

Re: How to add an rdf/skos file in entity hub

Hi Rupert,

Thanks for your valuable answers!

If I get it now, indexing in the Entityhub is not just about building an
index, but about creating a new (offline) entity hub.

You said :
 > The Solr Yard provides better performance especially for big Datasets.
...
 > The Clerezza Yard is fine for smaller data sets.

Do you have a "magic number" (a vague one will be fine :) ) that defines the
limit for a big dataset?

Looking forward to trying your indexer utility configuration for generic
RDF files. If I can help, let me know.

++

On 05/28/2011 06:58 PM, Rupert Westenthaler wrote:
> Hi florent
>
> On Sat, May 28, 2011 at 5:50 PM, florent andré
> <fl...@4sengines.com>  wrote:
>> Hi Stanbolers,
>>
>> I have a big SKOS file at hand.
>> My main question is: how can I create an entity site with this file and
>> play with it?
>>
>> I have read the READMEs about indexing [1] and data sets [2], but I'm not
>> sure I understand everything:
>> - Can I skip setting up a SPARQL server, or is it required?
>
> If you index the SKOS file, then you do not need a SPARQL server.
>
>> - What is the goal of indexing? To speed up entity detection in text, or
>> to speed up serving RDF entity representations?
>
> * To get all the information into the Entityhub
> * To support full-text queries based on the labels and descriptions
> * To use the information for the detection of entities in the text
>
>> - Is the Stanbol Data File Provider related to the Yard or not?
>
> * It is only used to load binary files that are too big to be managed in SVN
> or included within a bundle.
> * It is used, e.g., to load pre-computed Solr indexes or language models
> for OpenNLP.
> * The Data File Provider is only used for configuration.
>
>> Something like a local dump of an RDF store?
>
> Yes, for loading big RDF files into Stanbol (e.g. into a Clerezza RDF store)
> one would use the Data File Provider. However, such a feature is currently
> not supported.
>
>>
>> A side question is about the differences between the Clerezza and Solr Yards:
>> what are they? Performance? Functionality? ...
> The Solr Yard provides better performance, especially for big datasets.
> It does not support regex constraints for text queries.
> The Clerezza Yard is fine for smaller data sets. The RDF store used can be
> controlled by the Clerezza configuration. Therefore you have a lot of
> possibilities for how to store your data.
>
> There is a plan to implement a "hybrid" Yard that uses the Clerezza
> Yard implementation for storage and the Solr Yard implementation for
> queries.
>
>>
>> Thanks for any pointers, RTFM links,...
>> ++
>
> While answering this mail I realized that there is currently no
> indexer utility configuration for generic RDF files.
> I will add such a configuration in the coming days. This should also
> be fine for indexing SKOS files.
>
>
> best
> Rupert
>
>>
>> [1]
>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>> [2]
>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/data/sites/dbpedia/README.md
>>
>
>
>

Re: How to add an rdf/skos file in entity hub

Posted by Olivier Grisel <ol...@ensta.org>.

2011/6/1 Rupert Westenthaler <ru...@gmail.com>:
> Hi
>
> Based on your request I have worked for the last two days on several
> improvements to the Indexing Tool.
> Most importantly, the Indexing Util now directly creates a Bundle that,
> when installed in the Entityhub, will create all the necessary
> Entityhub components to use the indexed RDF data as a Referenced Site.
>
> I have also created the generic RDF configuration with a lot of
> additional documentation.
>
> I am currently working on some final things. So expect to see the
> stuff in the SVN tomorrow.

Oops, I did not know that you had pending work on these projects. I
did some "svn mv" to fix typos in some filenames. I hope this won't
conflict too much with your work. I should have asked before doing
it...

Also for the topic categorization, I am currently upgrading my Pig
scripts to be able to output .nt files directly suitable for the
Stanbol indexing utility:

  https://github.com/ogrisel/pignlproc/tree/master/examples/topic-corpus

I hope I will have time to test it and document it by next week to be
able to demonstrate it during the Berlin Buzzwords hackathon.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: How to add an rdf/skos file in entity hub

Posted by florent andré <fl...@4sengines.com>.

Sweeeeeeet! :) Looking forward to trying this out.

This will be a great tool!

PS: I found this dataset that may interest some of you:
http://www.mpi-inf.mpg.de/yago-naga/yago/

++

On 06/01/2011 04:40 PM, Rupert Westenthaler wrote:
> Hi
>
> Based on your request I have worked for the last two days on several
> improvements to the Indexing Tool.
> Most importantly, the Indexing Util now directly creates a Bundle that,
> when installed in the Entityhub, will create all the necessary
> Entityhub components to use the indexed RDF data as a Referenced Site.
>
> I have also created the generic RDF configuration with a lot of
> additional documentation.
>
> I am currently working on some final things. So expect to see the
> stuff in the SVN tomorrow.
>
> best
> Rupert Westenthaler

Re: How to add an rdf/skos file in entity hub

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi

Based on your request I have worked for the last two days on several
improvements to the Indexing Tool.
Most importantly, the Indexing Util now directly creates a Bundle that,
when installed in the Entityhub, will create all the necessary
Entityhub components to use the indexed RDF data as a Referenced Site.

I have also created the generic RDF configuration with a lot of
additional documentation.

I am currently working on some final things. So expect to see the
stuff in the SVN tomorrow.

best
Rupert Westenthaler

On Wed, Jun 1, 2011 at 10:32 AM, Olivier Grisel
<ol...@ensta.org> wrote:
> 2011/6/1 Florent André <fl...@apache.org>:
>> Hi Rupert,
>>
>> Thanks for your valuable answers!
>>
>> If I get it now, indexing in the Entityhub is not just about building an
>> index, but about creating a new (offline) entity hub.
>>
>> You said :
>>> The Solr Yard provides better performance especially for big Datasets.
>> ...
>>> The Clerezza Yard is fine for smaller data sets.
>>
>> Do you have a "magic number" (a vague one will be fine :) ) that defines the
>> limit for a big dataset?
>
> The SolrYard implementation should be pretty scalable (tens or
> hundreds of millions of entities). The ClerezzaYard will suffer from a
> limitation though. It won't scale to more than a couple of
> thousand entities as long as the following is not fixed:
>
>  https://issues.apache.org/jira/browse/CLEREZZA-466
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: How to add an rdf/skos file in entity hub

Posted by Olivier Grisel <ol...@ensta.org>.

2011/6/1 Florent André <fl...@apache.org>:
> Hi Rupert,
>
> Thanks for your valuable answers!
>
> If I get it now, indexing in the Entityhub is not just about building an
> index, but about creating a new (offline) entity hub.
>
> You said :
>> The Solr Yard provides better performance especially for big Datasets.
> ...
>> The Clerezza Yard is fine for smaller data sets.
>
> Do you have a "magic number" (a vague one will be fine :) ) that defines the
> limit for a big dataset?

The SolrYard implementation should be pretty scalable (tens or
hundreds of millions of entities). The ClerezzaYard will suffer from a
limitation though. It won't scale to more than a couple of
thousand entities as long as the following is not fixed:

  https://issues.apache.org/jira/browse/CLEREZZA-466
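
To see which side of that rough limit a dataset falls on, one can count the
distinct subjects in an N-Triples dump before choosing a Yard. The snippet
below is only a sketch and is not part of Stanbol: the function names and the
5000-entity threshold are assumptions made for illustration (the figure given
above is just "a couple of thousand"), and the parsing is a simple
first-token split, not a full N-Triples parser.

```python
# Hypothetical helper, not part of Stanbol: estimate the number of entities
# in an N-Triples dump and suggest a Yard based on a rough, assumed threshold.

# Assumed cut-off; the thread only says "a couple of thousand entities".
CLEREZZA_ENTITY_LIMIT = 5000


def count_entities(lines):
    """Count distinct subjects in an iterable of N-Triples lines.

    Each non-empty, non-comment line is expected to start with its subject
    (an IRI in angle brackets or a blank node such as _:b0), followed by
    whitespace.
    """
    subjects = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subjects.add(line.split(None, 1)[0])
    return len(subjects)


def suggest_yard(entity_count, limit=CLEREZZA_ENTITY_LIMIT):
    """Return the name of the Yard that the rough sizing rule points to."""
    return "ClerezzaYard" if entity_count <= limit else "SolrYard"


if __name__ == "__main__":
    import sys

    # Usage: python count_entities.py dump.nt
    with open(sys.argv[1], encoding="utf-8") as handle:
        total = count_entities(handle)
    print(total, "entities ->", suggest_yard(total))
```

For a SKOS file in RDF/XML or Turtle, the data would first have to be
converted to N-Triples with an RDF tool of your choice.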

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel