Posted to dev@stanbol.apache.org by Olivier Grisel <ol...@ensta.org> on 2011/02/15 17:43:21 UTC

Setting up the entityhub in offline mode

Hi Rupert,

I would like to upgrade https://stanbol.demo.nuxeo.org to include
the entityhub service in fully offline / standalone mode (with a local
solr index of DBpedia, for instance). I would also like to be able to
upgrade the Nuxeo / Stanbol connector to use the entityhub API
(instead of the direct DBpedia dereferencer currently implemented).

Could you please tell me how to set this up (along with where to get a
copy of the precomputed solr index you are using for your tests)?
Ideally it would be best if you could update the README file in the
entityhub folder with that information.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: Setting up the entityhub in offline mode

Posted by Olivier Grisel <ol...@ensta.org>.
2011/2/15 Rupert Westenthaler <ru...@gmail.com>:
>> As soon as you have such a howto ready I would be glad to write a
>> bunch of pig scripts to build indexes for topics (rather than
>> entities) so as to be able to perform document-level topic assignment
>> rather than occurrence-based entity lookups.
>>
> OK I do not really understand what you mean by that.

Ok, let me explain. In the old autotagging enhancer, there is a tool
that runs "more like this" similarity queries to find the main topic
of a complete document or paragraph (without first using opennlp to
find occurrences of names). To make this usable we need to build a
topic index from the top skos categories available in DBpedia. Each
category document should contain a full-text indexed field with the
aggregated text content of the abstracts of the most popular entities
in that category. That way "more like this" will be able to tell that
a document is about "Economy of India" if it sees statistically
significant terms such as "rupee", "Tata Nano", "GDP" or "Bangalore".

To build such an index we need to compute joins between the category
membership data and the article abstracts. That would take a lot of
time using a triple store, so IMHO it's best to use Apache Pig scripts
run on a cluster of machines on EC2. This would be similar to what I
did to build new OpenNLP models here:

  http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: Setting up the entityhub in offline mode

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Tue, Feb 15, 2011 at 8:05 PM, Olivier Grisel
<ol...@ensta.org> wrote:
> Great. Do you think it would be possible to have a default
> configuration for a small index of the top 10000 entities as measured
> by popularity?

Yes, the indexer can be configured to build specialized indexes.
However, to make it really easy to use I would need to implement some
improvements.

To use the current version, have a look at the /entityhub/indexer/dbPedia bundle:

(1) Use "mvn assembly:assembly" to build the jar with all dependency
(2) copy the jar to a different directory (because otherwise mvn clean
might delete some files you do not want to be deleted)
(3) use "java -jar
org.apache.stanbol.entityhub.indexing.dbPedia-0.1-SNAPSHOT-jar-with-dependencies.jar
-h" to see options

Parameters:
The first parameter is the URL of the Solr Core used for indexing. You
will want to configure a dedicated core for the dbPedia index.
The second parameter is the path to a directory with the RDF dump of
DBPedia. Files can be found at "http://wiki.dbpedia.org/Downloads36".
Download the files you need and put them into a directory. The indexer
will automatically process all the files in that directory.
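
For example, something along these lines (the exact file names differ
between releases, so check the download page for what you actually
need; the ones below are only illustrative, and you may have to
decompress the bz2 archives first if the indexer does not read them
directly):

  mkdir dbpedia-dumps && cd dbpedia-dumps
  wget http://downloads.dbpedia.org/3.6/en/labels_en.nt.bz2
  wget http://downloads.dbpedia.org/3.6/en/short_abstracts_en.nt.bz2
  wget http://downloads.dbpedia.org/3.6/en/article_categories_en.nt.bz2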

Options:
 -i : this can be used to provide the file with the incoming link
counts. You probably know better than I how to create such files,
because you provided me with the one I used to create my index
("incoming-counts.tsv"). Note that this file is based on an older
version of the dbPedia dump; because of that, entities added in newer
dumps have no ranking and will be ignored during indexing.
 -ri : the minimum number of incoming links required for an entity to
be included in the index. This can be used to control the size of the
index.
 -s : This is very handy for resuming the indexing if you have already
completed the import of the RDF data.
 -r : Resume Mode. Can also be used to activate the entity ranking
based indexing mode (see NOTE below).

IMPORTANT NOTE: For building small indices (number of indexed entities
<< number of entities in the dataset) it will be faster to activate
the "-r" switch. The generic RDF Indexer has two modes for iterating
over the entities in the dataset: first, by iterating over all
triples, and second, by using the entity ranking (the file parsed by
the -i option). The first method is ~5 times faster than the second,
but if one only indexes a small subset of the entities, the entity
ranking based indexing mode will still be more efficient.
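
Putting it all together, a small index restricted to well linked
entities could be built with something like the following (the core
URL, the dump directory and the -ri threshold of 100 are just
examples, adjust them to your setup):

  java -jar org.apache.stanbol.entityhub.indexing.dbPedia-0.1-SNAPSHOT-jar-with-dependencies.jar \
       http://localhost:8983/solr/dbpedia \
       /path/to/dbpedia-dumps \
       -i incoming-counts.tsv -ri 100 -r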

On my laptop it took around 3 days to build the index, but this was
mainly limited by the ~100 IO operations/sec of the hard disk.

>
> I am also thinking of building maven artifacts to embed the opennlp
> models in version 1.5 without checking them into the Stanbol svn repo. I
> could help you bundle a set of small entity indexes.
>

That would be a cool thing to do. I am especially interested in finding
a good way to provide configurations for the Entityhub (especially to
provide a default config so that the entityhub can be used without any
required configuration).
Adding new Referenced Sites by copying special bundles to a config
directory (e.g. by using
http://felix.apache.org/site/apache-felix-file-install.html) would be
another great thing to do.

> Also, could you write a howto for building indexes? I think such a
> howto would better be written as a text file in the stanbol source
> tree, or better yet as a new documentation page for the stanbol
> website (using the markdown syntax), rather than as a new wiki page
> on the IKS wiki.
>
I do not plan to update the documentation on the IKS wiki.
Looking at the stanbol website and starting to move/adapt existing
documentation has been on my TODO list for some weeks. However, I fear
that I will only have time to start on this after the Semantic
Interaction Framework Hackathon, February 24th-26th in Vienna.

> As soon as you have such a howto ready I would be glad to write a
> bunch of pig scripts to build indexes for topics (rather than
> entities) so as to be able to perform document-level topic assignment
> rather than occurrence-based entity lookups.
>
OK I do not really understand what you mean by that.

best
Rupert


> --
> Olivier
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Setting up the entityhub in offline mode

Posted by Olivier Grisel <ol...@ensta.org>.
Great. Do you think it would be possible to have a default
configuration for a small index of the top 10000 entities as measured
by popularity?

I am also thinking of building maven artifacts to embed the opennlp
models in version 1.5 without checking them into the Stanbol svn repo. I
could help you bundle a set of small entity indexes.

Also, could you write a howto for building indexes? I think such a
howto would better be written as a text file in the stanbol source
tree, or better yet as a new documentation page for the stanbol
website (using the markdown syntax), rather than as a new wiki page
on the IKS wiki.

As soon as you have such a howto ready I would be glad to write a
bunch of pig scripts to build indexes for topics (rather than
entities) so as to be able to perform document-level topic assignment
rather than occurrence-based entity lookups.

-- 
Olivier

Re: Setting up the entityhub in offline mode

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Olivier

I will try to upload the precomputed indexes to my server. I already
tried this two weeks ago, but encountered some problems with files
larger than 2 GByte.
Tomorrow I will split them up into several files and see if that works.

Currently there are three precomputed solr indexes available:
 - dbPedia
 - geonames
 - dblp (http://dblp.uni-trier.de/): created today based on a request
from Andreas Gruber


Solr Server configuration:
All precomputed indexes use the SolrYard implementation. [1]
describes how to set up the SolrServer. Note that it is no longer
required to set up a separate SolrServer, because the SolrYard now
also supports an embedded solr server. To activate that, just
configure a file path instead of an http URL for the "Solr Server URL".

The archives with the precomputed indexes will contain a Solr Core.
Such cores need to be configured within the solr.xml as described in
[3].
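
A minimal sketch of such a setup (all paths, the archive name and the
core name are just examples):

  # unpack the precomputed core into the solr home directory
  mkdir -p /opt/stanbol/solr/dbpedia
  tar -xzf dbpedia-index.tar.gz -C /opt/stanbol/solr/dbpedia
  # register the core in /opt/stanbol/solr/solr.xml as described in [3], e.g.
  #   <core name="dbpedia" instanceDir="dbpedia" />
  # for the embedded server, use the file path instead of an http URL
  # as the "Solr Server URL" of the SolrYard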


Entityhub Configuration:

To activate the use of caches for a Referenced Site, two additional
steps are needed:

(1) Create the Cache Instance:
There are only two options to set:
" ": Here you need to provide the ID of the Yard used by this cache.
" ": This can be used to specify what data are stored in the cache.
These mappings are only used if entities loaded from the remote source
are cached locally, so for precomputed full caches this can be empty.

(2) Referenced Site (see also [2]):
Two parameters need to be changed to tell the Referenced Site to use a
configured cache:
"Cache Strategy": To use a precomputed cache, set this to "ALL"
"Cache ID": This is the ID of the Cache (the same as the ID of the Yard)

Note that when the "Cache Strategy" is set to ALL, than there is no
need to configure a "Dereferencer Impl" nor a "Searcher Impl". However
if configured they are used as fallback if the Cache is not active or
throws an error.


Predefined Entityhub configuration:

To make it easier I will provide a predefined configuration for the
entityhub. This can be copied to the config directory within the
sling folder.
This will not provide plug-and-play functionality, but is rather
intended to act as a starting point:
Users will need to
 - adapt some properties (e.g. the Solr Server URL).
 - deactivate/delete the Yards, Caches and Referenced Sites they do not want.
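
In practice that would boil down to something like this (the directory
names are only illustrative and depend on the launcher setup):

  # copy the predefined configuration into the config directory of the
  # Sling launcher, then adapt the properties as described above
  cp entityhub-default-config/* sling/config/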


I will provide the links to the indexes as soon as I have been able to
upload them to my server.

best
Rupert Westenthaler


[1] http://wiki.iks-project.eu/index.php/SolrYardConfiguration
[2] http://wiki.iks-project.eu/index.php/ReferencedSiteConfiguration
[3] http://wiki.apache.org/solr/CoreAdmin#Configuration

On Tue, Feb 15, 2011 at 5:43 PM, Olivier Grisel
<ol...@ensta.org> wrote:
> Hi Rupert,
>
> I would like to upgrade https://stanbol.demo.nuxeo.org to include
> the entityhub service in fully offline / standalone mode (with a local
> solr index of DBpedia, for instance). I would also like to be able to
> upgrade the Nuxeo / Stanbol connector to use the entityhub API
> (instead of the direct DBpedia dereferencer currently implemented).
>
> Could you please tell me how to set this up (along with where to get a
> copy of the precomputed solr index you are using for your tests)?
> Ideally it would be best if you could update the README file in the
> entityhub folder with that information.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen