Posted to dev@stanbol.apache.org by Andrea Di Menna <ni...@gmail.com> on 2012/11/16 14:18:32 UTC

Stanbol indexing tool

Hi,
I have a question regarding the different phases of the indexing
process when using the indexing tool bundled with Stanbol.
If I am not mistaken, Stanbol first creates a TDB store and then
builds a Solr index.
On my machine the first phase seems slower than loading the triples
into a TDB store directly with tdbloader2 (note: I am using the
latest available version of Jena, namely 2.7.4, when running
tdbloader2 standalone).
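
For readers who want to script the standalone load mentioned above, here
is a minimal sketch that only assembles the command line. It assumes
tdbloader2 is on the PATH and relies on its --loc option to name the
target store directory; the function name and the example paths are
hypothetical and not taken from this thread.

```python
import shlex


def tdbloader2_command(tdb_dir, rdf_files, executable="tdbloader2"):
    """Build the argument list for a standalone tdbloader2 run.

    tdb_dir   -- directory in which the TDB store should be created
    rdf_files -- RDF dump files to load (hypothetical paths in the demo)
    """
    files = [str(f) for f in rdf_files]
    if not files:
        raise ValueError("at least one RDF file is required")
    # --loc tells tdbloader2 where the TDB store lives
    return [executable, "--loc", str(tdb_dir), *files]


# Example with made-up paths; print the command instead of running it.
cmd = tdbloader2_command("/data/tdb", ["/data/dump/part1.nt"])
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once the paths
point at real data.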

Is there any way to skip the TDB creation and jump to the second part,
so that I can create the TDB using the latest Jena available?

Cheers,
Andrea

Re: Stanbol indexing tool

Posted by Andrea Di Menna <ni...@gmail.com>.
Hi Rupert,

I have created a new index and everything seems to work ok.
I guess no changes in the binary data format have occurred between
2.6.3 and 2.7.4.

It took about 78 mins (TDB) + 35 mins (Solr) to process ~80M triples.
Moreover, I did not run into any memory issues, unlike when I run the
whole process with the Entityhub indexing tool.
In that case I usually had to restart the process at least twice
because of OutOfMemory exceptions.
Considering that I am using a machine with 16GB of RAM, it seems
something is wrong there...
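
As a rough sanity check on the timings above, the figures can be turned
into throughput numbers. This is only back-of-the-envelope arithmetic on
the approximate values quoted in this message (~80M triples, 78 and 35
minutes); the function name is made up for illustration.

```python
def triples_per_second(triples, minutes):
    """Average throughput for a phase that took `minutes` minutes."""
    return triples / (minutes * 60)


TRIPLES = 80_000_000   # approximate count quoted above
TDB_MINUTES = 78       # TDB loading phase
SOLR_MINUTES = 35      # Solr indexing phase

tdb_rate = triples_per_second(TRIPLES, TDB_MINUTES)
overall_rate = triples_per_second(TRIPLES, TDB_MINUTES + SOLR_MINUTES)

print(f"TDB load:   ~{tdb_rate:,.0f} triples/s")
print(f"End to end: ~{overall_rate:,.0f} triples/s "
      f"over {TDB_MINUTES + SOLR_MINUTES} min")
```

That works out to roughly 17,000 triples per second for the TDB phase and
a little under 12,000 triples per second end to end.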

cheers
Andrea

Re: Stanbol indexing tool

Posted by Rupert Westenthaler <ru...@gmail.com>.
The TDB database is located under

    {indexing-working-dir}/indexing/resources/tdb

If you already have a TDB store with the required data, then you can
place it in that directory. Just make sure that the

    {indexing-working-dir}/indexing/resources/rdfdata

folder is empty when you start the tool. Otherwise the RDF files in
that folder would get imported.
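
The two steps above could be scripted along these lines. This is a
sketch, not part of the indexing tool: the function name and its
arguments are hypothetical, and only the two directory names come from
the description above. It refuses to proceed if rdfdata still contains
files and does not overwrite a tdb directory that already has content.

```python
import shutil
from pathlib import Path


def install_prebuilt_tdb(working_dir, prebuilt_tdb):
    """Copy an existing TDB store into the indexing working directory.

    working_dir  -- the {indexing-working-dir} of the indexing tool
    prebuilt_tdb -- directory holding a TDB store built elsewhere
    """
    resources = Path(working_dir) / "indexing" / "resources"
    tdb_dir = resources / "tdb"
    rdfdata_dir = resources / "rdfdata"

    # RDF files left in rdfdata would be imported when the tool starts.
    if rdfdata_dir.is_dir() and any(rdfdata_dir.iterdir()):
        raise RuntimeError(f"{rdfdata_dir} must be empty")

    # Do not silently mix two stores in the same directory.
    if tdb_dir.is_dir() and any(tdb_dir.iterdir()):
        raise RuntimeError(f"{tdb_dir} already contains files")

    # Creates missing parent directories and copies the store files.
    shutil.copytree(prebuilt_tdb, tdb_dir, dirs_exist_ok=True)
    return tdb_dir
```

A call such as install_prebuilt_tdb("/path/to/working-dir",
"/path/to/my-tdb") would then be followed by starting the indexing tool
as usual.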

On Fri, Nov 16, 2012 at 2:18 PM, Andrea Di Menna <ni...@gmail.com> wrote:
> The first part of the process seems slower on my machine than
> loading triples in a TDB using directly tdbloader2 (Note: I am using
> the latest available version of Jena when running tdbloader2 standalone
> - namely 2.7.4).

Yes, the indexing tool uses

    com.hp.hpl.jena:jena:2.6.3
    com.hp.hpl.jena:arq:2.8.5
    com.hp.hpl.jena:tdb:0.8.7

but you could still try to use your datastore. Maybe the binary
format of the files has not changed.

If that does not work, let me know and I will try to update the Jena
version used by the indexing tool.

best
Rupert
