Posted to dev@stanbol.apache.org by Viktor Gal <wi...@maeth.com> on 2013/11/20 18:02:14 UTC

distributed indexing

Hi,

I've just started using Stanbol about a week ago, and I must say it's a great tool! Kudos to all the developers!

I'm now trying to import and index the latest Freebase data set, and it occurred to me that it would be great to add other indexer engine interfaces to Stanbol that can handle large corpora, like http://terrier.org/

As Terrier is MapReduce based (i.e. Hadoop), it would be great to have a MapReduce-based RDF store; that way we could easily calculate, for example, real PageRank values on the Freebase data set using Mahout's PageRank implementation.
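To make the idea concrete, here is a tiny single-machine sketch of the power iteration behind PageRank. The toy link graph is invented for illustration; a Freebase-scale run would of course go through Mahout on the Hadoop cluster rather than this loop:

```python
# Toy sketch of the PageRank power iteration that a Mahout job would run
# at scale; the tiny link graph below is invented for illustration only.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
damping = 0.85
n = len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):  # iterate until (approximately) converged
    new_rank = {page: (1.0 - damping) / n for page in links}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += damping * share
    rank = new_rank

# "C" ends up ranked highest: both "A" and "B" link to it.
print(sorted(rank, key=rank.get, reverse=True))
```

The point is that each iteration is just a scatter of rank mass along outgoing links, which is exactly the kind of per-edge work that MapReduce parallelizes well.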

Does anybody know of a good MapReduce-based RDF store? I've seen some people talking about HBase...

Of course this would require some work in both Terrier and Mahout, but then again, for data sets like Freebase it would make a lot of things faster/easier (if one has the cluster for it).

happy to see comments on this!

cheers,
viktor


Re: distributed indexing

Posted by Viktor Gal <wi...@maeth.com>.
Hi Rafa

On 20/11/2013, at 7:23 pm, Rafa Haro <rh...@apache.org> wrote:

> Hi Viktor and welcome to the Apache Stanbol community
> 
> El 20/11/13 18:02, Viktor Gal escribió:
>> Hi,
>> 
>> I've just started using Stanbol about a week ago, and I must say it's a great tool! Kudos to all the developers!
>> 
>> I'm now trying to import and index the latest Freebase data set, and it occurred to me that it would be great to add other indexer engine interfaces to Stanbol that can handle large corpora, like http://terrier.org/
> With the current indexer, you are going to need a highly equipped machine (preferably with SSD disks and/or several GB of RAM) for building the site. Rupert can give you more details, but AFAIK you would first of all need a lot of RAM for the entity-scoring step. After that, all the triples are stored in a JenaTDB-based triple store (which implies a huge amount of disk I/O) in order to allow some pre-processing (like LDPath-based entity filtering) before the entities are finally indexed in a Yard. So the computational problem occurs while storing the triples in JenaTDB, not while indexing the entities in a Yard (at least with a SolrYard).

Heheh, yeah, I've been through this. Luckily I had an SSD around to do the task, as I initially started with a simple 7200 RPM HDD and it would have taken ages... this way it only took 1.5 days ;P

> To begin with, site building (indexing) is not a task that you usually need to do very often, so in my honest opinion I don't know if it is worth having a distributed process for it; and after indexing, the current Yards seem to perform very well for searching. Also, with recent versions of Solr, or SolrCloud, it is possible to distribute the index.

The idea actually came up when I was talking with Rupert about generating the incoming-links file for Freebase. He told me it would be much better if we could actually calculate PageRank for the pages instead of using the current ./fbranking.sh shell script.
That's when I thought about using Mahout to do it, as it wouldn't be feasible with other libraries for such a data set.
And since that would require storing the RDF on HDFS anyhow, it came to my mind that in that case we could actually use some sort of HDFS-based storage (HBase?) to store the RDF, and then of course we could even use a Hadoop-based indexer like Terrier.

About the JenaTDB bottleneck: if the raw RDF resided on HDFS and there were an HDFS-based triple store, then one could use MapReduce to load the triples in parallel, i.e. split the data set among the Hadoop nodes and let each of them load its part into the distributed triple store.
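As a rough illustration of the split-and-load idea (the sample triples are invented, and plain in-memory dicts stand in for an HDFS-backed store like HBase):

```python
# Sketch of the split-and-load idea: partition an N-Triples dump by
# subject hash, so each "node" loads a disjoint slice in parallel.
# The in-memory dicts stand in for an HDFS-backed store such as HBase;
# the sample triples are invented.
from concurrent.futures import ThreadPoolExecutor

dump = [
    "<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .",
    "<http://ex.org/b> <http://ex.org/p> <http://ex.org/c> .",
    "<http://ex.org/a> <http://ex.org/q> \"literal\" .",
]

NUM_NODES = 2
partitions = [[] for _ in range(NUM_NODES)]
for line in dump:
    subject = line.split(" ", 1)[0]
    partitions[hash(subject) % NUM_NODES].append(line)

stores = [{} for _ in range(NUM_NODES)]  # one "triple store" per node

def load(node_id):
    # Each node loads only its own partition -- no coordination needed,
    # because the subject hash guarantees disjoint slices.
    for line in partitions[node_id]:
        s, p, o = line.rstrip(" .").split(" ", 2)
        stores[node_id].setdefault(s, []).append((p, o))
    return node_id

with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    list(pool.map(load, range(NUM_NODES)))

total = sum(len(v) for store in stores for v in store.values())
print(total)  # all 3 triples loaded, each by exactly one node
```

Hash-partitioning by subject also keeps all triples of one entity on the same node, which is exactly what a later per-entity indexing step would want.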

cheers,
viktor

>> As Terrier is MapReduce based (i.e. Hadoop), it would be great to have a MapReduce-based RDF store; that way we could easily calculate, for example, real PageRank values on the Freebase data set using Mahout's PageRank implementation.
>> 
>> Does anybody know of a good MapReduce-based RDF store? I've seen some people talking about HBase...
> That would be very nice in my opinion, although I'm still not sure about two things: how a distributed triple store would work, and whether it would really solve the storage problem. So far, we have experimented with graph databases like Neo4j that provide RDF-store capabilities through the Blueprints Sail implementation [1]. TitanDB and OrientDB are examples of distributed graph databases that also have Blueprints implementations, but we haven't tried them yet.
> 
> Regarding the JenaTDB bottleneck problem, I have been working on a workaround for indexing the entities in a Yard without passing through the triple store: something like streaming indexing, from the dump directly to the Yard. It means you won't be able to do some kinds of pre-processing, like LDPath filtering or transformations, but if you don't need them, the indexing time is significantly reduced. I should have committed it today, but I'm currently having issues with my Maven version for building Stanbol; as soon as I solve them, I will commit. It would be nice if someone else could test it.
> 
> Regards,
> Rafa
> 
> [1] - https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation
>> 
>> Of course this would require some work in both Terrier and Mahout, but then again, for data sets like Freebase it would make a lot of things faster/easier (if one has the cluster for it).
>> 
>> happy to see comments on this!
>> 
>> cheers,
>> viktor
>> 
> 


Re: distributed indexing

Posted by Rafa Haro <rh...@apache.org>.
Hi Viktor and welcome to the Apache Stanbol community

El 20/11/13 18:02, Viktor Gal escribió:
> Hi,
>
> I've just started using Stanbol about a week ago, and I must say it's a great tool! Kudos to all the developers!
>
> I'm now trying to import and index the latest Freebase data set, and it occurred to me that it would be great to add other indexer engine interfaces to Stanbol that can handle large corpora, like http://terrier.org/
With the current indexer, you are going to need a highly equipped machine (preferably with SSD disks and/or several GB of RAM) for building the site. Rupert can give you more details, but AFAIK you would first of all need a lot of RAM for the entity-scoring step. After that, all the triples are stored in a JenaTDB-based triple store (which implies a huge amount of disk I/O) in order to allow some pre-processing (like LDPath-based entity filtering) before the entities are finally indexed in a Yard. So the computational problem occurs while storing the triples in JenaTDB, not while indexing the entities in a Yard (at least with a SolrYard).
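To make that flow concrete, here is a toy model of the current path (all names and sample data are invented; plain dicts stand in for JenaTDB and the SolrYard):

```python
# Toy model of the current indexing path: every triple is first staged
# in a triple store (JenaTDB in Stanbol; a dict here), so that
# LDPath-style filtering can run before entities reach the Yard.
# All names and sample data are invented for illustration.
dump = [
    ("<http://ex.org/a>", "<http://ex.org/type>", "<http://ex.org/Person>"),
    ("<http://ex.org/a>", "<http://ex.org/name>", '"Ada"'),
    ("<http://ex.org/b>", "<http://ex.org/type>", "<http://ex.org/Place>"),
]

# Stage 1: load *everything* into the staging triple store (the
# disk-I/O-heavy JenaTDB step on a real Freebase dump).
triple_store = {}
for s, p, o in dump:
    triple_store.setdefault(s, {}).setdefault(p, []).append(o)

# Stage 2: pre-process -- e.g. keep only entities of a given type,
# a stand-in for an LDPath-based entity filter.
def keep(entity):
    return "<http://ex.org/Person>" in entity.get("<http://ex.org/type>", [])

# Stage 3: index the surviving entities into the "Yard".
yard = {s: e for s, e in triple_store.items() if keep(e)}
print(len(yard))  # only the Person entity reaches the Yard
```

Stage 1 is where the bottleneck sits: it pays the full storage cost for every triple, even for entities that stage 2 later throws away.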

To begin with, site building (indexing) is not a task that you usually need to do very often, so in my honest opinion I don't know if it is worth having a distributed process for it; and after indexing, the current Yards seem to perform very well for searching. Also, with recent versions of Solr, or SolrCloud, it is possible to distribute the index.
>
> As Terrier is MapReduce based (i.e. Hadoop), it would be great to have a MapReduce-based RDF store; that way we could easily calculate, for example, real PageRank values on the Freebase data set using Mahout's PageRank implementation.
>
> Does anybody know of a good MapReduce-based RDF store? I've seen some people talking about HBase...
That would be very nice in my opinion, although I'm still not sure about two things: how a distributed triple store would work, and whether it would really solve the storage problem. So far, we have experimented with graph databases like Neo4j that provide RDF-store capabilities through the Blueprints Sail implementation [1]. TitanDB and OrientDB are examples of distributed graph databases that also have Blueprints implementations, but we haven't tried them yet.

Regarding the JenaTDB bottleneck problem, I have been working on a workaround for indexing the entities in a Yard without passing through the triple store: something like streaming indexing, from the dump directly to the Yard. It means you won't be able to do some kinds of pre-processing, like LDPath filtering or transformations, but if you don't need them, the indexing time is significantly reduced. I should have committed it today, but I'm currently having issues with my Maven version for building Stanbol; as soon as I solve them, I will commit. It would be nice if someone else could test it.
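A rough sketch of the streaming idea, assuming a dump that is sorted by subject (the sample triples are invented, and the dict is a stand-in for the SolrYard):

```python
# Rough sketch of "streaming indexing": read an N-Triples dump that is
# sorted by subject, group consecutive lines into one entity, and push
# each finished entity straight to the index, skipping the triple store.
# The dict-based "yard" and the sample dump are invented stand-ins for
# Stanbol's SolrYard and a real Freebase dump.
from itertools import groupby

dump = [
    "<http://ex.org/a> <http://ex.org/name> \"Entity A\" .",
    "<http://ex.org/a> <http://ex.org/type> <http://ex.org/Thing> .",
    "<http://ex.org/b> <http://ex.org/name> \"Entity B\" .",
]

yard = {}  # stand-in for the SolrYard

def subject(line):
    return line.split(" ", 1)[0]

# Because the dump is subject-sorted, a single pass with constant memory
# per entity is enough -- no JenaTDB staging area in between.
for subj, lines in groupby(dump, key=subject):
    representation = {}
    for line in lines:
        _, pred, obj = line.rstrip(" .").split(" ", 2)
        representation.setdefault(pred, []).append(obj)
    yard[subj] = representation  # "index" the finished entity

print(len(yard))  # two entities indexed in a single streaming pass
```

The trade-off matches what is described above: one cheap pass over the dump, but no triple store to run LDPath filters or transformations against.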

Regards,
Rafa

[1] - https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation
>
> of course this would require some work both in terrier and mahout, but then again for data sets like freebase this would make a lot of things faster/easier (if one has the cluster for it).
>
> happy to see comments on this!
>
> cheers,
> viktor
>