You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Claude Warren <cl...@xenei.com> on 2017/12/08 07:27:19 UTC

Finding differences between graphs (Was: Jena/Fuseki graph sync)

On Fri, Nov 24, 2017 at 12:19 PM, Laura Morales <la...@mail.com> wrote:

> > What about simply deleting the old graph and loading the triples of the
> > .nt file into the graph afterwards? I don't see any benefit of such a
> > "tool" - you could just write your own bash script for this if you need
> > this quite often.
>
> The advantage is with large graphs, such as wikidata. If I download their
> dumps once a week, it's much more efficient to only change a few triples
> instead of deleting the entire graph and recreating the whole TDB store.
>

Performing a diff between two graphs with blank nodes might be speed up
using bloom filters.

I have code that represents triples as bloom filters and I know that 9 byte
filters will work for very large graphs so you could probably get aways
with 8 bytes to make them fit in a standard integer size.

This is a multiple pass operation.

create a bloom filter for each node in graph A.  Call this list A

step through  graph B creating bloom filters for each triple. if the triple
in question has blank nodes only encode non blank nodes

If the bloom filter is not in List A it is new.

if the bloom filter is in list A then it may be new and a direct lookup in
graph A. if it is not found add it

If your filter list has a pointer to the triples that it represents
(remember there can be bloom filter collisions) then you can rapidly
determine if there is a match and you also have a good starting place to do
blank node comparisons to determine if the triples are equivalent.

If anyone is interested in trying this I have some triple/bloom filter code
in my github repository.

Claude

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: Finding differences between graphs (Was: Jena/Fuseki graph sync)

Posted by Claude Warren <cl...@xenei.com>.

https://github.com/Claudenw/BloomFilter

https://github.com/Claudenw/BloomGraph

Both are just playground code.  I would be happy to discuss any issues with
you.

Claude

On 8 Dec 2017 08:43, "Laura Morales" <la...@mail.com> wrote:

> > If anyone is interested in trying this I have some triple/bloom filter
> code
> > in my github repository.
>
> Link?
>

Re: Finding differences between graphs (Was: Jena/Fuseki graph sync)

Posted by Laura Morales <la...@mail.com>.

> If anyone is interested in trying this I have some triple/bloom filter code
> in my github repository.

Link?

Re: Finding differences between graphs (Was: Jena/Fuseki graph sync)

Posted by Andy Seaborne <an...@apache.org>.

Graph.isIsomorphicWith(Graph g);
Model.isIsomorphicWith(Model m);



On 09/12/17 06:08, Claude Warren wrote:
> I am not sure what digest you mean but the fuaeki graphs do have an
> isomorphism check so you could compare compare construct results.
> 
> Claude
> 
> On 9 Dec 2017 04:11, "Dan Davis" <da...@gmail.com> wrote:
> 
>> So, this is what I was asking about earlier.  With small graphs, e.g.
>> DESCRIBE <....>, the algorithms for graph isomorphism that support blank
>> nodes should be good.   rdflib includes an implementation, and I wish I
>> knew whether there is an implementation of that digest algorithm for Jena.
>>
>> On Fri, Dec 8, 2017 at 2:27 AM, Claude Warren <cl...@xenei.com> wrote:
>>
>>> On Fri, Nov 24, 2017 at 12:19 PM, Laura Morales <la...@mail.com>
>> wrote:
>>>
>>>>> What about simply deleting the old graph and loading the triples of
>> the
>>>>> .nt file into the graph afterwards? I don't see any benefit of such a
>>>>> "tool" - you could just write your own bash script for this if you
>> need
>>>>> this quite often.
>>>>
>>>> The advantage is with large graphs, such as wikidata. If I download
>> their
>>>> dumps once a week, it's much more efficient to only change a few
>> triples
>>>> instead of deleting the entire graph and recreating the whole TDB
>> store.
>>>>
>>>
>>>
>>> Performing a diff between two graphs with blank nodes might be speed up
>>> using bloom filters.
>>>
>>> I have code that represents triples as bloom filters and I know that 9
>> byte
>>> filters will work for very large graphs so you could probably get aways
>>> with 8 bytes to make them fit in a standard integer size.
>>>
>>> This is a multiple pass operation.
>>>
>>> create a bloom filter for each node in graph A.  Call this list A
>>>
>>> step through  graph B creating bloom filters for each triple. if the
>> triple
>>> in question has blank nodes only encode non blank nodes
>>>
>>> If the bloom filter is not in List A it is new.
>>>
>>> if the bloom filter is in list A then it may be new and a direct lookup
>> in
>>> graph A. if it is not found add it
>>>
>>> If your filter list has a pointer to the triples that it represents
>>> (remember there can be bloom filter collisions) then you can rapidly
>>> determine if there is a match and you also have a good starting place to
>> do
>>> blank node comparisons to determine if the triples are equivalent.
>>>
>>> If anyone is interested in trying this I have some triple/bloom filter
>> code
>>> in my github repository.
>>>
>>> Claude
>>>
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>>
>>
>

Re: Finding differences between graphs (Was: Jena/Fuseki graph sync)

Posted by Claude Warren <cl...@xenei.com>.

I am not sure what digest you mean but the fuaeki graphs do have an
isomorphism check so you could compare compare construct results.

Claude

On 9 Dec 2017 04:11, "Dan Davis" <da...@gmail.com> wrote:

> So, this is what I was asking about earlier.  With small graphs, e.g.
> DESCRIBE <....>, the algorithms for graph isomorphism that support blank
> nodes should be good.   rdflib includes an implementation, and I wish I
> knew whether there is an implementation of that digest algorithm for Jena.
>
> On Fri, Dec 8, 2017 at 2:27 AM, Claude Warren <cl...@xenei.com> wrote:
>
> > On Fri, Nov 24, 2017 at 12:19 PM, Laura Morales <la...@mail.com>
> wrote:
> >
> > > > What about simply deleting the old graph and loading the triples of
> the
> > > > .nt file into the graph afterwards? I don't see any benefit of such a
> > > > "tool" - you could just write your own bash script for this if you
> need
> > > > this quite often.
> > >
> > > The advantage is with large graphs, such as wikidata. If I download
> their
> > > dumps once a week, it's much more efficient to only change a few
> triples
> > > instead of deleting the entire graph and recreating the whole TDB
> store.
> > >
> >
> >
> > Performing a diff between two graphs with blank nodes might be speed up
> > using bloom filters.
> >
> > I have code that represents triples as bloom filters and I know that 9
> byte
> > filters will work for very large graphs so you could probably get aways
> > with 8 bytes to make them fit in a standard integer size.
> >
> > This is a multiple pass operation.
> >
> > create a bloom filter for each node in graph A.  Call this list A
> >
> > step through  graph B creating bloom filters for each triple. if the
> triple
> > in question has blank nodes only encode non blank nodes
> >
> > If the bloom filter is not in List A it is new.
> >
> > if the bloom filter is in list A then it may be new and a direct lookup
> in
> > graph A. if it is not found add it
> >
> > If your filter list has a pointer to the triples that it represents
> > (remember there can be bloom filter collisions) then you can rapidly
> > determine if there is a match and you also have a good starting place to
> do
> > blank node comparisons to determine if the triples are equivalent.
> >
> > If anyone is interested in trying this I have some triple/bloom filter
> code
> > in my github repository.
> >
> > Claude
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>

Re: Finding differences between graphs (Was: Jena/Fuseki graph sync)

Posted by Dan Davis <da...@gmail.com>.

So, this is what I was asking about earlier.  With small graphs, e.g.
DESCRIBE <....>, the algorithms for graph isomorphism that support blank
nodes should be good.   rdflib includes an implementation, and I wish I
knew whether there is an implementation of that digest algorithm for Jena.

On Fri, Dec 8, 2017 at 2:27 AM, Claude Warren <cl...@xenei.com> wrote:

> On Fri, Nov 24, 2017 at 12:19 PM, Laura Morales <la...@mail.com> wrote:
>
> > > What about simply deleting the old graph and loading the triples of the
> > > .nt file into the graph afterwards? I don't see any benefit of such a
> > > "tool" - you could just write your own bash script for this if you need
> > > this quite often.
> >
> > The advantage is with large graphs, such as wikidata. If I download their
> > dumps once a week, it's much more efficient to only change a few triples
> > instead of deleting the entire graph and recreating the whole TDB store.
> >
>
>
> Performing a diff between two graphs with blank nodes might be speed up
> using bloom filters.
>
> I have code that represents triples as bloom filters and I know that 9 byte
> filters will work for very large graphs so you could probably get aways
> with 8 bytes to make them fit in a standard integer size.
>
> This is a multiple pass operation.
>
> create a bloom filter for each node in graph A.  Call this list A
>
> step through  graph B creating bloom filters for each triple. if the triple
> in question has blank nodes only encode non blank nodes
>
> If the bloom filter is not in List A it is new.
>
> if the bloom filter is in list A then it may be new and a direct lookup in
> graph A. if it is not found add it
>
> If your filter list has a pointer to the triples that it represents
> (remember there can be bloom filter collisions) then you can rapidly
> determine if there is a match and you also have a good starting place to do
> blank node comparisons to determine if the triples are equivalent.
>
> If anyone is interested in trying this I have some triple/bloom filter code
> in my github repository.
>
> Claude
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>