
Posted to users@jena.apache.org by "yphu@zju.edu.cn" <yp...@zju.edu.cn> on 2018/12/07 14:45:37 UTC

sparql 1.4 billion triples

Dear Jena,
I have built a graph with 1.4 billion triples and stored it as a dataset in TDB through the Fuseki upload system.
Now, when I run SPARQL queries, they are very slow.

For example, the following SPARQL query takes 50 seconds in Fuseki.
How can I improve the speed?


Best wishes!


胡云苹 (Hu Yunping)
Institute of Cyber-Systems and Control (CSC), College of Control Science and Engineering, Zhejiang University
Yuquan Campus, 38 Zheda Road, Hangzhou 310027, Zhejiang Province, P.R. China
Email : yphu@zju.edu.cn;hyphyp28@163.com

Re: sparql 1.4 billion triples

Posted by Dick Murray <da...@gmail.com>.

Be very careful using vmtouch, especially if you call it with -dl (daemon
mode with the pages locked in memory), as you could very easily and quickly
kill a system. I've used this tool on cloud VMs to mitigate cycle times
(think DBAN, due to the public nature of the hardware). It's a fast way to
end up with an irked OS thrashing around.

Dick

On Sun, 16 Dec 2018 19:57 Siddhesh Rane <kingsid911@gmail.com> wrote:

> I'll be happy to document this. I think the FAQ would be a good place.
>
> I actually looked further into this and found that the vmtouch
> functionality is provided in the JDK itself.
> The java.nio.MappedByteBuffer#load method will bring file pages into
> memory [1]. It works similarly to vmtouch, i.e. by reading a byte from
> each page to cause a page fault and load that page into memory [2].
>
> [1]
> https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html#load--
>
> [2]
> http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/nio/MappedByteBuffer.java#l156

Re: sparql 1.4 billion triples

Posted by Siddhesh Rane <ki...@gmail.com>.

I'll be happy to document this. I think the FAQ would be a good place.

I actually looked further into this and found that the vmtouch
functionality is provided in the JDK itself.
The java.nio.MappedByteBuffer#load method will bring file pages into
memory [1]. It works similarly to vmtouch, i.e. by reading a byte from
each page to cause a page fault and load that page into memory [2].

[1]
https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html#load--

[2]
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/nio/MappedByteBuffer.java#l156
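
To make the MappedByteBuffer#load idea concrete, here is a minimal sketch, not a tested recipe for TDB. The class name PagePreloader and the 1 GiB chunk size are assumptions made for illustration. A single mapping cannot exceed Integer.MAX_VALUE bytes, so the sketch maps a large file in chunks. load() is documented as best effort only, and how long the pages stay cached after this process exits is up to the operating system.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PagePreloader {

    // Assumption for this sketch: 1 GiB chunks, which stays below the
    // Integer.MAX_VALUE limit on the size of a single mapping.
    private static final long CHUNK_BYTES = 1L << 30;

    /**
     * Maps the file read-only, chunk by chunk, and asks the OS to page each
     * chunk in. Returns the number of bytes for which a load was requested.
     */
    public static long preload(Path file) throws IOException {
        long mapped = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            while (mapped < size) {
                long length = Math.min(CHUNK_BYTES, size - mapped);
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, mapped, length);
                buffer.load(); // best effort: makes the content resident if possible
                mapped += length;
            }
        }
        return mapped;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PagePreloader <file>");
            return;
        }
        long bytes = preload(Paths.get(args[0]));
        System.out.println("Requested load of " + bytes + " bytes");
    }
}
```

It would be run once per index file before starting the query workload, for example "java PagePreloader <path to one database file>".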


On Sun, 16 Dec 2018, 6:59 pm ajs6f <ajs6f@apache.org> wrote:

> This seems to be a Linux-only technique that relies on installing and
> maintaining vmtouch, correct?
>
> It doesn't seem that we could support that as a general solution, but
> would you be interested in writing up the essentials for someplace in the
> Jena docs? I'll admit I'm not sure where it would best go, but it might be
> very helpful to users who can take advantage of it.
>
> ajs6f

Re: sparql 1.4 billion triples

Posted by ajs6f <aj...@apache.org>.

This seems to be a Linux-only technique that relies on installing and maintaining vmtouch, correct?

It doesn't seem that we could support that as a general solution, but would you be interested in writing up the essentials for someplace in the Jena docs? I'll admit I'm not sure where it would best go, but it might be very helpful to users who can take advantage of it.

ajs6f

> On Dec 16, 2018, at 6:11 AM, Siddhesh Rane <ki...@gmail.com> wrote:
> 
> An in-memory database has the following limitations:
>
> 1) Time to create the database. This is not a problem if you have a
> dedicated machine that runs 24/7, where you load the data once and the
> process never exits. But it is a huge waste of time if you only get
> hardware during certain time slots and have to load the data from
> scratch each time.
>
> 2) An in-memory database is all or nothing. If your dataset can't fit in
> RAM, you are out of luck. I tried using this, but many times it would go
> OOM. With vmtouch, you can load an index partially, up to as much free RAM
> as is available. Something is better than nothing.
>
> vmtouch is not doing anything magical. TDB already uses mmap. Left on
> its own, Linux will eventually bring most of the index into RAM. But think
> about the time it will take for that to happen. If one query takes 50
> seconds (I've seen it go to 500-1000s as well), then in 1 hour you would
> have run just 72 queries. If instead your speed was 1s/query, you would
> have executed 3600 queries, and that would bring more of the index into
> RAM so that future queries run fast as well. So it's also the rate of
> speedup that matters.
> With vmtouch, you run vmtouch at the beginning, which gives you a fast
> head start, and then it's your program maintaining the cache.
> 
> Regards,
> Siddhesh

Re: sparql 1.4 billion triples

Posted by Siddhesh Rane <ki...@gmail.com>.

An in-memory database has the following limitations:

1) Time to create the database. This is not a problem if you have a
dedicated machine that runs 24/7, where you load the data once and the
process never exits. But it is a huge waste of time if you only get
hardware during certain time slots and have to load the data from
scratch each time.

2) An in-memory database is all or nothing. If your dataset can't fit in
RAM, you are out of luck. I tried using this, but many times it would go
OOM. With vmtouch, you can load an index partially, up to as much free RAM
as is available. Something is better than nothing.

vmtouch is not doing anything magical. TDB already uses mmap. Left on
its own, Linux will eventually bring most of the index into RAM. But think
about the time it will take for that to happen. If one query takes 50
seconds (I've seen it go to 500-1000s as well), then in 1 hour you would
have run just 72 queries. If instead your speed was 1s/query, you would
have executed 3600 queries, and that would bring more of the index into
RAM so that future queries run fast as well. So it's also the rate of
speedup that matters.
With vmtouch, you run vmtouch at the beginning, which gives you a fast
head start, and then it's your program maintaining the cache.
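
The "load partially, up to the free RAM" point can be sketched without vmtouch, using only memory-mapped files from the JDK. This is an illustration under assumptions, not how vmtouch or TDB work internally: the class name BudgetedPreloader and the byte budget parameter are hypothetical, the database directory is assumed to be flat, and files are taken in name order, which is an arbitrary choice rather than a ranking of which index matters most.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BudgetedPreloader {

    // Assumption: 1 GiB chunks, below the Integer.MAX_VALUE mapping limit.
    private static final long CHUNK_BYTES = 1L << 30;

    /**
     * Requests a page-in of the regular files directly under dir, in name
     * order, and stops once budgetBytes have been requested.
     * Returns the number of bytes requested.
     */
    public static long preloadUpTo(Path dir, long budgetBytes) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile)
                           .sorted()
                           .collect(Collectors.toList());
        }
        long loaded = 0;
        for (Path file : files) {
            if (loaded >= budgetBytes) {
                break; // budget used up: leave the remaining files to the OS
            }
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = channel.size();
                long offset = 0;
                while (offset < size && loaded < budgetBytes) {
                    long length = Math.min(Math.min(CHUNK_BYTES, size - offset),
                                           budgetBytes - loaded);
                    // load() is best effort; the OS may still evict pages later.
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length).load();
                    offset += length;
                    loaded += length;
                }
            }
        }
        return loaded;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BudgetedPreloader <database dir> <budget bytes>");
            return;
        }
        long bytes = preloadUpTo(Paths.get(args[0]), Long.parseLong(args[1]));
        System.out.println("Requested load of " + bytes + " bytes");
    }
}
```

Choosing the budget somewhat below the free RAM reported by the OS leaves headroom for the JVM heap and other processes.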

Regards,
Siddhesh

On Sat, 15 Dec 2018, 9:15 pm ajs6f <ajs6f@apache.org> wrote:

> What is the advantage to doing that as opposed to using Jena's built-in
> in-memory dataset?
>
> ajs6f

Re: sparql 1.4 billion triples

Posted by ajs6f <aj...@apache.org>.

What is the advantage to doing that as opposed to using Jena's built-in in-memory dataset?

ajs6f

> On Dec 15, 2018, at 3:04 AM, Siddhesh Rane <ki...@gmail.com> wrote:
> 
> Bring the entire database into RAM.
> Use "vmtouch -t <database location>" (the -t flag touches the pages into
> memory; without it, vmtouch only reports how much is already resident).
> Get vmtouch from https://hoytech.com/vmtouch/
> 
> I have used Jena with 150M triples, and my performance findings are
> documented at
> https://lists.apache.org/thread.html/254968eee3cd04370eafa2f9cc586e238f8a7034cf9ab4cbde3dc8e9@%3Cusers.jena.apache.org%3E
> 
> Regards,
> Siddhesh

Re: sparql 1.4 billion triples

Posted by Siddhesh Rane <ki...@gmail.com>.

Bring the entire database into RAM.
Use "vmtouch -t <database location>" (the -t flag touches the pages into
memory; without it, vmtouch only reports how much is already resident).
Get vmtouch from https://hoytech.com/vmtouch/

I have used Jena with 150M triples, and my performance findings are
documented at
https://lists.apache.org/thread.html/254968eee3cd04370eafa2f9cc586e238f8a7034cf9ab4cbde3dc8e9@%3Cusers.jena.apache.org%3E

Regards,
Siddhesh

On Fri, 7 Dec 2018, 8:23 pm yphu@zju.edu.cn <yphu@zju.edu.cn> wrote:

> Dear Jena,
> I have built a graph with 1.4 billion triples and stored it as a dataset
> in TDB through the Fuseki upload system.
> Now, when I run SPARQL queries, they are very slow.
>
> For example, the following SPARQL query takes 50 seconds in Fuseki.
> How can I improve the speed?

Re: sparql 1.4 billion triples

Posted by Jean-Marc Vanel <je...@gmail.com>.

yphu,
you didn't share your query.
Maybe the query has questionable features.

Did you try a simple but useful query, like getting the first 10
foaf:Person instances?
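
As an illustration of such a baseline test, the sketch below sends a "first 10 foaf:Person" query over the SPARQL protocol and prints the elapsed time. It needs Java 11 or later. The endpoint URL (host, port, and the dataset name "ds") and the class name FirstTenPersons are assumptions, and whether the dataset contains any foaf:Person resources at all is unknown, so treat the query as a placeholder for any cheap, selective pattern.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FirstTenPersons {

    // Hypothetical endpoint: adjust host, port, and dataset name to your Fuseki setup.
    public static final String ENDPOINT = "http://localhost:3030/ds/sparql";

    // A cheap query: one triple pattern with a fixed predicate and object, limited to 10 rows.
    public static final String QUERY =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
          + "SELECT ?person WHERE { ?person a foaf:Person } LIMIT 10";

    /** Builds a SPARQL protocol request that POSTs the query text directly. */
    public static HttpRequest buildRequest(String endpoint, String query) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/sparql-query")
                .header("Accept", "application/sparql-results+json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long start = System.nanoTime();
        HttpResponse<String> response = client.send(
                buildRequest(ENDPOINT, QUERY), HttpResponse.BodyHandlers.ofString());
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("HTTP " + response.statusCode() + " in " + millis + " ms");
        System.out.println(response.body());
    }
}
```

If a query this simple is also slow, the bottleneck is more likely cold disk caches or hardware than the shape of the original query.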


On Fri, 7 Dec 2018 at 15:53, yphu@zju.edu.cn <yp...@zju.edu.cn> wrote:

> Dear Jena,
> I have built a graph with 1.4 billion triples and stored it as a dataset
> in TDB through the Fuseki upload system.
> Now, when I run SPARQL queries, they are very slow.
>
> For example, the following SPARQL query takes 50 seconds in Fuseki.
> How can I improve the speed?

-- 
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me#subject
<http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
 Chroniques jardin
<http://semantic-forms.cc:1952/backlinks?q=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>