Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/04/15 10:55:57 UTC

Very slow tdbloader2 insertion

I've made a dataset with about 10M N-Quads, 5-6 graphs, stored as a single .nq file.
I've launched tdbloader2 to create a new dataset from this file, but I see a steady and remarkable slowdown as more quads are added to the dataset. Here are some INFO lines from the processing:

INFO  Add: 50,000 Data (Batch: 12,983 / Avg: 12,983)
INFO  Add: 500,000 Data (Batch: 77,639 / Avg: 51,743)
INFO  Add: 1,000,000 Data (Batch: 81,833 / Avg: 64,926)
INFO  Add: 2,000,000 Data (Batch: 84,745 / Avg: 72,745)
INFO  Add: 3,000,000 Data (Batch: 79,365 / Avg: 76,591)
INFO  Add: 4,000,000 Data (Batch: 91,575 / Avg: 77,605)
INFO  Add: 5,000,000 Data (Batch: 3,582 / Avg: 49,010)
INFO  Add: 6,000,000 Data (Batch: 3,915 / Avg: 22,031)
INFO  Add: 7,000,000 Data (Batch: 11,887 / Avg: 16,724)
INFO  Add: 8,000,000 Data (Batch: 4,121 / Avg: 15,455)
INFO  Add: 9,000,000 Data (Batch: 24,038 / Avg: 14,804)

I wonder if this is normal or if there's anything I can do to speed this up.

Re: Very slow tdbloader2 insertion

Posted by Andy Seaborne <an...@apache.org>.

On 17/04/17 23:07, Laura Morales wrote:
>> tdbloader2 builds b+trees from bottom to top, given sorted input. As
>> such, blocks are streamed to disk, which is disk-efficient.
>>
>> It is a series of Java programs scripted together by a shell script.
>>
>> tdbloader is pure Java. It builds the b+trees by inserting, which for
>> some indexes is not optimal because it causes random inserts leading to
>> random I/O, which is bad for disk performance.
>>
>> Andy
>
>
> But why is tdbloader better for smaller datasets, whereas tdbloader2 is better for very large datasets ("100M+ triples")? Wouldn't the approach of tdbloader2 be superior in all cases?
>

Try them both and see!

tdbloader2 has high overhead.

On small datasets (less than 100M triples), an index fits in the OS disk cache 
so tdbloader I/O is effectively "in-memory" and the randomness is not a 
problem.  When it spills, it slows down quite markedly.

tdbloader2 is a slower algorithm but does not produce this "fall-off 
effect" on index writing.

     Andy



Re: Very slow tdbloader2 insertion

Posted by Laura Morales <la...@mail.com>.
> tdbloader2 builds b+trees from bottom to top, given sorted input. As
> such, blocks are streamed to disk, which is disk-efficient.
>
> It is a series of Java programs scripted together by a shell script.
>
> tdbloader is pure Java. It builds the b+trees by inserting, which for
> some indexes is not optimal because it causes random inserts leading to
> random I/O, which is bad for disk performance.
> 
> Andy


But why is tdbloader better for smaller datasets, whereas tdbloader2 is better for very large datasets ("100M+ triples")? Wouldn't the approach of tdbloader2 be superior in all cases?

Re: Very slow tdbloader2 insertion

Posted by Andy Seaborne <an...@apache.org>.
tdbloader2 builds b+trees from bottom to top, given sorted input.  As 
such, blocks are streamed to disk, which is disk-efficient.

It is a series of Java programs scripted together by a shell script.

tdbloader is pure Java.  It builds the b+trees by inserting, which for 
some indexes is not optimal because it causes random inserts leading to 
random I/O, which is bad for disk performance.
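
As a toy illustration of the two strategies (invented code, not Jena's 
actual implementation; keys, block size and class names are made up):

import java.util.*;

public class LoaderSketch {
    static final int BLOCK_SIZE = 4;   // entries per leaf block (toy value)

    // tdbloader2-style: input already sorted, so leaves fill left to right
    // and each block is written exactly once - sequential I/O.
    static List<long[]> bottomUpLeaves(long[] sorted) {
        List<long[]> leaves = new ArrayList<>();
        for (int i = 0; i < sorted.length; i += BLOCK_SIZE) {
            leaves.add(Arrays.copyOfRange(sorted, i,
                    Math.min(i + BLOCK_SIZE, sorted.length)));
        }
        return leaves;   // parent levels are built the same way, bottom up
    }

    // tdbloader-style: insert in arrival order; for an index whose key
    // order does not match arrival order, each insert can land in a
    // different block - random I/O once the tree outgrows the OS cache.
    static TreeMap<Long, Long> byInsertion(long[] arrival) {
        TreeMap<Long, Long> tree = new TreeMap<>();
        for (long k : arrival) {
            tree.put(k, k);
        }
        return tree;
    }

    public static void main(String[] args) {
        long[] arrival = {42, 7, 99, 3, 58, 21, 84, 16};
        long[] sorted = arrival.clone();
        Arrays.sort(sorted);             // tdbloader2's external sort step
        System.out.println(bottomUpLeaves(sorted).size()
                + " leaf blocks, each written once");
        System.out.println(byInsertion(arrival).size()
                + " keys inserted, blocks revisited in random order");
    }
}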

     Andy



On 15/04/17 22:13, A. Soroka wrote:
> To start with, tdbloader2 uses the assumption that the tuples are sorted (actually, it sorts them, then uses that assumption) as described in this old blog post of Andy's:
>
> https://seaborne.blogspot.com/2010/12/repacking-btrees.html
>
> That's one reason that you only want to use tdbloader2 to start from scratch. Andy, of course, can say more.
>
> ---
> A. Soroka
> The University of Virginia Library
>
>> On Apr 15, 2017, at 2:58 PM, Laura Morales <la...@mail.com> wrote:
>>
>>> Use tdbloader for 10M quads.
>>
>> I wonder how tdbloader is technically different from tdbloader2. What makes tdbloader more suited for small/medium datasets and tdbloader2 more suited for very large datasets? Do they implement different insertion algorithms?
>

Re: Very slow tdbloader2 insertion

Posted by "A. Soroka" <aj...@virginia.edu>.
To start with, tdbloader2 uses the assumption that the tuples are sorted (actually, it sorts them, then uses that assumption) as described in this old blog post of Andy's:

https://seaborne.blogspot.com/2010/12/repacking-btrees.html

That's one reason that you only want to use tdbloader2 to start from scratch. Andy, of course, can say more.
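
The sort itself is the interesting part: the dataset is bigger than RAM, 
so it has to be an external sort (sorted runs spilled to disk, then a 
k-way merge), which is all sequential I/O. As far as I know the real 
script delegates this to the system sort program; the sketch below is 
just the idea, with invented file names:

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSortSketch {
    // One sorted run = one temp file of sorted lines.
    static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run-", ".txt");
        Files.write(run, chunk);
        chunk.clear();
        return run;
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("tuples.txt");   // hypothetical input, one key per line
        final int RUN_SIZE = 1_000_000;         // lines sorted in memory at a time

        // Pass 1: sort fixed-size chunks in memory, spill each as a sorted run.
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            for (String line; (line = in.readLine()) != null; ) {
                chunk.add(line);
                if (chunk.size() == RUN_SIZE) {
                    runs.add(writeRun(chunk));
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(writeRun(chunk));
            }
        }

        // Pass 2: k-way merge of the runs; reads and writes stay sequential.
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<String[]> pq =             // element: {currentLine, readerIndex}
                new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            readers.add(r);
            String first = r.readLine();
            if (first != null) {
                pq.add(new String[]{ first, String.valueOf(readers.size() - 1) });
            }
        }
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("sorted.txt")))) {
            while (!pq.isEmpty()) {
                String[] head = pq.poll();
                out.println(head[0]);
                String next = readers.get(Integer.parseInt(head[1])).readLine();
                if (next != null) {
                    pq.add(new String[]{ next, head[1] });
                }
            }
        }
        for (BufferedReader r : readers) r.close();
    }
}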

---
A. Soroka
The University of Virginia Library

> On Apr 15, 2017, at 2:58 PM, Laura Morales <la...@mail.com> wrote:
> 
>> Use tdbloader for 10M quads.
> 
> I wonder how tdbloader is technically different from tdbloader2. What makes tdbloader more suited for small/medium datasets and tdbloader2 more suited for very large datasets? Do they implement different insertion algorithms?


Re: Very slow tdbloader2 insertion

Posted by Laura Morales <la...@mail.com>.
> Use tdbloader for 10M quads.

I wonder how tdbloader is technically different from tdbloader2. What makes tdbloader more suited for small/medium datasets and tdbloader2 more suited for very large datasets? Do they implement different insertion algorithms?

Re: Very slow tdbloader2 insertion

Posted by Andy Seaborne <an...@apache.org>.
Use tdbloader for 10M quads.
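(Something like "tdbloader --loc=DB data.nq" - directory and file names 
made up - should go through 10M quads without the fall-off you are seeing.)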

As to why the load stage of tdbloader2 drops off, we'd need to know more 
about the environment you are running in.

What is the machine? The disk?
How much RAM does the machine have?
Is there anything else running on the machine?
Have you set the heap size or taken the defaults?
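
(On the last point: if I remember the scripts correctly, they read the 
JVM_ARGS environment variable, e.g. JVM_ARGS="-Xmx2G" tdbloader ..., and 
fall back to a fairly small default heap when it is unset.)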

	Andy

On 15/04/17 11:55, Laura Morales wrote:
> I've made a dataset with about 10M N-Quads, 5-6 graphs, stored as a single .nq file.
> I've launched tdbloader2 to create a new dataset from this file, but I see a steady and remarkable slowdown as more quads are added to the dataset. Here are some INFO lines from the processing:
>
> INFO  Add: 50,000 Data (Batch: 12,983 / Avg: 12,983)
> INFO  Add: 500,000 Data (Batch: 77,639 / Avg: 51,743)
> INFO  Add: 1,000,000 Data (Batch: 81,833 / Avg: 64,926)
> INFO  Add: 2,000,000 Data (Batch: 84,745 / Avg: 72,745)
> INFO  Add: 3,000,000 Data (Batch: 79,365 / Avg: 76,591)
> INFO  Add: 4,000,000 Data (Batch: 91,575 / Avg: 77,605)
> INFO  Add: 5,000,000 Data (Batch: 3,582 / Avg: 49,010)
> INFO  Add: 6,000,000 Data (Batch: 3,915 / Avg: 22,031)
> INFO  Add: 7,000,000 Data (Batch: 11,887 / Avg: 16,724)
> INFO  Add: 8,000,000 Data (Batch: 4,121 / Avg: 15,455)
> INFO  Add: 9,000,000 Data (Batch: 24,038 / Avg: 14,804)
>
> I wonder if this is normal or if there's anything I can do to speed this up.
>