Posted to users@jena.apache.org by Cristóbal Miranda <cr...@gmail.com> on 2021/09/12 00:39:56 UTC

Faster TDB2 build?

Hi,

I'm running tdb2.tdbloader on Wikidata, but it's
taking too long: it's now on day 11 and still indexing,
whereas tdbloader2 (for TDB) didn't take as long for me.
I was wondering if something could be done to allow
more RAM to be used in the build phase in order to make it
faster, for example by passing a memory budget parameter to
the loader. I'm not sure exactly how the extra RAM would be
used, but I was thinking that if more B+tree blocks
were kept in RAM this processing would be faster, for
example by keeping the 2 upper levels of the tree in primary
memory, or even everything, if the given budget allowed it.

What would it take to implement such a feature? Maybe in a
tdb2.tdbloader2? I was looking at the code for a way to do
something but couldn't find an easy modification to achieve this.

Re: Faster TDB2 build?

Posted by Andy Seaborne <an...@apache.org>.

On 14/09/2021 17:26, Cristóbal Miranda wrote:
>>
>> tdb2.tdbloader has a number of loading algorithms - which one are you
>> using?
> 
> 
> The default one, phased.
> 
>> How big is the machine (RAM size, heap size)?
> 
> 
> RAM size: 736G
> For heap size do you mean Xmx? if that is the case it is 60G,
> but I see that no more than 6GB are being used. However, I see
> almost 60GB of swap memory being used.

It doesn't need a 60G heap. 8G is probably enough.

It should not need to swap, but whether the swap figure includes mapped
files is unclear. Different machines seem to report this in different ways.

What is causing the slowness is I/O saturation, and it's the bottom of
the trees, where blocks are used infrequently.

Presumably the CPU load is not very high?

> 
>> Do you know how the SSD is connected? SATA? NVMe?
> 
> 
> I don't know that; I could ask someone if necessary.
> 
>> The tops of B+trees currently being worked on should naturally end up
>> cached from the filing system in the OS filing system cache in RAM. As
>> mapped byte buffers it is as fast, or faster, than heap RAM.
> 
> 
> What is being cached? 

Areas of a file - blocks.

A block is 8k, and the trees are roughly 200-way B+Trees for triples.
The key is 24 or 32 bytes, with no value.

Operations on the B+trees happen directly on blocks.
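
Back-of-the-envelope, from those figures (arithmetic only; the 4B
triple count is roughly where your load is now):

// Back-of-envelope arithmetic only; fan-out and block size are the
// figures above, the triple count is roughly where this load is now.
public class TreeSize {
    public static void main(String[] args) {
        final long fanout = 200;              // ~200-way B+Tree
        final long blockSize = 8 * 1024;      // 8k blocks
        final long entries = 4_000_000_000L;  // ~4B triples

        long levels = 0, capacity = 1;
        while (capacity < entries) { capacity *= fanout; levels++; }
        System.out.println("Tree height: ~" + levels + " levels");  // 5

        long topTwo = 1 + fanout;             // root + its children
        long topThree = topTwo + fanout * fanout;
        System.out.printf("Top 2 levels: ~%d blocks, ~%.1f MB%n",
                topTwo, topTwo * blockSize / 1e6);      // ~1.6 MB
        System.out.printf("Top 3 levels: ~%d blocks, ~%.1f MB%n",
                topThree, topThree * blockSize / 1e6);  // ~330 MB
    }
}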

> The nodes on the current branch of the tree, or complete upper levels?
> I'm thinking that if blocks from upper levels are retrieved from disk
> repeatedly between insertions (with splits to do as well), performance
> can degrade a lot, especially as the amount of data gets big, because
> too much random access would have to be done. I see that ids are used
> to find the blocks in the file; could it be possible, for example, to
> have a HashMap mapping ids to blocks in BlockAccessMapped, retrieve
> from the HashMap when the id corresponds to an upper-level block, and
> sync when everything is done? The idea is that those upper levels will
> only occupy some MBs, which is not that expensive to keep in memory,
> and it would mean fewer disk accesses, trading random accesses for
> more sequential writes of the lower-level blocks. This, of course,
> would only happen when building the index.
> Do you think that something like this could improve build performance?

Maybe, but I think the better bet is tdbloader2, which causes the I/O
to be ordered because it builds the trees bottom-up, in sorted order.

For the majority of cases this sorting cost isn't worth it with SSDs 
because there is no large seek time on random I/O.

But at Wikidata scale, the I/O bandwidth gets used up. NVMe/PCIe SSDs
are better than SATA-connected ones for this.
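
The general shape of that bottom-up technique, as a sketch (this is
illustrative Java only, not the actual tdbloader2 code, which
external-sorts the tuples and writes the real on-disk block format):

import java.util.ArrayList;
import java.util.List;

// Sketch of a bottom-up B+Tree build over pre-sorted keys -- the general
// technique, not the actual tdbloader2 code. Every node is emitted in
// order, so all index writes are sequential appends, never random updates.
public class BottomUpBuild {
    static final int FANOUT = 200;  // ~200-way, as above

    // Pack one level into FANOUT-sized nodes; return the separator keys
    // that the parent level will index.
    static List<Long> packLevel(List<Long> keys) {
        List<Long> separators = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += FANOUT) {
            // A real loader would serialize this node and append it to
            // the index file here -- one sequential write.
            separators.add(keys.get(i));
        }
        return separators;
    }

    public static void main(String[] args) {
        List<Long> sorted = new ArrayList<>();
        for (long k = 0; k < 1_000_000; k++) sorted.add(k);  // sorted input

        List<Long> level = sorted;
        int height = 0;
        while (level.size() > 1) {   // build upwards until a single root
            level = packLevel(level);
            height++;
        }
        System.out.println("Tree height: " + height);  // 3 for 1M keys
    }
}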

> 
> Related to this, how many children can a block have? 2048, 1024?

About 200.

> 
> 
> 
>> I wonder if we can create Wikidata databases once and then publish the
>> database
> 
> 
> That would be nice, but as you say it can be troublesome, especially
> trying to keep a version that is not too old compared to their latest
> dump.
> 
> 
> 
> On Mon, 13 Sept 2021 at 06:44, Andy Seaborne <an...@apache.org> wrote:
> 
>> Hi there,
>>
>> Thanks for the information and experience report.  Always good to hear
>> what happens in a variety of situations.
>>
>> A few details:
>>
>> tdb2.tdbloader has a number of loading algorithms - which one are you
>> using? While they are different parameterizations of a common algorithm,
>> they have different characteristics. (The fastest - the parallel loader -
>> is not the best at large scale.)
>>
>> What's the hardware being used?
>>     How big is the machine (RAM size, heap size)?
>>     Do you know how the SSD is connected? SATA? NVMe?
>>
>>
>> It should be possible to port tdbloader2 to TDB2.  tdbloader2 is
>> fundamentally different to the other loaders. For the majority of use
>> cases, its advantages don't show up with an SSD (it originates from the
>> disk era!). But Wikidata isn't one of those majority cases.
>>
>> The tops of B+trees currently being worked on should naturally end up
>> cached from the filing system in the OS filing system cache in RAM. As
>> mapped byte buffers it is as fast, or faster, than heap RAM.
>>
>> Related thought:
>>
>> I wonder if we can create Wikidata databases once and then publish the
>> database. A database can be published as a compressed zip file of the
>> directory, and the compression ratio is quite high. Even so, working
>> with large files is still going to be non-trivial and we'd need
>> somewhere to put them that can also supply the bandwidth.
>>
>> (Also - HDT maybe - don't know how that performs on read at this scale)
>>
>>       Andy
>>
>> On 12/09/2021 20:12, Cristóbal Miranda wrote:
>>> SSD. The first phase ran at 50-90k triples per second until 3B triples,
>>> where it started going down from 50k to 20k per second (it took 3 days).
>>> The SPO => SPO->POS, SPO->OSP phase ran at 25-50k per second
>>> until 1B, where it went from 25k down to 4k triples per second;
>>> it's currently at 3.7B triples.
>>>
>>>
>>>
>>> On Sun, 12 Sept 2021 at 04:59, Laura Morales <la...@mail.com> wrote:
>>>
>>>> Just a personal curiosity... are you building it on an SSD or HDD? What
>>>> is your "triples loaded per second" rate?
>>>>
>>>>
>>>>> Sent: Sunday, September 12, 2021 at 2:39 AM
>>>>> From: "Cristóbal Miranda" <cr...@gmail.com>
>>>>> To: users@jena.apache.org
>>>>> Subject: Faster TDB2 build?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running tdb2.tdbloader on Wikidata, but it's
>>>>> taking too long: it's now on day 11 and still indexing,
>>>>> whereas tdbloader2 (for TDB) didn't take as long for me.
>>>>> I was wondering if something could be done to allow
>>>>> more RAM to be used in the build phase in order to make it
>>>>> faster, for example by passing a memory budget parameter to
>>>>> the loader. I'm not sure exactly how the extra RAM would be
>>>>> used, but I was thinking that if more B+tree blocks
>>>>> were kept in RAM this processing would be faster, for
>>>>> example by keeping the 2 upper levels of the tree in primary
>>>>> memory, or even everything, if the given budget allowed it.
>>>>>
>>>>> What would it take to implement such a feature? Maybe in a
>>>>> tdb2.tdbloader2? I was looking at the code for a way to do
>>>>> something but couldn't find an easy modification to achieve this.
>>>>>
>>>>
>>>
>>
> 

Re: Faster TDB2 build?

Posted by Cristóbal Miranda <cr...@gmail.com>.
>
> tdb2.tdbloader has a number of loading algorithms - which one are you
> using?


The default one, phased.

> How big is the machine (RAM size, heap size)?


RAM size: 736G
For heap size, do you mean Xmx? If that is the case, it is 60G,
but I see that no more than 6GB is being used. However, I see
almost 60GB of swap being used.

> Do you know how the SSD is connected? SATA? NVMe?


I don't know that; I could ask someone if necessary.

> The tops of B+trees currently being worked on should naturally end up
> cached from the filing system in the OS filing system cache in RAM. As
> mapped byte buffers it is as fast, or faster, than heap RAM.


What is being cached? The nodes on the current branch of the tree, or
complete upper levels? I'm thinking that if blocks from upper levels
are retrieved from disk repeatedly between insertions (with splits to
do as well), performance can degrade a lot, especially as the amount of
data gets big, because too much random access would have to be done.
I see that ids are used to find the blocks in the file; could it be
possible, for example, to have a HashMap mapping ids to blocks in
BlockAccessMapped, retrieve from the HashMap when the id corresponds to
an upper-level block, and sync when everything is done? The idea is
that those upper levels will only occupy some MBs, which is not that
expensive to keep in memory, and it would mean fewer disk accesses,
trading random accesses for more sequential writes of the lower-level
blocks. This, of course, would only happen when building the index.
Do you think that something like this could improve build performance?
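
For concreteness, roughly the shape of what I mean, in plain Java.
This is a hypothetical sketch, not the real BlockAccessMapped API, and
the upper-level test in particular is just a placeholder:

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: not the real BlockAccessMapped API.
// Upper-level blocks live in a HashMap for the whole build; everything
// else goes to disk as now; sync() writes the cached blocks back once
// at the end.
class UpperLevelBlockCache {
    private final Map<Long, ByteBuffer> cache = new HashMap<>();

    // Placeholder policy: real code would need the block's tree level.
    private boolean isUpperLevel(long blockId) {
        return blockId < 50_000;
    }

    ByteBuffer read(long blockId) {
        if (isUpperLevel(blockId))
            return cache.computeIfAbsent(blockId, this::readFromDisk);
        return readFromDisk(blockId);   // lower levels: disk as usual
    }

    void write(long blockId, ByteBuffer block) {
        if (isUpperLevel(blockId))
            cache.put(blockId, block);  // defer the disk write
        else
            writeToDisk(blockId, block);
    }

    void sync() {                       // called once the build is done
        cache.forEach(this::writeToDisk);
        cache.clear();
    }

    private ByteBuffer readFromDisk(long id) { return ByteBuffer.allocate(8 * 1024); }
    private void writeToDisk(long id, ByteBuffer b) { /* real file I/O elided */ }
}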

Related to this, how many children can a block have? 2048, 1024?



> I wonder if we can create Wikidata databases once and then publish the
> database


That would be nice, but as you say it can be troublesome, especially
trying to keep a version that is not too old compared to their latest
dump.



On Mon, 13 Sept 2021 at 06:44, Andy Seaborne <an...@apache.org> wrote:

> Hi there,
>
> Thanks for the information and experience report.  Always good to hear
> what happens in a variety of situations.
>
> A few details:
>
> tdb2.tdbloader has a number of loading algorithms - which one are you
> using? While they are different parameterizations of a common algorithm,
> they have different characteristics. (The fastest - the parallel loader -
> is not the best at large scale.)
>
> What's the hardware being used?
>    How big is the machine (RAM size, heap size)?
>    Do you know how the SSD is connected? SATA? NVMe?
>
>
> It should be possible to port tdbloader2 to TDB2.  tdbloader2 is
> fundamentally different to the other loaders. For the majority of use
> cases, its advantages don't show up with an SSD (it originates from the
> disk era!). But Wikidata isn't one of those majority cases.
>
> The tops of B+trees currently being worked on should naturally end up
> cached from the filing system in the OS filing system cache in RAM. As
> mapped byte buffers it is as fast, or faster, than heap RAM.
>
> Related thought:
>
> I wonder if we can create Wikidata databases once and then publish the
> database. A database can be published as a compressed zip file of the
> directory, and the compression ratio is quite high. Even so, working
> with large files is still going to be non-trivial and we'd need
> somewhere to put them that can also supply the bandwidth.
>
> (Also - HDT maybe - don't know how that performs on read at this scale)
>
>      Andy
>
> On 12/09/2021 20:12, Cristóbal Miranda wrote:
> > SSD. The first phase ran at 50-90k triples per second until 3B triples,
> > where it started going down from 50k to 20k per second (it took 3 days).
> > The SPO => SPO->POS, SPO->OSP phase ran at 25-50k per second
> > until 1B, where it went from 25k down to 4k triples per second;
> > it's currently at 3.7B triples.
> >
> >
> >
> > On Sun, 12 Sept 2021 at 04:59, Laura Morales <la...@mail.com> wrote:
> >
> >> Just a personal curiosity... are you building it on an SSD or HDD? What
> >> is your "triples loaded per second" rate?
> >>
> >>
> >>> Sent: Sunday, September 12, 2021 at 2:39 AM
> >>> From: "Cristóbal Miranda" <cr...@gmail.com>
> >>> To: users@jena.apache.org
> >>> Subject: Faster TDB2 build?
> >>>
> >>> Hi,
> >>>
> >>> I'm running tdb2.tdbloader on Wikidata, but it's
> >>> taking too long: it's now on day 11 and still indexing,
> >>> whereas tdbloader2 (for TDB) didn't take as long for me.
> >>> I was wondering if something could be done to allow
> >>> more RAM to be used in the build phase in order to make it
> >>> faster, for example by passing a memory budget parameter to
> >>> the loader. I'm not sure exactly how the extra RAM would be
> >>> used, but I was thinking that if more B+tree blocks
> >>> were kept in RAM this processing would be faster, for
> >>> example by keeping the 2 upper levels of the tree in primary
> >>> memory, or even everything, if the given budget allowed it.
> >>>
> >>> What would it take to implement such a feature? Maybe in a
> >>> tdb2.tdbloader2? I was looking at the code for a way to do
> >>> something but couldn't find an easy modification to achieve this.
> >>>
> >>
> >
>

Re: Faster TDB2 build?

Posted by Andy Seaborne <an...@apache.org>.
Hi there,

Thanks for the information and experience report.  Always good to hear 
what happens in a variety of situations.

A few details:

tdb2.tdbloader has a number of loading algorithms - which one are you
using? While they are different parameterizations of a common algorithm,
they have different characteristics. (The fastest - the parallel loader -
is not the best at large scale.)

What's the hardware being used?
   How big is the machine (RAM size, heap size)?
   Do you know how the SSD is connected? SATA? NVMe?


It should be possible to port tdbloader2 to TDB2.  tdbloader2 is
fundamentally different to the other loaders. For the majority of use
cases, its advantages don't show up with an SSD (it originates from the
disk era!). But Wikidata isn't one of those majority cases.

The tops of B+trees currently being worked on should naturally end up 
cached from the filing system in the OS filing system cache in RAM. As 
mapped byte buffers it is as fast, or faster, than heap RAM.
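
(The mechanism is ordinary java.nio memory mapping. A minimal
illustration of the JDK API - not Jena code:)

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Plain JDK memory mapping (java.nio), the mechanism underneath mapped
// block access. Reads hit the OS page cache directly; nothing is copied
// onto the Java heap, so Xmx barely matters for this part.
public class MappedRead {
    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);   // assumes a non-empty file
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long len = Math.min(ch.size(), 8 * 1024);  // map the first block
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
            // First access may page-fault the data in from disk; repeated
            // access is served from RAM by the OS file cache.
            System.out.println("first byte = " + buf.get(0));
        }
    }
}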

Related thought:

I wonder if we can create Wikidata databases once and then publish the
database. A database can be published as a compressed zip file of the
directory, and the compression ratio is quite high. Even so, working
with large files is still going to be non-trivial and we'd need
somewhere to put them that can also supply the bandwidth.

(Also - HDT maybe - don't know how that performs on read at this scale)

     Andy

On 12/09/2021 20:12, Cristóbal Miranda wrote:
> SSD. The first phase ran at 50-90k triples per second until 3B triples,
> where it started going down from 50k to 20k per second (it took 3 days).
> The SPO => SPO->POS, SPO->OSP phase ran at 25-50k per second
> until 1B, where it went from 25k down to 4k triples per second;
> it's currently at 3.7B triples.
> 
> 
> 
> On Sun, 12 Sept 2021 at 04:59, Laura Morales <la...@mail.com> wrote:
> 
>> Just a personal curiosity... are you building it on an SSD or HDD? What is
>> your "triples loaded per second" rate?
>>
>>
>>> Sent: Sunday, September 12, 2021 at 2:39 AM
>>> From: "Cristóbal Miranda" <cr...@gmail.com>
>>> To: users@jena.apache.org
>>> Subject: Faster TDB2 build?
>>>
>>> Hi,
>>>
>>> I'm running tdb2.tdbloader on Wikidata, but it's
>>> taking too long: it's now on day 11 and still indexing,
>>> whereas tdbloader2 (for TDB) didn't take as long for me.
>>> I was wondering if something could be done to allow
>>> more RAM to be used in the build phase in order to make it
>>> faster, for example by passing a memory budget parameter to
>>> the loader. I'm not sure exactly how the extra RAM would be
>>> used, but I was thinking that if more B+tree blocks
>>> were kept in RAM this processing would be faster, for
>>> example by keeping the 2 upper levels of the tree in primary
>>> memory, or even everything, if the given budget allowed it.
>>>
>>> What would it take to implement such a feature? Maybe in a
>>> tdb2.tdbloader2? I was looking at the code for a way to do
>>> something but couldn't find an easy modification to achieve this.
>>>
>>
> 

Re: Faster TDB2 build?

Posted by Cristóbal Miranda <cr...@gmail.com>.
SSD. The first phase ran at 50-90k triples per second until 3B triples,
where it started going down from 50k to 20k per second (it took 3 days).
The SPO => SPO->POS, SPO->OSP phase ran at 25-50k per second
until 1B, where it went from 25k down to 4k triples per second;
it's currently at 3.7B triples.
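
To put those rates in perspective, a quick calculation of the time per
additional billion triples at each observed rate:

// Quick arithmetic on the rates above: elapsed time per additional
// billion triples at each observed throughput.
public class RateMath {
    public static void main(String[] args) {
        long billion = 1_000_000_000L;
        int[] ratesPerSec = { 50_000, 20_000, 4_000 };  // figures above
        for (int r : ratesPerSec) {
            double days = (double) billion / r / 86_400;  // 86400 s/day
            System.out.printf("%,d triples/s -> %.1f days per 1B triples%n",
                    r, days);  // 0.2, 0.6, 2.9 days respectively
        }
    }
}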



On Sun, 12 Sept 2021 at 04:59, Laura Morales <la...@mail.com> wrote:

> Just a personal curiosity... are you building it on an SSD or HDD? What is
> your "triples loaded per second" rate?
>
>
> > Sent: Sunday, September 12, 2021 at 2:39 AM
> > From: "Cristóbal Miranda" <cr...@gmail.com>
> > To: users@jena.apache.org
> > Subject: Faster TDB2 build?
> >
> > Hi,
> >
> > I'm running tdb2.tdbloader on Wikidata, but it's
> > taking too long: it's now on day 11 and still indexing,
> > whereas tdbloader2 (for TDB) didn't take as long for me.
> > I was wondering if something could be done to allow
> > more RAM to be used in the build phase in order to make it
> > faster, for example by passing a memory budget parameter to
> > the loader. I'm not sure exactly how the extra RAM would be
> > used, but I was thinking that if more B+tree blocks
> > were kept in RAM this processing would be faster, for
> > example by keeping the 2 upper levels of the tree in primary
> > memory, or even everything, if the given budget allowed it.
> >
> > What would it take to implement such a feature? Maybe in a
> > tdb2.tdbloader2? I was looking at the code for a way to do
> > something but couldn't find an easy modification to achieve this.
> >
>

Re: Faster TDB2 build?

Posted by Laura Morales <la...@mail.com>.
Just a personal curiosity... are you building it on an SSD or HDD? What is your "triples loaded per second" rate?


> Sent: Sunday, September 12, 2021 at 2:39 AM
> From: "Cristóbal Miranda" <cr...@gmail.com>
> To: users@jena.apache.org
> Subject: Faster TDB2 build?
>
> Hi,
> 
> I'm running tdb2.tdbloader on Wikidata, but it's
> taking too long: it's now on day 11 and still indexing,
> whereas tdbloader2 (for TDB) didn't take as long for me.
> I was wondering if something could be done to allow
> more RAM to be used in the build phase in order to make it
> faster, for example by passing a memory budget parameter to
> the loader. I'm not sure exactly how the extra RAM would be
> used, but I was thinking that if more B+tree blocks
> were kept in RAM this processing would be faster, for
> example by keeping the 2 upper levels of the tree in primary
> memory, or even everything, if the given budget allowed it.
>
> What would it take to implement such a feature? Maybe in a
> tdb2.tdbloader2? I was looking at the code for a way to do
> something but couldn't find an easy modification to achieve this.
>