Posted to dev@jena.apache.org by "A. Soroka" <aj...@virginia.edu> on 2016/10/20 20:01:39 UTC

tdbloader2 hex files

I've been tinkering with tdbloader2 and optimizing a use of it, and now I've got a question about the intermediate files created (data-quads.tmp and data-triples.tmp).

They contain hexadecimal numbers that represent node IDs in tuples: the contents of the database to be built, in tabular form. The TDB node IDs are 64-bit integers, if I remember correctly, and as I say, they are represented in these files as long hex strings. These data files are sorted before being packed into indexes, and that sort is done with plain old POSIX `sort`.

If `sort` is the tool to be used (or at least the default, since it can be aliased out if appropriate), wouldn't it make more sense for those IDs to be decimal-radix integers, so that numeric comparison (which is often faster because it avoids locale machinery; even the 'C' locale has some work involved) could be used in `sort`? To my knowledge, most `sort`s out there won't handle hex with numeric comparison, only with string comparison.

Or am I (as I often am) missing something about how those numbers get used? Obviously, decimal versions of the IDs would be darn big numbers and less readable, but if that is the concern, would there be any objection to providing a switch on the utility to choose which radix and comparison function to use?

---
A. Soroka
The University of Virginia Library

Re: tdbloader2 hex files

Posted by "A. Soroka" <aj...@virginia.edu>.

Looks like tdbloader2 is currently using a nice non-Unicode locale, so that's covered.

https://github.com/apache/jena/blob/master/apache-jena/bin/tdbloader2index#L146

---
A. Soroka
The University of Virginia Library

> On Oct 21, 2016, at 9:22 AM, Stian Soiland-Reyes <st...@apache.org> wrote:
> 
> Isn't numerical sorting a fair bit slower if you need to convert from
> decimal to binary representation first? The algorithm for this is quite
> convoluted and doesn't have fixed costs - until recently there was even a bug
> on some platforms where a particular string caused an infinite loop. (But
> that might have been for floating point :-)
> 
> Byte-by-byte comparison without unicode should be fairly fast.. but worth
> checking if "sort" is a slowdown (I didn't think it was the slowest bit of
> tdbloader2)
> 

Re: tdbloader2 hex files

Posted by Andy Seaborne <an...@apache.org>.

Sorting hex can be done by string comparison: A-F > 0-9 in ASCII.

With enough padding, so can decimal, but hex is natural.
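
As a rough illustration of that point, the Python sketch below checks that fixed-width, zero-padded hex strings sort in the same order as the integers they encode, and that padded decimal does too, while unpadded decimal does not. The 16-digit uppercase format and the helper names `to_hex` and `to_padded_decimal` are assumptions made for the example; they are not taken from the tdbloader2 intermediate files.

```python
import random

def to_hex(node_id: int) -> str:
    # Assumed format: 16 uppercase hex digits, zero-padded (enough for 64 bits).
    return format(node_id, "016X")

def to_padded_decimal(node_id: int) -> str:
    # 2**64 - 1 has 20 decimal digits, so pad to a width of 20.
    return format(node_id, "020d")

random.seed(0)
ids = [random.getrandbits(64) for _ in range(1000)]
ids += [0, 9, 10, 15, 16, 2**64 - 1]  # small values and the 64-bit maximum

numeric_order = sorted(ids)
hex_order = sorted(ids, key=to_hex)                  # plain string comparison
decimal_order = sorted(ids, key=to_padded_decimal)   # plain string comparison
unpadded_order = sorted(ids, key=str)                # "10" sorts before "9"

print(hex_order == numeric_order)
print(decimal_order == numeric_order)
print(unpadded_order == numeric_order)
```

Python compares ASCII strings code point by code point, which for this input matches a byte-wise comparison such as `sort` performs in the C locale. The equivalence depends on every field having the same width and a consistent letter case.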

The loader needs some kind of external sort - ideally it would be based 
on binary. sort(1) is widely available and handles all the spill-merge 
work. It is well-tried in a wide variety of environments.  Sorting at 
scale is not a trivial task.
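
For readers unfamiliar with the term, here is a minimal Python sketch of the spill-merge idea: sort chunks that fit in memory, write each one to a temporary run file, then merge the runs. It is only an illustration under simplifying assumptions (a single merge pass, newline-terminated text lines, a hypothetical `external_sort` helper); it is not how tdbloader2 or sort(1) is implemented.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(lines, chunk_size=100_000):
    """Yield newline-terminated lines in sorted order, spilling sorted runs to disk."""
    run_paths = []
    iterator = iter(lines)
    try:
        # Spill phase: sort one chunk at a time in memory and write it out as a run.
        while True:
            chunk = list(itertools.islice(iterator, chunk_size))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
        # Merge phase: stream all sorted runs through a single k-way merge.
        files = [open(p) for p in run_paths]
        try:
            yield from heapq.merge(*files)
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary run files once the merge is finished or abandoned.
        for p in run_paths:
            os.remove(p)

if __name__ == "__main__":
    sample = [format(n, "016X") + "\n" for n in (5, 3, 2**63, 0, 255, 16, 9)]
    for line in external_sort(sample, chunk_size=2):
        print(line, end="")
```

A production external sort also has to deal with memory limits, very large numbers of runs (multi-level merges), temporary disk space, and failure handling, which is part of why reusing a well-tested tool is attractive.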

     Andy

On 21/10/16 14:27, A. Soroka wrote:
> I haven't heard anything about such problems with straight integers (which is what we have here), but I may very well just not have come across them. Indeed, keeping a non-Unicode locale helps a great deal, and there are probably other places where TDB loading could go faster-- I'm just looking for low-hanging fruit, and I am also honestly curious why text files with hex were chosen (instead, for example, of some very compact format with a sort algorithm in Java).
>
> My (very rough, not carefully controlled) example with about 300Mt showed that sort was actually a good chunk of the index phase (as opposed to the data phase).  It's not obvious to me that there could be anything special about my data, but there might be, I suppose.
>
> ---
> A. Soroka
> The University of Virginia Library
>
>

Re: tdbloader2 hex files

Posted by "A. Soroka" <aj...@virginia.edu>.

I haven't heard anything about such problems with straight integers (which is what we have here), but I may very well just not have come across them. Indeed, keeping a non-Unicode locale helps a great deal, and there are probably other places where TDB loading could go faster-- I'm just looking for low-hanging fruit, and I am also honestly curious why text files with hex were chosen (instead, for example, of some very compact format with a sort algorithm in Java).

My (very rough, not carefully controlled) example with about 300Mt showed that sort was actually a good chunk of the index phase (as opposed to the data phase).  It's not obvious to me that there could be anything special about my data, but there might be, I suppose.

---
A. Soroka
The University of Virginia Library

> On Oct 21, 2016, at 9:22 AM, Stian Soiland-Reyes <st...@apache.org> wrote:
> 
> Isn't numerical sorting a fair bit slower if you need to convert from
> decimal to binary representation first? The algorithm for this is quite
> convoluted and doesn't have fixed costs - until recently there was even a bug
> on some platforms where a particular string caused an infinite loop. (But
> that might have been for floating point :-)
> 
> Byte-by-byte comparison without unicode should be fairly fast.. but worth
> checking if "sort" is a slowdown (I didn't think it was the slowest bit of
> tdbloader2)
> 

Re: tdbloader2 hex files

Posted by Stian Soiland-Reyes <st...@apache.org>.

Isn't numerical sorting a fair bit slower if you need to convert from
decimal to binary representation first? The algorithm for this is quite
convoluted and doesn't have fixed costs - until recently there was even a bug
on some platforms where a particular string caused an infinite loop. (But
that might have been for floating point :-)

Byte-by-byte comparison without unicode should be fairly fast.. but worth
checking if "sort" is a slowdown (I didn't think it was the slowest bit of
tdbloader2)
