Posted to users@marmotta.apache.org by Blake Regalia <bl...@gmail.com> on 2018/02/13 20:11:26 UTC

Reason for not using auto-incrementing id field?

What was the justification for using the 'snowflake' bigint type for the id
fields on nodes, triples and namespaces?

 - Blake

Re: Reason for not using auto-incrementing id field?

Posted by Blake Regalia <bl...@gmail.com>.

Thanks Sebastian, I was also curious whether anyone had run perf tests for
this. Good to know! I can see how it would speed up bulk importing if the
importer (1) assumes the ids of existing nodes will not change during
execution, (2) keeps a synced mapping of datatype and predicate ids in
memory, and (3) inserts nodes/triples that reference the ids of nodes it
inserted in the previous query. That would eliminate the overhead of the db
returning the auto-incremented ids after each insertion.
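
A minimal sketch of the three steps above, assuming an initially empty store
and a heavily simplified schema. The table layout, the bulk_import function
and the counter used as an id source are all hypothetical stand-ins for
illustration, not Marmotta's importer. It uses Python's built-in sqlite3
module only so that it runs without a database server:

```python
import itertools
import sqlite3

def bulk_import(conn, statements, next_id):
    """Insert (subject, predicate, object) strings using ids assigned client-side."""
    node_ids = {}                  # in-memory mapping: node value -> id we assigned
    node_rows, triple_rows = [], []

    def node_id(value):
        # Assign an id locally the first time a value is seen; no database query.
        if value not in node_ids:
            node_ids[value] = next_id()
            node_rows.append((node_ids[value], "uri", value))
        return node_ids[value]

    for s, p, o in statements:
        triple_rows.append((next_id(), node_id(s), node_id(p), node_id(o)))

    # Two batched statements; no generated ids have to be read back.
    with conn:
        conn.executemany(
            "INSERT INTO nodes (id, ntype, svalue) VALUES (?, ?, ?)", node_rows)
        conn.executemany(
            "INSERT INTO triples (id, subject, predicate, object) VALUES (?, ?, ?, ?)",
            triple_rows)
    return node_ids

# Hypothetical, simplified tables (not the real Marmotta schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, ntype TEXT, svalue TEXT)")
conn.execute(
    "CREATE TABLE triples (id INTEGER PRIMARY KEY, subject INTEGER, "
    "predicate INTEGER, object INTEGER)")

counter = itertools.count(1)       # stand-in for a client-side id generator
ids = bulk_import(
    conn,
    [("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:a")],
    lambda: next(counter))
```

With database-generated ids, each new node would instead need its id
returned before any triple referencing it could be written.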

I haven't read deeply enough into the importer code to see whether it
already uses this strategy. Unfortunately (on 584 at least, where geometry
is another ntype), the importer really struggles beyond a few million
nodes; the bottleneck seems to be the process of checking whether a node
already exists. I haven't run perf tests yet, but for now I've added this
unique index to my (postgres) nodes table so that I can 'DO NOTHING' on
insert conflict while I try out a multi-threaded importer from node.js:

CREATE UNIQUE INDEX idx_node_essence ON nodes(ntype, svalue, ltype, lang);
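
A small sketch of how such an index behaves with 'DO NOTHING', run against
SQLite (3.24 or later) through Python's sqlite3 module rather than postgres
so that it is self-contained. The table definition and the insert_node helper
are assumptions made for illustration. One thing it makes visible: NULLs
count as distinct in a unique index, so rows whose ltype or lang is NULL are
not deduplicated. PostgreSQL's default behaviour is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified nodes table; the real schema has more columns.
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, ntype TEXT, svalue TEXT, "
    "ltype INTEGER, lang TEXT)")
conn.execute(
    "CREATE UNIQUE INDEX idx_node_essence ON nodes(ntype, svalue, ltype, lang)")

def insert_node(node_id, ntype, svalue, ltype, lang):
    # Silently skip the row if it collides with an existing unique key.
    conn.execute(
        "INSERT INTO nodes (id, ntype, svalue, ltype, lang) "
        "VALUES (?, ?, ?, ?, ?) ON CONFLICT DO NOTHING",
        (node_id, ntype, svalue, ltype, lang))

def count(svalue):
    return conn.execute(
        "SELECT COUNT(*) FROM nodes WHERE svalue = ?", (svalue,)).fetchone()[0]

# All indexed columns non-NULL: the second insert is ignored.
insert_node(1, "string", "hello", 10, "en")
insert_node(2, "string", "hello", 10, "en")

# ltype and lang are NULL: the index does not see these rows as equal,
# so both are stored.
insert_node(3, "uri", "http://example.org/a", None, None)
insert_node(4, "uri", "http://example.org/a", None, None)

literal_rows = count("hello")
uri_rows = count("http://example.org/a")
```

If nodes without a datatype or language should also be deduplicated, the
NULLs would need to be handled separately, for example by indexing
COALESCE'd values.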

 - Blake

On Tue, Feb 13, 2018 at 10:56 PM, Sebastian Schaffert <
sebastian.schaffert@gmail.com> wrote:

> Hi Blake,
>
> I did performance tests back then, it actually makes a significant
> difference on most databases, especially for batch imports. Even more if
> the database is not running on localhost. Not sure about the actual numbers
> though. You can always switch to the database sequence generator for IDs if
> you want to try it out yourself, I think it's still available and it's a
> simple configuration option.
>
> Sebastian
>
>
> Blake Regalia <bl...@gmail.com> wrote on Wed, Feb 14, 2018,
> 01:00:
>
>> I can see how this makes sense for future compatibility with distributed
>> systems across a variety of RDBMS, although I'm not convinced it's more
>> efficient for single nodes (e.g., auto-incrementing fields do not require
>> round trips). Thanks for the reply! Just wanted to know while porting a
>> bulk importer for 584.
>>
>>
>>  - Blake
>>
>> On Tue, Feb 13, 2018 at 12:15 PM, Sebastian Schaffert <
>> sebastian.schaffert@gmail.com> wrote:
>>
>>> Hi Blake,
>>>
>>> Auto-increment requires querying the database for the next sequence
>>> number (or the last given ID, depending on the database you use), and
>>> that's adding another database roundtrip. Snowflake is purely in code, very
>>> fast to compute, and safe even in distributed setups.
>>>
>>> Is it causing problems?
>>>
>>> Sebastian
>>>
>>> Blake Regalia <bl...@gmail.com> wrote on Tue, Feb 13, 2018,
>>> 21:11:
>>>
>>>> What was the justification for using the 'snowflake' bigint type for
>>>> the id fields on nodes, triples and namespaces?
>>>>
>>>>
>>>>  - Blake
>>>>
>>>
>>

Re: Reason for not using auto-incrementing id field?

Posted by Sebastian Schaffert <se...@gmail.com>.

Hi Blake,

I did performance tests back then, and it actually makes a significant
difference on most databases, especially for batch imports, and even more
so if the database is not running on localhost. I'm not sure about the
actual numbers, though. You can always switch to the database sequence
generator for IDs if you want to try it out yourself; I think it's still
available, and it's a simple configuration option.

Sebastian

Blake Regalia <bl...@gmail.com> wrote on Wed, Feb 14, 2018,
01:00:

> I can see how this makes sense for future compatibility with distributed
> systems across a variety of RDBMS, although I'm not convinced it's more
> efficient for single nodes (e.g., auto-incrementing fields do not require
> round trips). Thanks for the reply! Just wanted to know while porting a
> bulk importer for 584.
>
>
>  - Blake
>
> On Tue, Feb 13, 2018 at 12:15 PM, Sebastian Schaffert <
> sebastian.schaffert@gmail.com> wrote:
>
>> Hi Blake,
>>
>> Auto-increment requires querying the database for the next sequence
>> number (or the last given ID, depending on the database you use), and
>> that's adding another database roundtrip. Snowflake is purely in code, very
>> fast to compute, and safe even in distributed setups.
>>
>> Is it causing problems?
>>
>> Sebastian
>>
>> Blake Regalia <bl...@gmail.com> wrote on Tue, Feb 13, 2018,
>> 21:11:
>>
>>> What was the justification for using the 'snowflake' bigint type for the
>>> id fields on nodes, triples and namespaces?
>>>
>>>
>>>  - Blake
>>>
>>
>

Re: Reason for not using auto-incrementing id field?

Posted by Blake Regalia <bl...@gmail.com>.

I can see how this makes sense for future compatibility with distributed
systems across a variety of RDBMS, although I'm not convinced it's more
efficient for single nodes (e.g., auto-incrementing fields do not require
round trips). Thanks for the reply! Just wanted to know while porting a
bulk importer for 584.

 - Blake

On Tue, Feb 13, 2018 at 12:15 PM, Sebastian Schaffert <
sebastian.schaffert@gmail.com> wrote:

> Hi Blake,
>
> Auto-increment requires querying the database for the next sequence number
> (or the last given ID, depending on the database you use), and that's
> adding another database roundtrip. Snowflake is purely in code, very fast
> to compute, and safe even in distributed setups.
>
> Is it causing problems?
>
> Sebastian
>
> Blake Regalia <bl...@gmail.com> wrote on Tue, Feb 13, 2018,
> 21:11:
>
>> What was the justification for using the 'snowflake' bigint type for the
>> id fields on nodes, triples and namespaces?
>>
>>
>>  - Blake
>>
>

Re: Reason for not using auto-incrementing id field?

Posted by Sebastian Schaffert <se...@gmail.com>.

Hi Blake,

Auto-increment requires querying the database for the next sequence number
(or the last given ID, depending on the database you use), which adds
another database round trip. Snowflake IDs are generated purely in code,
are very fast to compute, and are safe even in distributed setups.
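
For illustration, a snowflake-style generator can be sketched as below. The
41/10/12-bit layout and the epoch follow the commonly described Twitter
scheme and are assumptions here; the class name and parameters are
hypothetical, and Marmotta's own generator may differ in its details.

```python
import threading
import time

class SnowflakeGenerator:
    """Illustrative snowflake-style id generator (not Marmotta's code)."""

    EPOCH_MS = 1288834974657   # assumed custom epoch; any fixed past instant works
    WORKER_BITS = 10           # up to 1024 independent generators
    SEQUENCE_BITS = 12         # up to 4096 ids per millisecond per generator

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms():
        return int(time.time() * 1000)

    def next_id(self):
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the sequence, wrapping at 4096.
                self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted; spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            # Layout: | timestamp (41 bits) | worker (10 bits) | sequence (12 bits) |
            return (((now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS))
                    | (self.worker_id << self.SEQUENCE_BITS)
                    | self._sequence)

gen = SnowflakeGenerator(worker_id=1)
a, b = gen.next_id(), gen.next_id()   # b > a, and neither call touches a database
```

The result fits in a signed 64-bit bigint, and two generators with different
worker ids cannot collide, which is what makes the scheme usable without
coordination between machines.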

Is it causing problems?

Sebastian

Blake Regalia <bl...@gmail.com> wrote on Tue, Feb 13, 2018,
21:11:

> What was the justification for using the 'snowflake' bigint type for the
> id fields on nodes, triples and namespaces?
>
>
>  - Blake
>