Posted to dev@devlake.apache.org by Klesh Wong <kl...@apache.org> on 2022/06/13 14:24:45 UTC

[discuss] team entity design

  I meant to post the proposals for the Team Entity Design to this mailing
list, but they involve too many diagrams, tables, and code. So I posted them
on
https://github.com/apache/incubator-devlake/issues/1680#issuecomment-1153588720
instead.

  I suggest that everyone take a look, and either vote for whichever you
like or propose your own solution.


Note that we have 2 TOPICS to decide:

 1. How to aggregate commits by Natural Person, which is prefixed by
    `proposal 1.x`
 2. What should be the Primary Key of the `people` table, which is
    prefixed by `proposal 2.x`

Please reply to this email with your favorite proposal options, like:


+1 proposal 1.1

+1 proposal 2.1


PICK ONE OPTION FOR EACH TOPIC

or, post your thoughts.


Thanks


Klesh Wong

Re: [discuss] team entity design => table name

Posted by Kaiyun Zhang <ka...@merico.dev.INVALID>.
We had a discussion about the naming of the tables. There were 2 proposals:
1. identities, unique_identities
2. accounts, users

I personally prefer the 2nd one.

If there are no strong objections from anyone, we will use this naming for subsequent design and discussion for now.
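
To make the second naming option a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3`. It assumes, as discussed elsewhere in this thread, that `accounts` holds the per-platform records and `users` holds the unified identities. The `user_accounts` link table, the `commits` table layout, all column names, and the sample rows are hypothetical and are not taken from the actual DevLake schema.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT);              -- unified identities
CREATE TABLE accounts (id TEXT PRIMARY KEY, platform TEXT, email TEXT);  -- per-platform records
CREATE TABLE user_accounts (                                      -- hypothetical link table
    user_id TEXT REFERENCES users(id),
    account_id TEXT REFERENCES accounts(id),
    PRIMARY KEY (user_id, account_id)
);
CREATE TABLE commits (
    sha TEXT PRIMARY KEY,
    author_account_id TEXT REFERENCES accounts(id)
);
""")

conn.executemany("INSERT INTO users VALUES (?, ?)", [("u1", "Alice")])
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", [
    ("git:alice@work.example", "git", "alice@work.example"),
    ("git:alice@home.example", "git", "alice@home.example"),
    ("git:bob@example.com", "git", "bob@example.com"),  # not linked to any user yet
])
conn.executemany("INSERT INTO user_accounts VALUES (?, ?)", [
    ("u1", "git:alice@work.example"),
    ("u1", "git:alice@home.example"),
])
conn.executemany("INSERT INTO commits VALUES (?, ?)", [
    ("c1", "git:alice@work.example"),
    ("c2", "git:alice@home.example"),
    ("c3", "git:bob@example.com"),
])


def commits_per_user(connection):
    """Count commits per unified user by joining through the link table."""
    rows = connection.execute("""
        SELECT ua.user_id, COUNT(*)
        FROM commits c
        JOIN user_accounts ua ON ua.account_id = c.author_account_id
        GROUP BY ua.user_id
        ORDER BY ua.user_id
    """).fetchall()
    return dict(rows)


result = commits_per_user(conn)
print(result)
```

In this sketch, commits authored under two different emails are counted under one `users` row, while a commit from an account that has not been linked to any user is simply left out of the aggregation.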


> On June 16, 2022, at 9:05 AM, Klesh Wong <kl...@apache.org> wrote:
> 
> I see, that makes sense to me.
> 
> I would like to use `user` for the `unified identity` as well. As I said, we picked `person` over `user` because `user` is ambiguous: depending on the context, it can refer to the `unified identity`, to an `account` on some platform, or to whoever is using Apache DevLake.
> 
> I will talk to the others and post the result here. Thanks for your input, it was very helpful.
> 
> Best
> 
> Klesh Wong
> 
> On 6/16/22 00:39, Jinglei Ren wrote:
>> OK, if `user` and `account` have been exchanged, there is surely no reason to go back. But why did you need `person`? Now `user` is the “person” you referred to, right? If so, that’s totally fine.
>> 
>> 
>> With that said, let’s clarify the root cause of the mess. Otherwise, more mess will show up in the future. The root cause is less about naming and more about the model, and a high-quality model is critical for a data product.
>> 
>> We should first define the model very well and then layer names on it – they are not necessarily two separate steps, but it is important to keep this principle in mind. Essentially, any names are fine. But you are right: to make it easier for everyone to express themselves, we should choose intuitive names.
>> 
>> Take what you said for example: “an `account` might have multiple `users` on one or multiple `platforms`” vs. “a `user` might have multiple `accounts` on one or multiple `platforms`.” They are not opposites. They are the same model: A has multiple Bs on one or more platforms. The only difference is whether A or B is called user or account. I agree the naming can follow your convenience.
>> 
>> So, let me finally confirm with you that the following is the current model and that there is no separate `person` of the kind you referred to.
>>> 1. `A`: the unified identity on Apache DevLake.
>>> 2. `B`: a website (github.com/gitlab.com/etc...), or abstract domain (git repository, … and the only reliable identity for a git user is email)
>>> 3. `C`: a registration record to represent a user on a B, but an A may or may not map to multiple Cs on a specific B.
>>> Now, what we try to do here is to group those Cs by A… (take git author_email as an example, different emails can belong to one A).
>> If it is confirmed, I have no issue with the current design.
>> 
>> 
>> (A side note: With all the above said, if we could replay the history, I would vote for reserving the word user for platforms but using the word account for the unified identity of Apache DevLake. The mess in your expression was mainly due to unclear definitions and should be resolved in the right way. But I don’t argue for this now as you’ve already made the decision. As long as the model is good, naming per se should not cost more of our time.)
>> 
>> From: Klesh Wong <kl...@apache.org>
>> Date: Wednesday, June 15, 2022 at 10:53 PM
>> To: dev@devlake.apache.org <de...@devlake.apache.org>
>> Subject: Re: [discuss] team entity design => table name
>> I see, yeah, a couple of days back we all agreed that it was better to
>> keep `users` as it was, and add another entity to represent the
>> `unified identity`.
>> 
>> But that caused a mess during multiple discussions; many of us, myself
>> included, couldn't even express ourselves. So we gave up and agreed that
>> it is better to rename the existing `users` to `accounts` for the greater good.
>> 
>> I think the terms you defined would cause a much, much bigger mess
>> for us when expressing our thoughts, especially for me... -_-!!!
>> 
>> Correct me if I'm wrong: by your definition, an `account` might have
>> multiple `users` on one or multiple `platforms`.
>> 
>> This is the opposite of my understanding: a `user` might have multiple
>> `accounts` on one or multiple `platforms`.
>> 
>> Another reason why we wanted to avoid using `user` is that it sometimes
>> refers to the people using Apache DevLake.
>> 
>> Does it make sense?
>> 
>> 
>> Thanks
>> 
>> Klesh Wong
>> 
>> On 6/15/22 21:14, Jinglei Ren wrote:
>>> The bad smell comes from “a living thing” which the system should not model.
>>> 
>>> We can follow most of your model but (1) merge `person` and `user` in your model and name it `account`; (2) rename the `account` in your model to `user`.
>>> 
>>> The reason for (2) is that, as mentioned in https://github.com/apache/incubator-devlake/issues/1680, “we thought of changing the existing table.users to table.accounts and adding a table.users to represent … natural people, but that will cause many changes in the code.” So, it is good to keep the word `user` for various platforms rather than introduce the `account` in your model.
>>> 
>>> All in all, we can use the new `account` concept and rephrase your model.
>>> 
>>> 1. `account`: the unified identity on Apache DevLake for collecting and analyzing data from different platforms.
>>> 2. `platform`: a website (github.com/gitlab.com/etc...), or abstract domain (git repository, … and the only reliable identity for a git user is email)
>>> 3. `user`: a registration record to represent a user on a `platform`, but an `account` may or may not map to multiple `users` on a specific platform.
>>>    (1) any `account` is always associated with a single user on a single platform (we don't need `account` table)
>>>    (2) some `account` is associated with one user on each of multiple platforms (we need `account` table)
>>>    (3) some `account` is associated with multiple users on multiple platforms (we need `account` table badly)
>>> Now, what we try to do here is to group those `users` by `account`… (take git author_email as
>>> an example, different emails can belong to one `account`).
>>> 
>>> You can see the refined model is simpler than your original one. So, to quickly form consensus, the decision point can be like this: (1) If the above refined model meets the requirements, my understanding should be correct and my irritation with `person` actually leads to better definitions. Then let’s go with it and we won’t spend more time on the word choice of `account`, for example. (2) If the above refined model doesn’t work or misses something, my understanding should be flawed so please just keep to your original model and `person` and ignore this thread.
>>> 
>>> Thanks,
>>> Jinglei
>>> 
>>> From: Klesh Wong <kl...@apache.org>
>>> Date: Wednesday, June 15, 2022 at 2:30 PM
>>> To: dev@devlake.apache.org <de...@devlake.apache.org>
>>> Subject: Re: [discuss] team entity design => table name
>>> Let's bear with the existing terms a little bit longer. I don't buy your
>>> definition of `account` just yet. Here is why:
>>> 
>>>   1. `person`: a Living Thing (Human, Dog, or Alien)
>>>   2. `user`: a `person` who is using Apache DevLake to collect and
>>>      analyze DevOps data
>>>   3. `platform`: a website(github.com/gitlab.com/etc...), or abstract
>>>      domain(git repository, it can be cloned to different
>>>      machines/websites, but somehow we treat them the same git repo, and
>>>      the only reliable identity for `person` is email)
>>>   4. `account`: a registration record to represent a `person` on a
>>>      `platform`, but a `person` may or may not have multiple `accounts`
>>>      on a specific platform.
>>>       1. one `person` registers on one platform once and uses it
>>>          forever (we don't need a `person` table)
>>>       2. one `person` registers on multiple platforms once each and
>>>          uses them forever (we need a `person` table)
>>>       3. one `person` registers on multiple platforms multiple times
>>>          each and uses some of them (we need a `person` table badly)
>>> 
>>> Now, what we are trying to do here is to group those `accounts` by `person`,
>>> which is why we introduced `person`. We don't have enough clues to figure
>>> out who is who across multiple platforms; even worse, we can't even
>>> figure out who is who on a specific platform (take git author_email as
>>> an example: different emails can belong to one `person`).
>>> 
>>> So, most of us agreed the best way to solve the problem is to aggregate
>>> all those accounts from different platforms into one table named
>>> `accounts`, and then let `user` connect them to `persons`.
>>> 
>>> Hope that explains the situation here.
>>> 
>>> 
>>> Ok, would you mind explaining your idea of how to address the problem by
>>> using only a single table?
>>> 
>>> 
>>> Thanks
>>> 
>>> Klesh Wong
>>> 
>>> On 6/15/22 10:18, Jinglei Ren wrote:
>>>> I am changing the email title to branch out and avoid distracting your main thread. Right, this is not a big deal, so let’s conclude quickly.
>>>> 
>>>> You know, ambiguity can only be resolved by defining the concepts. Otherwise, `persons` do not help either. What I proposed was to just define `accounts` as your previous concept of persons or unified users. The example in your last email was a wrong use of the concept (such as in “we introduce `people` or `persons` or `unified users` to link those `accounts` together” – you still used `account` to refer to Git emails or duplicate Git users.).
>>>> 
>>>> Now let’s switch to the new definition of account. Then there can be two ways to handle a new commit email: (1) we can directly create a new account for it and later merge it into another account if it is a duplicate; (2) the commit emails are just modeled as `emails`, or left unlinked to any account, and are linked to accounts whenever possible.
>>>> 
>>>> Thanks,
>>>> Jinglei
>>>> 
>>>> From: Klesh Wong <kl...@apache.org>
>>>> Date: Tuesday, June 14, 2022 at 11:52 PM
>>>> To: dev@devlake.apache.org <de...@devlake.apache.org>
>>>> Subject: Re: [discuss] team entity design
>>>> I'm ok with any name as long as @Julien @Keon @Hezheng are ok with it.
>>>> 
>>>> As for `table.accounts`, I don't understand: how can it represent
>>>> `unified users` while also representing multiple accounts?
>>>> 
>>>> For example, we are collecting `commits` data with `gitextractor`. In
>>>> order to associate a specific `commit` with a specific account, what we
>>>> can do is create an `account` with `commit.author_email` as the PK. But
>>>> one might create commits with different email addresses, so we introduce
>>>> `people` or `persons` or `unified users` to link those `accounts` together.
>>>> 
>>>> Thanks,
>>>> 
>>>> Klesh Wong
>>>> 
>>>> On 6/14/22 21:27, Jinglei Ren wrote:
>>>>> Just a comment: `people` would be better as `persons`, to make it consistent with other plural names as well as `person_teams`, etc.
>>>>> 
>>>>> I see the reasons for this name, but I am still against `people` or `persons` because our system should not model natural persons at all. In some sense, it cannot because you never know if it is a person or a dog :p The key point is that we should consider the concept itself, not just convenience of use.
>>>>> 
>>>>> So, why not keep all types of user names as they are from different data sources and just add `table.accounts` to represent the standard/unified users?
>>>>> 
>>>>> Thanks,
>>>>> Jinglei
>>>>> 


Re: [discuss] team entity design => table name

Posted by Klesh Wong <kl...@apache.org>.
I see, that makes sense to me.

I would like to use `user` for the `unified identity` as well. As I said, we
picked `person` over `user` because `user` is ambiguous: depending on the
context, it can refer to the `unified identity`, to an `account` on some
platform, or to whoever is using Apache DevLake.

I will talk to the others and post the result here. Thanks for your input,
it was very helpful.

Best

Klesh Wong


Re: [discuss] team entity design => table name

Posted by Jinglei Ren <ji...@merico.dev.INVALID>.
OK, if `user` and `account` have been exchanged, there is surely no reason to go back. But why did you need `person`? Now `user` is the “person” you referred to, right? If so, that’s totally fine.


With that said, let’s clarify the root cause of the mess. Otherwise, more mess will show up in the future. The root cause is less about naming and more about the model, and a high-quality model is critical for a data product.

We should first define the model very well and then layer names on it – they are not necessarily two separate steps, but it is important to keep this principle in mind. Essentially, any names are fine. But you are right: to make it easier for everyone to express themselves, we should choose intuitive names.

Take what you said for example: “an `account` might have multiple `users` on one or multiple `platforms`” vs. “a `user` might have multiple `accounts` on one or multiple `platforms`.” They are not opposites. They are the same model: A has multiple Bs on one or more platforms. The only difference is whether A or B is called user or account. I agree the naming can follow your convenience.

So, let me finally confirm with you that the following is the current model and that there is no separate `person` of the kind you referred to.
> 1. `A`: the unified identity on Apache DevLake.
> 2. `B`: a website (github.com/gitlab.com/etc...), or abstract domain (git repository, … and the only reliable identity for a git user is email)
> 3. `C`: a registration record to represent a user on a B, but an A may or may not map to multiple Cs on a specific B.
> Now, what we try to do here is to group those Cs by A… (take git author_email as an example, different emails can belong to one A).

If it is confirmed, I have no issue with the current design.
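
As a rough illustration of the A/B/C model above, the following Python sketch groups registration records (C) by unified identity (A). The `Record` class, the identity labels such as `A-1`, and the sample mapping are all hypothetical and only serve to show the shape of the relationship; how a C actually gets mapped to an A is exactly what the proposals in the GitHub issue are about and is not decided here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A 'C': one registration record on a platform ('B')."""
    platform: str    # the 'B', e.g. "git" or "github.com"
    identifier: str  # e.g. author email for git, login name elsewhere


# Hypothetical mapping from each C to its A (unified identity).
identity_of = {
    Record("git", "jane@work.example"): "A-1",
    Record("git", "jane@home.example"): "A-1",   # same A, second C on the same B
    Record("github.com", "jane-dev"): "A-1",     # same A on a different B
    Record("git", "sam@example.com"): "A-2",
}


def group_by_identity(mapping):
    """Group the Cs by their A, preserving the mapping's insertion order."""
    groups = defaultdict(list)
    for record, identity in mapping.items():
        groups[identity].append(record)
    return dict(groups)


groups = group_by_identity(identity_of)
for identity, records in groups.items():
    print(identity, [(r.platform, r.identifier) for r in records])
```

The point of the sketch is only that one A can own several Cs, both across platforms and within a single platform such as git, where different author emails can belong to the same A.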


(A side note: With all the above said, if we could replay the history, I would vote for reserving the word user for platforms but using the word account for the unified identity of Apache DevLake. The mess in your expression was mainly due to unclear definitions and should be resolved in the right way. But I don’t argue for this now as you’ve already made the decision. As long as the model is good, naming per se should not cost more of our time.)

>> From: Klesh Wong<kl...@apache.org>
>> Date: Tuesday, June 14, 2022 at 11:52 PM
>> To:dev@devlake.apache.org  <de...@devlake.apache.org>
>> Subject: Re: [discuss] team entity design
>> I'm ok with any name as long as @Julien @Keon @Hezheng are ok with it.
>>
>> As of `table.accounts`, I don't understand, how can it represents
>> `unified users` while it representing multiple accounts?
>>
>> For example, we are collecting `commits` data by `gitextractor`, in
>> order to associate a specific `commit` to a specific account, what we
>> can do is creating an `account` with `commit.author_email` as PK.  But,
>> one might create commits with different email addresses, so we introduce
>> `people` or `persons` or `unified users` to link those `accounts` together.
>>
>> Thanks,
>>
>> Klesh Wong
>>
>> On 6/14/22 21:27, Jinglei Ren wrote:
>>> Just a comment: `people` should better be `persons` to make it consistent with other plural names as well as `person_teams`, etc.
>>>
>>> I see the reasons for this name, but I am still against `people` or `persons` because our system should not model natural persons at all. In some sense, it cannot because you never know if it is a person or a dog :p The key point is that we should consider the concept itself, not just convenience of use.
>>>
>>> So, why not keep all types of user names as they are from different data sources and just add `table.accounts` to represent the standard/unified users?
>>>
>>> Thanks,
>>> Jinglei
>>>
>>> From: Klesh Wong<kl...@apache.org>
>>> Date: Monday, June 13, 2022 at 10:24 PM
>>> To:dev@devlake.apache.org  <de...@devlake.apache.org>
>>> Subject: [discuss] team entity design
>>>     I meant to post the proposals of Team Entity Design to this mailing
>>> list, but too much graphical / table and code involved. So I posted it
>>> on
>>> https://github.com/apache/incubator-devlake/issues/1680#issuecomment-1153588720
>>> instead.
>>>
>>>      I suggest that every take a look, and either vote for whichever you
>>> like or propose your solution.
>>>
>>>
>>> Notice we have 2 TOPICS to decide:
>>>
>>>     1. How to aggregate commits by Natural Person, which is prefixed by
>>>        `proposal 1.x`
>>>     2. What should be the Primary Key of the `people` table, which is
>>>        prefixed by `proposal 2.x`
>>>
>>> Please reply this email with your favorite proposal options, like:
>>>
>>>
>>> +1 proposal 1.1
>>>
>>> +1 proposal 2.1
>>>
>>>
>>> PICK ONE OPTION FOR EACH TOPIC
>>>
>>> or, post your thoughts.
>>>
>>>
>>> Thanks
>>>
>>>
>>> Klesh Wong
>>>

Re: [discuss] team entity design => table name

Posted by Klesh Wong <kl...@apache.org>.
I see. Yes, a couple of days back we all agreed that it was better to
keep `users` as it was and add another entity to represent the
`unified identity`.

But that caused a mess during multiple discussions; many of us, myself
included, could not even express ourselves clearly. So we gave up and
agreed that it is better to rename the existing `users` to `accounts`
for the greater good.

I think the terms you defined would make it even harder for us to
express our thoughts, especially for me... -_-!!!

Correct me if I'm wrong: by your definition, an `account` might have
multiple `users` on one or multiple `platforms`.

This is the opposite of my understanding: a `user` might have multiple
`accounts` on one or multiple `platforms`.

Another reason we wanted to avoid `user` is that it sometimes refers
to the people using Apache DevLake.

Does that make sense?


Thanks

Klesh Wong


Re: [discuss] team entity design => table name

Posted by Jinglei Ren <ji...@merico.dev.INVALID>.
The bad smell comes from “a living thing” which the system should not model.

We can follow most of your model but (1) merge `person` and `user` in your model and name it `account`; (2) rename the `account` in your model to `user`.

The reason for (2) is that, as mentioned in https://github.com/apache/incubator-devlake/issues/1680, “we thought of changing the existing table.users to table.accounts and adding a table.users to represent … natural people, but that will cause many changes in the code.” So, it is good to keep the word `user` for various platforms rather than introduce the `account` in your model.

All in all, we can use the new `account` concept and rephrase your model.

1. `account`: the unified identity on Apache DevLake for collecting and analyzing data from different platforms.
2. `platform`: a website (github.com/gitlab.com/etc...), or abstract domain (git repository, … and the only reliable identity for a git user is email)
3. `user`: a registration record that represents a user on a `platform`; an `account` may or may not map to multiple `users` on a specific platform.
  (1) every `account` is associated with a single user on a single platform (we don't need the `account` table)
  (2) some `accounts` are associated with one user on each of multiple platforms (we need the `account` table)
  (3) some `accounts` are associated with multiple users on multiple platforms (we badly need the `account` table)

Now, what we are trying to do here is to group those `users` by `account`… (take git author_email as
an example: different emails can belong to one `account`).
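
To make the three cases above concrete, here is a minimal Go sketch, assuming the refined terms (`user` = a record on one platform, `account` = the unified identity). The type `PlatformUser`, its fields, and both function names are hypothetical and only illustrate the idea; they are not DevLake code.

```go
package main

import "fmt"

// PlatformUser is a hypothetical record for a `user` in the refined model:
// one registration on one platform, tagged with the unified `account` it
// belongs to.
type PlatformUser struct {
	Platform  string // e.g. "git", "github", "jira"
	ID        string // platform-local identity, e.g. author_email for git
	AccountID string // the unified `account` (assumed to be assigned elsewhere)
}

// GroupByAccount groups platform users by their unified account.
func GroupByAccount(users []PlatformUser) map[string][]PlatformUser {
	groups := make(map[string][]PlatformUser)
	for _, u := range users {
		groups[u.AccountID] = append(groups[u.AccountID], u)
	}
	return groups
}

// NeedsAccountTable reports whether any account maps to more than one
// platform user, i.e. whether we are in case (2) or (3) rather than case (1).
func NeedsAccountTable(users []PlatformUser) bool {
	for _, members := range GroupByAccount(users) {
		if len(members) > 1 {
			return true
		}
	}
	return false
}

func main() {
	users := []PlatformUser{
		{Platform: "git", ID: "alice@example.com", AccountID: "acc-1"},
		{Platform: "git", ID: "alice@work.example", AccountID: "acc-1"},
		{Platform: "github", ID: "alice", AccountID: "acc-1"},
		{Platform: "jira", ID: "73828", AccountID: "acc-2"},
	}
	for account, members := range GroupByAccount(users) {
		fmt.Println(account, "has", len(members), "platform user(s)")
	}
	fmt.Println("account table needed:", NeedsAccountTable(users))
}
```

In case (1) every group has exactly one member, so the grouping adds nothing; the table only earns its keep once a group has two or more members.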

You can see the refined model is simpler than your original one. So, to quickly form consensus, the decision point can be like this: (1) If the above refined model meets the requirements, my understanding should be correct and my irritation with `person` actually leads to better definitions. Then let’s go with it and we won’t spend more time on the word choice of `account`, for example. (2) If the above refined model doesn’t work or misses something, my understanding should be flawed so please just keep to your original model and `person` and ignore this thread.

Thanks,
Jinglei


Re: [discuss] team entity design => table name

Posted by Klesh Wong <kl...@apache.org>.
Let's bear with the existing terms a little bit longer; I don't buy your
definition of `account` just yet. Here is why:

 1. `person`: a Living Thing (Human, Dog, or Alien)
 2. `user`: a `person` who is using Apache DevLake to collect and
    analyze DevOps data
 3. `platform`: a website (github.com/gitlab.com/etc...), or an abstract
    domain (a git repository can be cloned to different
    machines/websites, but we still treat the clones as the same git repo,
    and the only reliable identity for a `person` there is the email)
 4. `account`: a registration record that represents a `person` on a
    `platform`; a `person` may or may not have multiple `accounts`
    on a specific platform.
     1. one `person` registers on one platform once and uses that account
        forever (we don't need the `person` table)
     2. one `person` registers on multiple platforms once each and
        uses them forever (we need the `person` table)
     3. one `person` registers on multiple platforms multiple times each
        and uses some of them (we badly need the `person` table)

Now, what we are trying to do here is to group those `accounts` by `person`,
which is why we introduced `person`. We don't have enough clues to figure
out who is who across multiple platforms; even worse, we can't even
figure out who is who on a specific platform (take git author_email as
an example: different emails can belong to one `person`).

So, most of us agreed that the best way to solve the problem is to aggregate
all those accounts from different platforms into one table named
`accounts`, and then let the `user` connect them to `persons`.
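
As a rough illustration of that aggregation (not the actual DevLake schema), the Go sketch below assumes a git `account` is keyed by `commit.author_email` and that the DevLake `user` maintains an account-to-person link by hand. The `Commit` type, the `accountToPerson` map, and `CommitsByPerson` are hypothetical names.

```go
package main

import "fmt"

// Commit is a hypothetical, trimmed-down commit record.
type Commit struct {
	SHA         string
	AuthorEmail string // assumed to be the key of the git `account`
}

// CommitsByPerson counts commits per `person`. accountToPerson is the link
// the DevLake `user` maintains manually; commits whose account has not been
// linked yet are counted under the empty person ID "".
func CommitsByPerson(commits []Commit, accountToPerson map[string]string) map[string]int {
	counts := make(map[string]int)
	for _, c := range commits {
		// A missing key yields "", which acts as the "unlinked" bucket.
		counts[accountToPerson[c.AuthorEmail]]++
	}
	return counts
}

func main() {
	commits := []Commit{
		{SHA: "a1", AuthorEmail: "alice@example.com"},
		{SHA: "b2", AuthorEmail: "alice@work.example"},
		{SHA: "c3", AuthorEmail: "bob@example.com"},
		{SHA: "d4", AuthorEmail: "unknown@example.com"},
	}
	links := map[string]string{
		"alice@example.com":  "person-alice",
		"alice@work.example": "person-alice",
		"bob@example.com":    "person-bob",
	}
	for person, n := range CommitsByPerson(commits, links) {
		fmt.Printf("%q: %d commit(s)\n", person, n)
	}
}
```

The unlinked bucket makes the "we don't have enough clues" problem visible: anything the `user` has not linked yet stays countable but unattributed.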

Hope that explains the situation here.


Ok, would you mind explaining your idea of how to address the problem by 
using only a single table?


Thanks

Klesh Wong


Re: [discuss] team entity design => table name

Posted by Jinglei Ren <ji...@merico.dev.INVALID>.
I am changing the email title to branch out and avoid distracting your main thread. Right, this is not a big deal, so let’s conclude quickly.

You know, ambiguity can only be resolved by defining the concepts. Otherwise, `persons` do not help either. What I proposed was to just define `accounts` as your previous concept of persons or unified users. The example in your last email was a wrong use of the concept (such as in “we introduce `people` or `persons` or `unified users` to link those `accounts` together” – you still used `account` to refer to Git emails or duplicate Git users.).

Now let’s switch to the new definition of account. Then there are two ways to handle a new commit email: (1) we can directly create a new account for it and later merge it into another account if it turns out to be a duplicate; (2) the commit emails are just modeled as `emails`, or left unlinked to any account, and they are linked to accounts whenever they can be.
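
A minimal Go sketch of option (1), under the assumption that a freshly created account simply reuses the commit email as its ID. `Registry`, `Observe`, and `Merge` are hypothetical names for illustration, not existing DevLake code.

```go
package main

import "fmt"

// Registry maps each commit email to the account it currently belongs to.
type Registry struct {
	accountOf map[string]string // commit email -> account ID
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{accountOf: make(map[string]string)}
}

// Observe implements option (1): an email that has not been seen before
// immediately gets its own account. Here the account ID is the email itself,
// which is an assumption made for this sketch.
func (r *Registry) Observe(email string) string {
	if id, ok := r.accountOf[email]; ok {
		return id
	}
	r.accountOf[email] = email
	return email
}

// Merge folds account `from` into account `into` once someone has decided
// that the two are duplicates. Every email of `from` is re-pointed.
func (r *Registry) Merge(from, into string) {
	for email, id := range r.accountOf {
		if id == from {
			r.accountOf[email] = into
		}
	}
}

func main() {
	r := NewRegistry()
	personal := r.Observe("alice@example.com")
	work := r.Observe("alice@work.example")
	fmt.Println("before merge:", personal, work)

	r.Merge(work, personal)
	fmt.Println("after merge:", r.Observe("alice@work.example"))
}
```

Option (2) would differ only in `Observe`: it would record the email without assigning any account, leaving the link empty until it can be established.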

Thanks,
Jinglei

From: Klesh Wong <kl...@apache.org>
Date: Tuesday, June 14, 2022 at 11:52 PM
To: dev@devlake.apache.org <de...@devlake.apache.org>
Subject: Re: [discuss] team entity design
I'm ok with any name as long as @Julien @Keon @Hezheng are ok with it.

As of `table.accounts`, I don't understand, how can it represents
`unified users` while it representing multiple accounts?

For example, we are collecting `commits` data by `gitextractor`, in
order to associate a specific `commit` to a specific account, what we
can do is creating an `account` with `commit.author_email` as PK.  But,
one might create commits with different email addresses, so we introduce
`people` or `persons` or `unified users` to link those `accounts` together.

Thanks,

Klesh Wong


Re: [discuss] team entity design

Posted by Hezheng Yin <yi...@gmail.com>.
Hi Klesh,

Thanks for initiating the discussion!

On the *first topic* of how to find a developer's commits when they commit
with multiple emails/names, I looked up GitHub's documentation
<https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/setting-your-commit-email-address>
on those "no-reply" emails:

[image: Screen Shot 2022-06-14 at 4.44.58 PM.png]

So it seems developers get different no-reply email addresses from
GitHub. If this is the case, then we need not be too concerned about many
developers sharing the same "no-reply@github.com" address, and
proposal 1.1 should work as long as the git extractor plugin actually
creates records in the `accounts` table (afaik, the git extractor plugin
doesn't populate the `accounts` table right now).
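
For reference, here is a small Go sketch of how such per-developer addresses could be recognized. It assumes the formats described in the GitHub documentation as I understand it, `ID+USERNAME@users.noreply.github.com` and the older `USERNAME@users.noreply.github.com`; the function name `ParseGitHubNoReply` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// noreplyDomain is the suffix GitHub uses for per-user no-reply addresses
// (assumed from the GitHub documentation linked above).
const noreplyDomain = "@users.noreply.github.com"

// ParseGitHubNoReply splits a per-user GitHub no-reply address into its
// optional numeric ID and its username. ok is false for any other address,
// including a shared address such as "no-reply@github.com".
func ParseGitHubNoReply(email string) (id, username string, ok bool) {
	lower := strings.ToLower(email)
	if !strings.HasSuffix(lower, noreplyDomain) {
		return "", "", false
	}
	local := strings.TrimSuffix(lower, noreplyDomain)
	if local == "" {
		return "", "", false
	}
	if i := strings.Index(local, "+"); i >= 0 {
		return local[:i], local[i+1:], true
	}
	return "", local, true
}

func main() {
	for _, e := range []string{
		"12345+octocat@users.noreply.github.com",
		"octocat@users.noreply.github.com",
		"no-reply@github.com",
	} {
		id, user, ok := ParseGitHubNoReply(e)
		fmt.Printf("%s -> id=%q user=%q ok=%v\n", e, id, user, ok)
	}
}
```

Because the username is part of the address, two developers using the no-reply option still end up with distinct `author_email` values.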

On the *second topic* of deciding the primary key of the `people` table (or
`persons` table as Jinglei suggested), I agree that Auto Increment ID
(proposal 2.1) is not a good idea, but not for the reasons listed in the
GitHub issue.


   1. "Tables have dependency to each other" => Many tables do depend on
   other tables in DevLake plugins, for example, `issue_changelogs` are
   populated after the `issues` table. Whether a dependency is desirable or
   not needs to be judged case-by-case.
   2. "It is harder to work with, when users try to link accounts together,
   either one has to memorize the mapping or check a dictionary" => This is
   true, but proposal 2.2 also suffers from this since a real-world `account`
   id from Jira might look like "Jira:JiraUser:1:73828" instead of having
   email embedded in it. Users would have no way to tell who that `account`
   represents without additional checks.

My objection to Auto Increment ID is that it's not idempotent, which you
mentioned in a separate thread. Idempotency is important for DevLake to
re-run plugins or re-build tables from scratch safely.
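
To illustrate the idempotency concern, here is a small Go sketch contrasting an insertion-order ID with a key derived only from the record itself. Both helpers are hypothetical, and the hash-based key is just one possible deterministic scheme; it is not meant to describe any of the proposals in the GitHub issue.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// autoIncrementIDs mimics an auto-increment primary key: each new name gets
// the next integer, so the ID depends on the order of insertion.
func autoIncrementIDs(names []string) map[string]int {
	ids := make(map[string]int)
	for _, n := range names {
		if _, ok := ids[n]; !ok {
			ids[n] = len(ids) + 1
		}
	}
	return ids
}

// deterministicID derives a key from the record's natural key only, so a
// rebuild from scratch yields the same value regardless of insertion order.
func deterministicID(naturalKey string) string {
	sum := sha256.Sum256([]byte(naturalKey))
	return hex.EncodeToString(sum[:8])
}

func main() {
	firstRun := autoIncrementIDs([]string{"alice", "bob"})
	rebuild := autoIncrementIDs([]string{"bob", "alice"})
	fmt.Println("auto-increment:", firstRun["alice"], "vs", rebuild["alice"])

	fmt.Println("deterministic:", deterministicID("alice"), "vs", deterministicID("alice"))
}
```

With the auto-increment scheme, any row in another table that stored the old ID for "alice" points at the wrong record after the rebuild; with a deterministic key it still matches.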

Proposal 2.2's problem is that it introduces a dependency between the
`accounts` table and the `persons` table. Since `person_id` is picked from
`account_id`, DevLake has to create the `persons` table after the
`accounts` table. This makes it impossible for users to import data into
the `persons` table (e.g., from their HR system) before importing data into
the `accounts` table.

Based on the above reasons, I'm leaning towards proposal 2.3.


On Tue, Jun 14, 2022 at 8:52 AM Klesh Wong <kl...@apache.org> wrote:

> I'm ok with any name as long as @Julien @Keon @Hezheng are ok with it.
>
> As for `table.accounts`, I don't understand: how can it represent
> `unified users` while also representing multiple accounts?
>
> For example, we collect `commits` data with `gitextractor`. To
> associate a specific `commit` with a specific account, we can create an
> `account` with `commit.author_email` as its PK. But one person might
> create commits with different email addresses, so we introduce `people`,
> `persons`, or `unified users` to link those `accounts` together.
>
> Thanks,
>
> Klesh Wong
>
> On 6/14/22 21:27, Jinglei Ren wrote:
> > Just a comment: `people` would be better as `persons`, to make it
> > consistent with other plural names as well as `person_teams`, etc.
> >
> > I see the reasons for this name, but I am still against `people` or
> > `persons` because our system should not model natural persons at all. In
> > some sense, it cannot, because you never know if it is a person or a dog :p
> > The key point is that we should consider the concept itself, not just
> > convenience of use.
> >
> > So, why not keep all types of user names as they are from different data
> > sources and just add `table.accounts` to represent the standard/unified
> > users?
> >
> > Thanks,
> > Jinglei
>

Re: [discuss] team entity design

Posted by Klesh Wong <kl...@apache.org>.
I'm ok with any name as long as @Julien @Keon @Hezheng are ok with it.

As for `table.accounts`, I don't understand: how can it represent
`unified users` while also representing multiple accounts?

For example, we collect `commits` data with `gitextractor`. To
associate a specific `commit` with a specific account, we can create an
`account` with `commit.author_email` as its PK. But one person might
create commits with different email addresses, so we introduce `people`,
`persons`, or `unified users` to link those `accounts` together.
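
A minimal sketch of that idea follows. The commit records, the email
addresses, the `account_person` link table and the `person-1` / `person-2`
ids are all made up for illustration; this is not DevLake's actual schema.

```python
from collections import Counter

# Hypothetical commits as a git extractor might see them.
commits = [
    {"sha": "a1", "author_email": "dev@work.example"},
    {"sha": "b2", "author_email": "dev@home.example"},
    {"sha": "c3", "author_email": "other@work.example"},
]

# One account per distinct author email, with the email as primary key.
accounts = {c["author_email"] for c in commits}

# Hypothetical link table: account (email) -> person.
# The first two emails belong to the same person.
account_person = {
    "dev@work.example": "person-1",
    "dev@home.example": "person-1",
    "other@work.example": "person-2",
}

# Aggregate commits by person instead of by account.
commits_per_person = Counter(
    account_person[c["author_email"]] for c in commits
)

print(sorted(accounts))
print(dict(commits_per_person))
```

Counting by account would show three authors with one commit each;
counting by person shows two authors, one of them with two commits.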

Thanks,

Klesh Wong

On 6/14/22 21:27, Jinglei Ren wrote:
> Just a comment: `people` would be better as `persons`, to make it consistent with other plural names as well as `person_teams`, etc.
>
> I see the reasons for this name, but I am still against `people` or `persons` because our system should not model natural persons at all. In some sense, it cannot because you never know if it is a person or a dog :p The key point is that we should consider the concept itself, not just convenience of use.
>
> So, why not keep all types of user names as they are from different data sources and just add `table.accounts` to represent the standard/unified users?
>
> Thanks,
> Jinglei

Re: [discuss] team entity design

Posted by Jinglei Ren <ji...@merico.dev.INVALID>.
Just a comment: `people` would be better as `persons`, to make it consistent with other plural names as well as `person_teams`, etc.

I see the reasons for this name, but I am still against `people` or `persons` because our system should not model natural persons at all. In some sense, it cannot because you never know if it is a person or a dog :p The key point is that we should consider the concept itself, not just convenience of use.

So, why not keep all types of user names as they are from different data sources and just add `table.accounts` to represent the standard/unified users?

Thanks,
Jinglei

Re: [discuss] team entity design

Posted by Yingchu Chen <Yi...@merico.dev.INVALID>.
+1 Proposal 2.3
+1 Proposal 1.1. From what I know, in commits, the author_id is the same as the author_email.

Re: [discuss] team entity design

Posted by Kaiyun Zhang <ka...@merico.dev.INVALID>.
It seems that most committers prefer 'user journey 3' over 'user journey 2', although the two overlap.

If there are no objections, we will start supporting 'user journey 3'. Technical decisions made in the future should be able to support this flow.

Cheers.

Re: [discuss] team entity design

Posted by Kaiyun Zhang <ka...@merico.dev.INVALID>.
Hi team,


From a product point of view, there might be 3 possible user journeys.



User Journey 1

  1.  User downloads a CSV file that contains all the data from the accounts table, plus an extra field named person_id
  2.  User goes through each record and makes sure the person_ids are correctly filled in
     *   User may sort the spreadsheet by email and set a unique id for each group of consecutive account records
     *   User may then sort records by name, to deal with records without an email
  3.  User uploads the CSV file and is done

This is a user journey without pre-mapping logic, which means a user has to do the 'account-person-team' mapping from scratch. It was written by @Klesh and is under 'proposal 2.2' in issue #1680 (https://github.com/apache/incubator-devlake/issues/1680#issuecomment-1153588720). I pasted it here for convenience.


User Journey 2
Similar to user journey 1, except that the CSV file downloaded in step 1 contains person_ids generated by pre-mapping logic based ONLY on the records in table 'accounts'.
The pre-mapping logic would be: accounts with the same email, and accounts with the same name, get the same person_id.
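
One possible reading of this pre-mapping rule, as a rough sketch: treat
"same email" and "same name" as links between accounts and give every
connected group one person_id. The case-insensitive comparison, the integer
group numbers and the `pre_map` helper are assumptions for illustration,
not an agreed design. The sample rows reuse the accounts from the example
table later in this message.

```python
def pre_map(accounts):
    """Group accounts that share an email or a name.

    Returns a dict of account id -> group number (a stand-in person_id).
    """
    parent = {acc["id"]: acc["id"] for acc in accounts}

    def find(x):
        # Follow parent links to the group's representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(b)] = find(a)

    first_seen = {}
    for acc in accounts:
        for field in ("email", "name"):
            value = acc.get(field)
            if not value:
                continue  # an empty email or name must not link accounts
            key = (field, value.strip().lower())
            if key in first_seen:
                union(first_seen[key], acc["id"])
            else:
                first_seen[key] = acc["id"]

    # Number groups in order of first appearance so reruns give the same result.
    group_no = {}
    result = {}
    for acc in accounts:
        root = find(acc["id"])
        group_no.setdefault(root, len(group_no) + 1)
        result[acc["id"]] = group_no[root]
    return result


accounts = [
    {"id": "jira:JiraUser:1:asdf", "name": "Zhenmian Huang", "email": ""},
    {"id": "jira:JiraUser:1:zhenmian.huang@merico.dev", "name": "Zhenmian Huang",
     "email": "zhenmian.huang@merico.dev"},
    {"id": "github:GithubUser:1:{GitHub_user_id_1}", "name": "Zhenmian Huang",
     "email": "zhenmian.huang@merico.dev"},
    {"id": "github:GithubUser:1:{GitHub_user_id_2}", "name": "Klesh Wong",
     "email": "kleshwong@gmail.com"},
    {"id": "kleshwong@gmail.com", "name": "Klesh Wong",
     "email": "kleshwong@gmail.com"},
]

mapping = pre_map(accounts)
for account_id, person_no in mapping.items():
    print(person_no, account_id)
```

On this data the rule produces two groups, because nothing in the
'accounts' table alone links the 'Klesh Wong' accounts to the
'Zhenmian Huang' ones.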


User Journey 3


  1.  User uploads a CSV file or uses DevLake to retrieve 'persons' information directly from their HR system. As a result, tables 'persons' and 'person_teams' are created.
     *   The 'persons' table SHOULD include each natural person's name and team.
     *   The 'persons' table MAY contain a unique identifier for each person if it comes from an HR system.
     *   After the upload, DevLake would do auto-mapping based on BOTH the records in table 'accounts' and those in table 'persons'.
  2.  User downloads a file that contains all the data from the 'accounts' table, with extra fields named person_id, person_email, and person_name populated by the auto-mapping logic.
  3.  User goes through each record and makes sure the person_ids are correctly filled in.
     *   person_email and person_name are there for reference in case person_id doesn't have a clear meaning, e.g. when person_ids are UIDs from an HR system.
     *   User may search for accounts without person_ids, copy and paste the sheet into an online doc, and let people in the company fill it out.
  4.  User uploads the CSV file and is done
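
A rough sketch of what the auto-mapping in journey 3 could look like. The
matching rule (email first, then name, both case-insensitive), the
`auto_map` helper and the `hr-001` id are assumptions for illustration; the
actual rule has not been decided in this thread. Accounts that match no
person keep empty person_* fields, so the user can find them in step 3.

```python
def auto_map(accounts, persons):
    """Attach person_id, person_email and person_name to each account row."""
    by_email = {p["email"].lower(): p for p in persons if p.get("email")}
    by_name = {p["name"].lower(): p for p in persons if p.get("name")}

    rows = []
    for acc in accounts:
        email = (acc.get("email") or "").lower()
        name = (acc.get("name") or "").lower()
        # Prefer an email match; fall back to a name match.
        person = by_email.get(email) or by_name.get(name)
        rows.append({
            **acc,
            "person_id": person["id"] if person else "",
            "person_email": person["email"] if person else "",
            "person_name": person["name"] if person else "",
        })
    return rows


# Hypothetical 'persons' upload, e.g. exported from an HR system.
persons = [
    {"id": "hr-001", "name": "Zhenmian Huang",
     "email": "zhenmian.huang@merico.dev"},
]

accounts = [
    {"id": "jira:JiraUser:1:asdf", "name": "Zhenmian Huang", "email": ""},
    {"id": "github:GithubUser:1:{GitHub_user_id_1}", "name": "Zhenmian Huang",
     "email": "zhenmian.huang@merico.dev"},
    {"id": "github:GithubUser:1:{GitHub_user_id_2}", "name": "Klesh Wong",
     "email": "kleshwong@gmail.com"},
]

rows = auto_map(accounts, persons)
for row in rows:
    print(row["id"], "->", row["person_id"] or "(unmapped)")
```

The unmapped 'Klesh Wong' account is exactly the kind of row the user
would resolve by hand, with the 'persons' list as a reference.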

Compare user journey 2 with 3

I broke down the whole 'account-person-team' mapping task into several sub-tasks, and compared their difficulty in journeys 2 and 3.
1. Pre-requisite work

  1.  Journey 2: none
  2.  Journey 3: upload a CSV of 'persons' info

2. The difference in 'account-person' auto-mapping results

  1.  Journey 2: users have to make more changes. Fewer 'accounts' have pre-populated teams than in Journey 3. As there are no 'persons' for the pre-mapping logic to reference, it is also less accurate.
  2.  Journey 3: users have to make fewer changes.

3. The difficulty of manually finishing account-person mapping

  1.  Journey 2: users would have to keep looking elsewhere for 'persons' and 'person_teams' info as a reference when filling out the person_id field in the 'accounts' CSV. Without this reference, they might assign different accounts of the same person to different 'persons', which will affect the final stats.
  2.  Journey 3: users have a clearer expectation (the person_id that an account_id maps to is in the 'persons' table) and a greater sense of control, since they have defined 'persons' and 'person_teams' info as a reference.
  3.  For example, there are 5 account_ids in the table below, and all 5 accounts belong to the same person, with person_name = 'Zhenmian Huang' and person_primary_email = 'zhenmian.huang@merico.dev'. A user in journey 3 knows that only ONE person with person_name = 'Zhenmian Huang' and person_primary_email = 'zhenmian.huang@merico.dev' exists. Therefore, when manually mapping account_id 'github:GithubUser:1:{GitHub_user_id_2}' to a person_id, they will see that 'Klesh Wong' might be a nickname of one of the existing persons, and are then more likely to ask a colleague who 'Klesh Wong' is.
  4.  A user in journey 2, by contrast, doesn't know that there is only ONE person with person_name = 'Zhenmian Huang' and person_primary_email = 'zhenmian.huang@merico.dev'. When mapping each account_id to a person_id, they may think that 'github:GithubUser:1:{GitHub_user_id_2}' and 'kleshwong@gmail.com' belong to another person, with person_name = 'Klesh Wong' and person_primary_email = 'kleshwong@gmail.com'.

id                                         name            email                      ....
jira:JiraUser:1:asdf                       Zhenmian Huang                             jira cloud
jira:JiraUser:1:zhenmian.huang@merico.dev  Zhenmian Huang  zhenmian.huang@merico.dev  jira server
github:GithubUser:1:{GitHub_user_id_1}     Zhenmian Huang  zhenmian.huang@merico.dev  github account1
github:GithubUser:1:{GitHub_user_id_2}     Klesh Wong      kleshwong@gmail.com        github account2
kleshwong@gmail.com                        Klesh Wong      kleshwong@gmail.com        git

4. New people come to the org./company

  1.  Journey 2: download a CSV of ‘accounts’ and redo journey 2
  2.  Journey 3: re-upload a CSV of ‘persons’ info and redo journey 3

5. Move people to a new team

  1.  Journey 2: manually correct every 'team' associated with the same person_id, or use SQL to update table 'person_teams' if not many people are moving.
  2.  Journey 3: re-upload a CSV of 'persons' info, or use SQL to update table 'person_teams' if not many people are moving.

6. People leave the company. Users do not want to show these people's stats

  1.  Journey 2: another field, 'is_active', should be added to the 'accounts' CSV. Users have to manually change these people's 'is_active' to FALSE, or use SQL to mark the person as deleted in table 'persons'.
  2.  Journey 3: users might need to upload a CSV of all 'persons' with each person's 'is_active' status.




Conclusion
User journeys 2 and 3 both have pros and cons. Neither is significantly better than the other.

To ensure that both user journeys work in the future, the technical solution should support both Journey 2 and Journey 3 in the long term.

However, if we are going to choose a technical solution that supports ONLY ONE user journey in the short term, I personally choose journey 3 because of the 3rd point above: manually finishing the account-person mapping is easier in journey 3 than in journey 2.






