Posted to dev@systemds.apache.org by Markus Reiter-Haas <is...@gmail.com> on 2021/02/21 16:26:27 UTC

PR-specific Question on Tokenizer regarding Number of non-zeros mismatch

Dear SystemDS developers!

I have created a reference implementation for a tokenizer in
https://github.com/apache/systemds/pull/1169 .
There is one consideration I would like to get some input on.

When representing the tokens in long format (i.e., a transformation
that expands along rows: (rows: n, maxTokens: m, idCols: k) ->
(m*n, k+2)), I get the following error in a follow-up
`transformencode`:
    Job aborted due to stage failure: Task 0 in stage 10.0 failed 1
    times, most recent failure: Lost task 0.0 in stage 10.0 (TID 18,
    localhost, executor driver):
    org.apache.sysds.runtime.DMLRuntimeException: Number of non-zeros
    mismatch on merge disjoint (target=1000x4, nnz target=4000, nnz
    source=3992)
Unfortunately, I have not been able to fix this bug, since it does
not occur in `tokenize` itself.
However, I have since implemented a wide format (i.e., a
transformation that expands along columns: (rows: n, maxTokens: m,
idCols: k) -> (n, m+k)), with which I could not reproduce the issue.
The current state of the PR uses this format in the test cases and
passes all checks.
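
To make the two layouts concrete, here is a minimal, self-contained
sketch in plain Python (independent of SystemDS; the data and names
are purely illustrative, not the PR's actual test setup):

    # n=2 documents, k=1 id column, maxTokens m=3
    docs = [("d1", "the quick fox"), ("d2", "hello world")]
    m = 3

    # long format: one (id, position, token) row per token,
    # i.e., up to (m*n) x (k+2) rows in total
    long_rows = [(doc_id, pos + 1, tok)
                 for doc_id, text in docs
                 for pos, tok in enumerate(text.split()[:m])]

    # wide format: one row per document, tokens padded to m columns,
    # i.e., n x (k+m)
    wide_rows = [[doc_id] + (text.split()[:m] + [""] * m)[:m]
                 for doc_id, text in docs]

    print(long_rows)  # 5 rows: "d2" fills only 2 of its 3 slots
    print(wide_rows)  # [['d1','the','quick','fox'],
                      #  ['d2','hello','world','']]

Note that documents with fewer than m tokens yield fewer than m*n
long-format rows (5 instead of 6 here), whereas the wide format is
always padded to a fixed shape.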

My specific questions are:
1. Does anyone know what the issue could be or how it could be fixed?
2. Conversely, why does the issue not occur with the wide format? (I
want to ensure that the code indeed works rather than just hiding the
error.)
3. Should I drop support for the long format to circumvent the issue?

Thanks and best regards,
Markus

-- 
Dipl.-Ing. Markus Reiter-Haas, BSc

University Assistant, PhD Student
Social Computing Lab
Institute of Interactive Systems and Data Science, TU Graz

Re: PR-specific Question on Tokenizer regarding Number of non-zeros mismatch

Posted by Matthias Boehm <mb...@gmail.com>.
please only comment out the Spark tests so the test suite runs
through; I'll use the existing tests to reproduce the issue in the
next few weeks and fix it if necessary (before our SystemDS 2.1
release).

Regards,
Matthias

Re: PR-specific Question on Tokenizer regarding Number of non-zeros mismatch

Posted by Markus Reiter-Haas <is...@gmail.com>.
Thanks for the quick response, and also thanks for the tip regarding
the wide format.

Regarding keeping only the single-node implementation: should I
remove the Spark execution altogether, implement a kind of switch to
disable it, or only remove the tests for it?
I am asking because parts of the tokenizer code are only relevant for
the Spark execution (i.e., estimating the dimensions and using a
maximum number of tokens) and influenced some design decisions (e.g.,
not creating a lookup dictionary).

Best regards,
Markus

Re: PR-specific Question on Tokenizer regarding Number of non-zeros mismatch

Posted by Matthias Boehm <mb...@gmail.com>.
thanks for asking here; I would separate a couple of things:

1) There is always a chance of bugs, so let's merge in the
single-node implementation and disable the Spark tests for this
feature. Then we can try to reproduce any remaining issues and help
fix them.

2) The error is raised whenever we merge two blocks A and B into C
and either nnz(C) != nnz(A) + nnz(B) or length(C) < nnz(A) + nnz(B),
because the merge is supposed to combine disjoint cells. This error
usually means that either blocks got wrong block ids and were thus
incorrectly merged with other blocks (and so overwrote cells), or the
previous operations did not maintain the NNZs correctly (see the
sketch after this list).

3) Keep both and we'll help figure it out.
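
As a simplified illustration of the invariant in 2), here is a small
sketch in plain Python (illustrative only; the actual check lives in
the Java block-merge code of the SystemDS runtime):

    # sparse blocks modeled as {(row, col): value}; nnz_a and nnz_b
    # are the non-zero counts the blocks claim to have in their
    # metadata
    def merge_disjoint(a, b, nnz_a, nnz_b):
        c = {**a, **b}  # an overlapping cell is silently overwritten
        if len(c) != nnz_a + nnz_b:
            # either the blocks overlapped (wrong block ids) or an
            # earlier operation did not maintain the claimed nnz
            raise RuntimeError(
                "Number of non-zeros mismatch on merge disjoint")
        return c

    merge_disjoint({(0, 0): 1.0}, {(0, 1): 2.0}, 1, 1)  # ok: disjoint
    try:
        merge_disjoint({(0, 0): 1.0}, {(0, 0): 2.0}, 1, 1)
    except RuntimeError as e:
        print(e)  # overlap on cell (0, 0) triggers the mismatch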

Regards,
Matthias
