Posted to dev@mxnet.apache.org by Qing Lan <la...@live.com> on 2018/09/19 18:04:37 UTC

Some feedback from MXNet Zhihu topic

Hi all,

There was a trending topic<https://www.zhihu.com/question/293996867> on Zhihu (a well-known Chinese site, roughly Stack Overflow plus Quora) recently, asking about the status of MXNet in 2018. Mu replied to the thread and received more than 300 likes.
However, the comments on the thread raise a few concerns. I have done a rough translation from Chinese to English:

1. Documentation! Even now, the online docs still contain:
    1. Deprecated docs that have not been updated
    2. Incorrect documentation with poor descriptions
    3. Documentation for alpha-stage features, such as ones that only run after installing with `pip install --pre`.

2. Examples! For Gluon specifically, many examples still mix Gluon and older MXNet APIs. The mixture of mx.sym, mx.nd, and mx.gluon leaves users unsure which one to choose to get their model working. For example, although Gluon makes data encapsulation possible, there are still examples that use mx.io.ImageRecordIter with tens of parameters (it feels like the Gluon examples were simply copied from the old Python examples).

3. Examples again! Compared to PyTorch, there are a few kinds of Gluon examples I don't like:
    1. Runnable, but the code structure is still very complicated, such as example/image-classification/cifar10.py. It reads like code concatenated end to end; in fact, it is just a series of layers mixed with model.fit. This makes it very hard for users to modify or extend the model.
    2. Runnable only with certain settings. If users change the model even slightly, it crashes. For example, in the multi-GPU example on the Gluon website, MXNet hides the logic that uses the batch size to adjust the learning rate in the optimizer. Many newcomers don't know this and only find that the model stops converging when the batch size changes.
    3. Worst of all, the model simply doesn't work. Maintainers in the MXNet community merged the code directly without running the model (there is not even an integration test). The script stays broken until somebody raises an issue and fixes it.

4. The community problem. The core advantage of MXNet is its scalability and efficiency. However, the documentation of some tools is confusing. Here are two examples:

    1. im2rec comes in two versions, C++ (binary) and Python. Nobody would expect the command-line arguments of the two tools to differ (and there are no suitable examples to compare against, so users can only guess at the usage).

    2. How do you combine MXNet's distributed platform with a supercomputing tool such as Slurm? How do you profile and debug it? A couple of companies I know considered using MXNet for distributed training. Due to the lack of examples and poor support from the community, they had to move their models to TensorFlow and Horovod.

5. The heavy code base. Most of the MXNet examples, source code, documentation, and language bindings live in a single repo. A git clone costs tens of MB. New-feature PRs take longer than expected. The poor review responsiveness and review rules keep new contributors away from the community. I remember there was a call for documentation improvements last year. It took one user three months in total to get a change merged into master. That is almost a full PyTorch release interval.

6. To developers. Very few people in the community discuss the improvements we could make to MXNet to make it more user-friendly. It is far too easy to trigger tens of stack issues while coding. Again, is it a requirement for MXNet users to be familiar with C++? The connection between Python and C lacks IDE lint support (maybe MXNet assumes every developer is a Vim master). The API and the underlying implementation change frequently, so people have to release their code with an archived version of MXNet (as TuSimple and MSRA do). Compare PyTorch, where even an API for moving a tensor to a device prompts a thorough discussion.
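
To make the batch-size concern in point 3 concrete, here is a minimal pure-Python sketch (not MXNet code). It assumes the behaviour described in the Gluon documentation for Trainer.step(batch_size), where the gradient is normalized by 1/batch_size; the helper `sgd_step`, its parameters, and the toy gradient values are hypothetical and exist only for illustration.

```python
def sgd_step(weight, per_sample_grads, lr, batch_size):
    """One plain SGD update with the summed gradient divided by batch_size.

    Hypothetical helper: it mimics only the 1/batch_size rescaling,
    not any real MXNet optimizer.
    """
    summed_grad = sum(per_sample_grads)
    return weight - lr * summed_grad / batch_size


# Toy setup: every sample contributes the same gradient of 0.5.
grads_64 = [0.5] * 64
grads_256 = [0.5] * 256

# Passing the true batch size averages the gradient, so the update size
# stays the same when the batch grows from 64 to 256.
w_a = sgd_step(1.0, grads_64, lr=0.1, batch_size=64)
w_b = sgd_step(1.0, grads_256, lr=0.1, batch_size=256)

# Changing the data batch but leaving batch_size at the old value makes the
# step four times larger, which acts like a four times larger learning rate.
w_c = sgd_step(1.0, grads_256, lr=0.1, batch_size=64)

print(w_a, w_b, w_c)
```

If an example hard-codes this coupling, a user who changes only the data batch size silently changes the effective learning rate, which would be consistent with the convergence problems described above.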

There will be more comments translated to English and I will keep this thread updated…
Thanks,
Qing

Re: Some feedback from MXNet Zhihu topic

Posted by Carin Meier <ca...@gmail.com>.

Totally agree about the potentially huge benefit of new research papers
having implementation examples in MXNet. Does anyone have ideas about how
to facilitate or encourage this?

Also wanted to note that I think the recent progress on and attention to
stability will help speed up both the PR process and the release cycle.
There is more work to do in this area, especially with regard to automating
the release, which I think will pay big dividends down the road. Let's keep
up the good work in this area.

- Carin

On Thu, Sep 20, 2018 at 4:10 AM Naveen Swamy <mn...@gmail.com> wrote:

> Qing,
>
> This is a loaded set of very specific suggestions. Thank you for bringing
> it up here. Since Apache MXNet is popular in China, it would be great if
> Mandarin-speaking developers here could bring such feedback and user pain
> to the community's attention.
>
> 1. To capture the specific API/example/tutorial that users have an issue
> with, Mu suggested in the past adding thumbs up/down to the website:
> https://issues.apache.org/jira/browse/MXNET-972
>
> 5. The heavy code base is not because of the code in the MXNet repo; it's
> all the submodules that are added to the repo. I have had this problem
> too: to build MXNet I have to fetch and build the whole world that MXNet
> depends on, and its dependencies (submodules within submodules). I think
> it's time to revisit and refactor.
>
> For the others, I suggest you work with someone to create actionable JIRAs
> (maybe Denis, because he is knowledgeable about JIRA and creates nice
> actionable stories). It would be nice if these stories contained many
> good-first-issue tasks for new contributors to pick up. Creating
> standalone examples (from existing ones) is a great way for newbies to
> learn MXNet and contribute back.
>
> Examples are very important for someone not only to learn quickly but also
> to extend/adapt to their own application. In Scala we (you) have added
> tests around examples and actually use them as integration tests. We
> should insist on the same for new examples and for old examples that we
> touch.
>
> In deep learning, what is more critical and could drive rapid adoption is
> having the latest and greatest papers implemented as examples. This is a
> call to the community for suggestions and action.
>
> Thanks, Naveen

Re: Some feedback from MXNet Zhihu topic

Posted by Tianqi Chen <tq...@cs.washington.edu>.

Thanks for the great feedback. I want to point out, though, that the cost
of building MXNet comes mainly from the operators that sit in the MXNet
repo, rather than from its submodules.

Tianqi



Re: Some feedback from MXNet Zhihu topic

Posted by Naveen Swamy <mn...@gmail.com>.

Qing,

This is a loaded set of very specific suggestions. Thank you for bringing
it up here. Since Apache MXNet is popular in China, it would be great if
Mandarin-speaking developers here could bring such feedback and user pain
to the community's attention.

1. To capture the specific API/example/tutorial that users have an issue
with, Mu suggested in the past adding thumbs up/down to the website:
https://issues.apache.org/jira/browse/MXNET-972

5. The heavy code base is not because of the code in the MXNet repo; it's
all the submodules that are added to the repo. I have had this problem
too: to build MXNet I have to fetch and build the whole world that MXNet
depends on, and its dependencies (submodules within submodules). I think
it's time to revisit and refactor.

For the others, I suggest you work with someone to create actionable JIRAs
(maybe Denis, because he is knowledgeable about JIRA and creates nice
actionable stories). It would be nice if these stories contained many
good-first-issue tasks for new contributors to pick up. Creating
standalone examples (from existing ones) is a great way for newbies to
learn MXNet and contribute back.

Examples are very important for someone not only to learn quickly but also
to extend/adapt to their own application. In Scala we (you) have added
tests around examples and actually use them as integration tests. We
should insist on the same for new examples and for old examples that we
touch.
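
As a rough illustration of using examples as integration tests, here is a
minimal Python sketch of a smoke test that runs each example script in a
fresh interpreter and checks its exit status. The helper names
(run_example, example_passes), the timeout value, and the stand-in scripts
are assumptions made for this sketch; they are not taken from the MXNet
repo or from the Scala tests mentioned above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_example(script_path, timeout_seconds=600):
    """Run an example script in a fresh interpreter; return the completed process."""
    return subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )


def example_passes(script_path):
    """An example passes this smoke test if it exits with status 0."""
    return run_example(script_path).returncode == 0


# Stand-in scripts so the sketch runs without MXNet installed.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good_example.py"
    good.write_text("print('trained for 1 epoch')\n")
    bad = Path(tmp) / "bad_example.py"
    bad.write_text("raise RuntimeError('model failed to build')\n")
    good_ok = example_passes(good)
    bad_ok = example_passes(bad)

print(good_ok, bad_ok)
```

A check like this only proves that a script still runs end to end; it says
nothing about accuracy, so real example tests would likely also assert on
a small, fast training run.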

In deep learning, what is more critical and could drive rapid adoption is
having the latest and greatest papers implemented as examples. This is a
call to the community for suggestions and action.

Thanks, Naveen


On Wed, Sep 19, 2018 at 10:39 PM, Aaron Markham <aa...@gmail.com>
wrote:

> Thanks for this translation and feedback, Qing!
> I've addressed point 3 of the documentation feedback with this PR:
> https://github.com/apache/incubator-mxnet/pull/12604
> I'm not sure how to address the first two points without some explicit
> URLs and examples, so if anyone has those I'd be happy to take a look and
> see whether it's a glitch or missing or wrong docs.
>
> Also, I agree that there should be more simple examples. Oftentimes the
> examples are too complicated and unclear about what is important and what
> is not. They target deep learning practitioners, not "newbies".
>
> And on a related note, I'd really like to pull the Gluon stuff into the API
> section. As its own navigation item it is confusing, and its information
> is orphaned. It could have a navigation entry at the top of the API list
> like "Python: Gluon" or just "Gluon", followed by "Python: Module" or just
> "Python". Or, going the other way, the Gluon menu could have API and
> Tutorials entries and be more fleshed out, though this is not my
> preference. Either way, it needs some attention.
>
> Cheers,
> Aaron

Re: Some feedback from MXNet Zhihu topic

Posted by Aaron Markham <aa...@gmail.com>.

Thanks for this translation and feedback, Qing!
I've addressed point 3 of the documentation feedback with this PR:
https://github.com/apache/incubator-mxnet/pull/12604
I'm not sure how to address the first two points without some explicit
URLs and examples, so if anyone has those I'd be happy to take a look and
see whether it's a glitch or missing or wrong docs.

Also, I agree that there should be more simple examples. Oftentimes the
examples are too complicated and unclear about what is important and what
is not. They target deep learning practitioners, not "newbies".

And on a related note, I'd really like to pull the Gluon stuff into the API
section. As its own navigation item it is confusing, and its information
is orphaned. It could have a navigation entry at the top of the API list
like "Python: Gluon" or just "Gluon", followed by "Python: Module" or just
"Python". Or, going the other way, the Gluon menu could have API and
Tutorials entries and be more fleshed out, though this is not my
preference. Either way, it needs some attention.

Cheers,
Aaron

On Wed, Sep 19, 2018 at 11:04 AM Qing Lan <la...@live.com> wrote:

> Hi all,
>
> There was a trend topic<https://www.zhihu.com/question/293996867> in
> Zhihu (a famous Chinese Stackoverflow+Quora) asking about the status of
> MXNet in 2018 recently. Mu replied the thread and obtained more than 300+
> `like`.
> However there are a few concerns addressed in the comments of this thread,
> I have done some simple translation from Chinese to English:
>
> 1. Documentations! Until now, the online doc still contains:
>                 1. Depreciated but not updated doc
>                 2. Wrong documentation with poor description
>                 3. Document in Alpha stage such as you must install `pip
> –pre` in order to run.
>
> 2. Examples! For Gluon specifically, many examples are still mixing
> Gluon/MXNet apis. The mixure of mx.sym, mx.nd mx.gluon confused the users
> of what is the right one to choose in order to get their model to work. As
> an example, Although Gluon made data encapsulation possible, still there
> are examples using mxn.io.ImageRecordIter with tens of params (feels like
> gluon examples are simply the copy from old Python examples).
>
> 3. Examples again! Comparing to PyTorch, there are a few examples I don't
> like in Gluon:
>                 1. Available to run however the code structure is still
> very complicated. Such as example/image-classification/cifar10.py. It
> seemed like a consecutive code concatenation. In fact, these are just a
> series of layers mixed with model.fit. It makes user very hard to
> modify/extend the model.
>                 2. Only available to run with certain settings. If users
> try to change a little bit in the model, crashes will happen. For example,
> the multi-gpu example in Gluon website, MXNet hide the logic that using
> batch size to change learning rate in a optimizer. A lot of newbies didn't
> know this fact and they would only find that the model stopped converging
> when batch size changed.
>                 3. The worst scenario is the model itself just simply
> didn't work. Maintainers in the MXNet community didn't run the model (even
> no integration test) and merge the code directly. It makes the script not
> able run till somebody raise the issues and fix it.
>
> 4. The community problem. The core advantage of MXNet is its scalability
> and efficiency. However, the documentation of some tools is confusing.
> Here are two examples:
>
>                 1. im2rec comes in 2 versions, C++ (binary) and Python.
> But nobody would have thought that the argument parsing (argparse) in these
> tools is different (meanwhile, there are no appropriate examples to compare
> with, so users can only guess the usage).
>
>                 2. How do we combine the MXNet distributed platform with
> supercomputing tools such as Slurm? How do we do profiling, and how do we
> debug? A couple of companies I know thought of using MXNet for distributed
> training. Due to the lack of examples and poor support from the community,
> they had to move their models to TensorFlow and Horovod.
>
> 5. The heavy code base. Most of the MXNet examples/source
> code/documentation/language bindings are in a single repo. A git clone
> operation costs tens of MB. A new feature PR takes longer than expected.
> The poor review response / rules keep new contributors away from the
> community. I remember there was a call for documentation improvements last
> year. It took one user 3 months in total to get a change merged into
> master. That is almost equal to a release interval of PyTorch.
>
> 6. To developers. Very few people in the community discuss the
> improvements we could make to make MXNet more user-friendly. It is far too
> easy to trigger tens of stack issues during coding. Again, is it a
> requirement for MXNet users to be familiar with C++? The connection between
> Python and C lacks IDE lint support (maybe MXNet assumes every developer is
> a Vim master). The API/underlying implementation changes frequently. People
> have to release their code with an archived version of MXNet (such as
> TuSimple and MSRA). Take a look at PyTorch: even an API used to move a
> tensor to a device gets a thorough discussion.
>
> There will be more comments translated to English and I will keep this
> thread updated…
> Thanks,
> Qing
>

Re: Some feedback from MXNet Zhihu topic

Posted by Foivos Diakogiannis <ph...@gmail.com>.

Hi Kellen (and all),

thank you for your reply. I think there is a lack of C++ documentation, and
the existing documentation could be improved. I will try to explain with
examples the difficulty I have with the C++ API.

The problem I face with the C++ documentation is that the only thing I see
as documentation is a Doxygen-generated list of namespaces/classes/files.
That is, the first thing I see when I click on the C++ API is this:

[image: image.png]

This doesn't tell me much about the underlying connections between
classes, or how mxnet is designed, etc. (a huge difference from what I see
when I click on the Python API). I could equally well go directly into the
source code. The mxnet CPP package is a GitHub repository that has all the
source code and several C++ examples. When I look into them, I see examples
of usage. They make sense as I read along, but the main barrier is that I
do not have an underlying understanding, a diagram, of how things are
connected. This is in contrast with the Python documentation, where, say
for Gluon, I have "under the hood" examples, and I can look at the source
code and understand how things are connected (most times, and up to a
level). In addition, for Python, the descriptions of objects and methods
are sometimes accompanied by examples. So, say for example I want to
create a 2nd order gradient in C++, modify the source code and expose it
to Python. Where can I find info/docs for that? Where is the
differentiation defined in the C++ code? This is not evident from the
documentation, at least not to me. In contrast, I can easily find the
mxnet.autograd Python package with custom function definitions.
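
As a rough illustration of what a "2nd order gradient" means here, the
following plain-Python sketch uses a toy one-parameter function and a
gradient-penalty-style term. It is not MXNet code and does not use
mxnet.autograd; the names f, df_dx, penalty, dpenalty_dw and numeric_dw are
made up for this example. The point is only that differentiating a penalty
on df/dx with respect to the weight w requires the mixed second derivative
d2f/(dx dw), which is what a framework has to provide automatically.

```python
import math


def f(x, w):
    """Toy one-parameter model: f(x) = tanh(w * x)."""
    return math.tanh(w * x)


def df_dx(x, w):
    """First-order gradient of f with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def penalty(x, w):
    """Gradient-penalty-style term: (|df/dx| - 1)^2."""
    return (abs(df_dx(x, w)) - 1.0) ** 2


def dpenalty_dw(x, w):
    """Analytic d(penalty)/dw, which needs the mixed derivative d2f/(dx dw)."""
    t = math.tanh(w * x)
    g = w * (1.0 - t * t)  # df/dx
    d2f_dxdw = (1.0 - t * t) - 2.0 * w * x * t * (1.0 - t * t)
    sign = 1.0 if g >= 0 else -1.0
    return 2.0 * (abs(g) - 1.0) * sign * d2f_dxdw


def numeric_dw(x, w, eps=1e-6):
    """Central finite difference of the penalty with respect to w."""
    return (penalty(x, w + eps) - penalty(x, w - eps)) / (2.0 * eps)


if __name__ == "__main__":
    # The analytic and numeric values should agree closely.
    print(dpenalty_dw(0.5, 0.8), numeric_dw(0.5, 0.8))
```

Here the second derivative was written out by hand, which is only feasible
for a toy function; for a real network this is exactly the part that needs
higher order gradient support in the framework.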

Let's look at some examples from other C++ libraries. Boost, for instance,
has full documentation for its various packages, most of the time starting
from "hello world" constructs, e.g. accumulators (something similar could
exist for NDArray or Symbol):
https://www.boost.org/doc/libs/1_68_0/doc/html/accumulators.html or Boost
C++ to Python etc.
Another example is the documentation of the Evolving Objects library (
http://eodev.sourceforge.net/), which has a more "enriched" Doxygen
documentation (http://eodev.sourceforge.net/eo/doc/html/index.html), and a
basic tutorial (although outdated -
http://eodev.sourceforge.net/eo/tutorial/html/eoTutorial.html) that
describes the underlying structure and ties the different objects
together.


I think the documentation for C++ could be improved in the following ways:
a) Add a description of the underlying connections between the objects
within the mxnet library. An overall "big picture".
b) Add examples of modifying the C++ source code to add features. This
could perhaps be the most useful for user contributions. Such examples
already exist for designing custom layers in Python with the Gluon API.

c) Optional but most useful: add a narrative document (built on the
existing examples) that explains the source code and classes used in more
detail and "connects the dots" between the underlying structures, just as
is done for the Python examples, or for other C++ libraries' examples.
There are pages where they explain the source code as they go along, and
then, in the end, there is a link to the full code. I understand that the
intended usage of the C++ code is to be called from Python wrappers, and
perhaps this would not be a good time investment for the project.


Again a huge thank you for this awesome library and all the work you've put
into it. All the above comments are with the best intentions.

Kind regards,
Foivos



On Tue, Sep 25, 2018 at 10:47 PM kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> Thanks for the detailed feedback Foivos.  Just want to dig into the C++
> comment.
>
> Could you give any more details on how we could improve the readability?
> Would modernizing the codebase and trying to provide a consistent code style
> help?  With regard to documentation, was it that it's mostly lacking, or that
> the existing documentation could be improved (or both)?
>
> -Kellen
>

Re: Some feedback from MXNet Zhihu topic

Posted by kellen sunderland <ke...@gmail.com>.

Thanks for the detailed feedback Foivos.  Just want to dig into the C++
comment.

Could you give any more details on how we could improve the readability?
Would modernizing the codebase and trying to provide a consistent code style
help?  With regard to documentation, was it that it's mostly lacking, or that
the existing documentation could be improved (or both)?

-Kellen

On Mon, Sep 24, 2018 at 8:43 PM Foivos Diakogiannis <
phoevos.diakogiannis@gmail.com> wrote:

> Dear all,
>
> first my compliments on this great software, and thank you all for the
> effort you've put into this.
>
> I am a gluon API user, and I thought I should give some feedback to
> highlight some user-perspective issues. I am working at CSIRO and I am
> using gluon to write and deploy custom deep learning models for semantic
> segmentation/classification on CSIRO HPC facilities. I came into the deep
> learning world in July 2017 (2nd postdoc, after astronomy), starting
> with Keras (great intro, but too simple/automated for my needs), moving on
> to TF (the complexity of C++ with the inconvenience of Python performance,
> and memory management was bad; on the plus side, great
> documentation+community support, and of course a great product overall,
> just not for me), and since December 2017 I have been using gluon
> exclusively, since it solved the majority of my problems.
>
> Things I love about gluon:
> 1. Great structured tutorials (https://gluon.mxnet.io/), like a book. In
> fact, at the time I started using Gluon, this was better (i.e. more
> structured, with a beginning and an end) than the PyTorch documentation.
> 2. Efficient code, both in speed and GPU memory consumption.
> 3. With the push of a button (hybridize) I can go from research to
> production. I get up to a 3-4x speed-up; this is a huge benefit, and I
> don't see other frameworks easily beating that in the immediate future.
> torch.jit is nowhere near the ease of use of hybridize() - not yet.
> 4. I really value the effort/support given on the discuss.mxnet.io forum.
> Almost always when I have a problem I find a solution there, from experts.
> This complements my knowledge/understanding of the code around the gluon
> API.
> 5. Super easy data parallel modeling. The examples in the tutorial make
> life really easy. This made a huge difference for me, and it was the main
> reason I switched from TF to gluon.
>
> Things I find difficult with gluon:
> 1. Documentation is not in one place, so I only learned via Twitter that
> gluon-cv and gluon-nlp exist (and that they have great examples). These
> should be on the main mxnet page, all together in one place (they should
> actually be advertised). In addition, sometimes the examples are not
> updated with the latest changes. For example,
> mynet.collect_params().initialize(...) in the gluon "book" should now be
> mynet.initialize(...), and there are several other examples in the same
> spirit. Also, I don't see a clear definition/description of new methods in
> the release announcements when they are added, so that I know how to
> improve my code. For example, I learned about the block.summary(*inputs)
> feature by checking the pull requests. Yes, it exists in the official API
> documentation, and I am used to going through all of it every now and then.
> This can be done better.
> 2. Not all custom architectures are easy to implement in a hybrid format.
> For example, taking the shape of a layer and using this as information for
> pooling layers (or other things) is not easy (without copying to CPU
> first), and many times I have to implement many hacks to get around this
> (for performance gains). For example, here:
> https://discuss.mxnet.io/t/coordconv-layer/1394/4 Another example is the
> pyramid scene parsing network; it took me a lot of time and many hacks to
> hybridize it.
> 3. The distributed examples are not yet fully functional. Running
> distributed training to increase the batch size is OK-ish (under the
> SLURM manager, see this:
> https://discuss.mxnet.io/t/distributed-training-questions/1269/6 ), but
> implementing async SGD is, at least for me, still an open
> problem. Of course, I completely understand that distributed training is
> still very much a research project, and I am not sure if using a large
> batch size is good for training (hence my effort to use async SGD). I've
> read various opinions in research papers about this. At the moment I am
> using distributed only for hyperparameter optimization, as I increase the
> batch size (when necessary) with delayed gradient updates.
> 4. No higher order gradient support. This is where pytorch is better, and
> where I am forced to use it in my GAN experiments for gradient penalty
> implementation ( https://github.com/apache/incubator-mxnet/issues/10002). I
> hope that this will change in the immediate future. It is my understanding
> that a lot of effort goes into semi-supervised training techniques and my
> gut feeling tells me that GANs are an important key ingredient to the
> solution of this problem.
>
>
> Things I really don't like about mxnet:
> 1. The documentation for C++ is not clear. I am developing code in C++ for
> the past 8 years. I am not a software engineer by training but I feel
> comfortable-ish in looking, say, in the source code of boost library or
> Eigen. I cannot say the same thing for mxnet. This is a barrier for me to
> even think contributing in C++ code.
>
> Again, many thanks for all your efforts and this awesome library!
>
> Regards,
> Foivos Diakogiannis
>

Re: Some feedback from MXNet Zhihu topic

Posted by Foivos Diakogiannis <ph...@gmail.com>.

Dear all,

first my compliments on this great software, and thank you all for the
effort you've put into this.

I am a gluon API user, and I thought I should give some feedback to
highlight some user-perspective issues. I am working at CSIRO and I am
using gluon to write and deploy custom deep learning models for semantic
segmentation/classification on CSIRO HPC facilities. I came into the deep
learning world in July 2017 (2nd postdoc, after astronomy), starting
with Keras (great intro, but too simple/automated for my needs), moving on
to TF (the complexity of C++ with the inconvenience of Python performance,
and memory management was bad; on the plus side, great
documentation+community support, and of course a great product overall,
just not for me), and since December 2017 I have been using gluon
exclusively, since it solved the majority of my problems.

Things I love about gluon:
1. Great structured tutorials (https://gluon.mxnet.io/), like a book. In
fact, at the time I started using Gluon, this was better (i.e. more
structured, with a beginning and an end) than the PyTorch documentation.
2. Efficient code, both in speed and GPU memory consumption.
3. With the push of a button (hybridize) I can go from research to
production. I get up to a 3-4x speed-up; this is a huge benefit, and I
don't see other frameworks easily beating that in the immediate future.
torch.jit is nowhere near the ease of use of hybridize() - not yet.
4. I really value the effort/support given on the discuss.mxnet.io forum.
Almost always when I have a problem I find a solution there, from experts.
This complements my knowledge/understanding of the code around the gluon
API.
5. Super easy data parallel modeling. The examples in the tutorial make
life really easy. This made a huge difference for me, and it was the main
reason I switched from TF to gluon.
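
For readers unfamiliar with the data parallel pattern mentioned in item 5,
here is a minimal plain-Python sketch of the idea, assuming a toy
one-weight least-squares model and a batch size divisible by the number of
devices. It deliberately does not use Gluon (where, as far as I know, the
corresponding pieces are gluon.utils.split_and_load and
trainer.step(batch_size)); grad_sum and data_parallel_step are hypothetical
helper names. It shows that summing per-shard gradients and normalizing by
the full batch size gives the same update as a single device.

```python
def grad_sum(w, samples):
    """Sum of per-sample gradients of 0.5 * (w * x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in samples)


def data_parallel_step(w, batch, num_devices, lr):
    """One SGD step with the batch split evenly across hypothetical devices.

    Assumes len(batch) is divisible by num_devices.
    """
    shard = len(batch) // num_devices
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_devices)]
    # Each "device" computes the gradient sum for its own shard.
    total = sum(grad_sum(w, s) for s in shards)
    # Normalize by the full batch size, not the shard size.
    return w - lr * total / len(batch)


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
single = data_parallel_step(0.0, batch, num_devices=1, lr=0.01)
multi = data_parallel_step(0.0, batch, num_devices=2, lr=0.01)
print(single, multi)
```

The normalization by the full batch size is the detail that matters: if the
summed gradient were divided by a different number, the effective learning
rate would change with the batch size.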

Things I find difficult with gluon:
1. Documentation is not in one place, so I only learned via Twitter that
gluon-cv and gluon-nlp exist (and that they have great examples). These
should be on the main mxnet page, all together in one place (they should
actually be advertised). In addition, sometimes the examples are not
updated with the latest changes. For example,
mynet.collect_params().initialize(...) in the gluon "book" should now be
mynet.initialize(...), and there are several other examples in the same
spirit. Also, I don't see a clear definition/description of new methods in
the release announcements when they are added, so that I know how to
improve my code. For example, I learned about the block.summary(*inputs)
feature by checking the pull requests. Yes, it exists in the official API
documentation, and I am used to going through all of it every now and then.
This can be done better.
2. Not all custom architectures are easy to implement in a hybrid format.
For example, taking the shape of a layer and using this as information for
pooling layers (or other things) is not easy (without copying to cpu
first), and many times I have to implement many hacks to get around this
(for performance gains). For example, here:
https://discuss.mxnet.io/t/coordconv-layer/1394/4 . Another example is the
pyramid scene parsing network: it took me a lot of time and many hacks to
hybridize it.
3. The distributed examples are not yet fully functional. Running
distributed training to increase the batch size is OK-ish (under the
SLURM manager, see this:
https://discuss.mxnet.io/t/distributed-training-questions/1269/6 ), but
implementing async SGD is, at least for me, still an open
problem. Of course, I completely understand that distributed training is
still very much a research project, and I am not sure if using a large
batch size is good for training (hence my effort to use async SGD). I've
read various opinions in research papers on this. At the moment I am using
distributed training only for hyperparameter optimization, as I increase
the batch size (when necessary) with delayed gradient updates.
4. No higher order gradient support. This is where PyTorch is better, and
where I am forced to use it in my GAN experiments for the gradient penalty
implementation ( https://github.com/apache/incubator-mxnet/issues/10002 ). I
hope that this will change in the immediate future. It is my understanding
that a lot of effort goes into semi-supervised training techniques and my
gut feeling tells me that GANs are an important key ingredient to the
solution of this problem.
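
The "delayed gradient updates" mentioned in point 3 can be sketched roughly
as follows. This is only an illustration in plain Python on a toy linear
model, not gluon code: the function names (grad_sum, sgd_large_batch,
sgd_delayed) and the model are assumptions made up for this example. The
idea is to sum the gradients of several small batches while the weights stay
fixed, and to apply a single update afterwards, which behaves like one step
on the larger batch.

```python
def grad_sum(w, b, batch):
    """Summed gradient of 0.5 * (w*x + b - y)**2 over all samples in batch."""
    gw = gb = 0.0
    for x, y in batch:
        err = w * x + b - y
        gw += err * x
        gb += err
    return gw, gb


def sgd_large_batch(data, lr, batch_size, w=0.0, b=0.0):
    """Plain SGD: one update per batch, gradient averaged over the batch."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        gw, gb = grad_sum(w, b, batch)
        w -= lr * gw / len(batch)
        b -= lr * gb / len(batch)
    return w, b


def sgd_delayed(data, lr, micro_batch_size, accumulate, w=0.0, b=0.0):
    """Delayed updates: accumulate gradients over `accumulate` micro-batches
    (hypothetical parameter name), then apply one averaged update."""
    acc_w = acc_b = 0.0
    seen = 0    # samples accumulated since the last update
    steps = 0   # micro-batches accumulated since the last update
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        # The weights are NOT updated between micro-batches.
        gw, gb = grad_sum(w, b, micro)
        acc_w += gw
        acc_b += gb
        seen += len(micro)
        steps += 1
        if steps == accumulate:
            w -= lr * acc_w / seen
            b -= lr * acc_b / seen
            acc_w = acc_b = 0.0
            seen = steps = 0
    if seen:
        # Flush a trailing, incomplete group of micro-batches.
        w -= lr * acc_w / seen
        b -= lr * acc_b / seen
    return w, b


if __name__ == "__main__":
    # Toy data for y = 2x + 1.
    data = [(i / 4.0, 2.0 * (i / 4.0) + 1.0) for i in range(16)]
    print(sgd_large_batch(data, lr=0.05, batch_size=8))
    print(sgd_delayed(data, lr=0.05, micro_batch_size=2, accumulate=4))
```

Up to floating point rounding, both calls should give the same weights,
since 4 micro-batches of 2 samples cover the same samples as one batch of 8.
Only one small batch has to be in memory at a time, which is the reason for
doing it this way when the large batch does not fit on the GPU.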

Things I really don't like about mxnet:
1. The documentation for C++ is not clear. I have been developing code in
C++ for the past 8 years. I am not a software engineer by training, but I
feel comfortable-ish reading, say, the source code of the Boost library or
Eigen. I cannot say the same thing for mxnet. This is a barrier for me to
even think about contributing C++ code.

Again, many thanks for all your efforts and this awesome library!

Regards,
Foivos Diakogiannis

On Fri, Sep 21, 2018 at 12:51 AM Timur Shenkao <ts...@timshenkao.su> wrote:

> There are:
> Gluon API
> Module API
> Some other apis in mxnet
>  low-level C / C++ apis
>
> Recently I accidentally found that such things as Gluon NLP and
> Gluon CV exist (besides some examples in MXNet itself).
> It's unclear whether I can rely on some API or I have to create my own C /
> C++ code.
>
> I implement publicly available articles and some other ideas in TF all the
> time. But when it comes to MXNet, I am often reluctant because it's
> difficult to understand which way to go. It's unclear whether my efforts
> will result in some working model or I will get stuck.
> Points #5 and #6 are absolutely true.
> As for documentation, all projects in their turbulent phase of lifecycle
> have outdated docs, it's normal. I say docs are very good (I remember early
> Spark & DL4J docs 😂 )
>
>
>
> On Thursday, September 20, 2018, Tianqi Chen <tq...@cs.washington.edu>
> wrote:
>
> > The key complaint here is mainly about the clarity of the documents
> > themselves. Maybe it is time to focus on a single flavor of API that is
> > useful (Gluon) and highlight all the docs around that.
> >
> > Tianqi
> >
> >
> > On Wed, Sep 19, 2018 at 11:04 AM Qing Lan <la...@live.com> wrote:
> >
> > > Hi all,
> > >
> > > There was a trending topic<https://www.zhihu.com/question/293996867> on
> > > Zhihu (a famous Chinese Stackoverflow+Quora) recently, asking about the
> > > status of MXNet in 2018. Mu replied to the thread and received more
> > > than 300 `like`s.
> > > However, a few concerns were raised in the comments of this thread.
> > > I have done a simple translation from Chinese to English:
> > >
> > > 1. Documentation! Even now, the online docs still contain:
> > >     1. Deprecated docs that were never updated
> > >     2. Wrong documentation with poor descriptions
> > >     3. Documents in alpha stage, e.g. ones that require
> > > `pip install --pre` in order to run.
> > >
> > > 2. Examples! For Gluon specifically, many examples still mix
> > > Gluon/MXNet APIs. The mixture of mx.sym, mx.nd and mx.gluon confuses
> > > users about which one to choose in order to get their model to work.
> > > For example, although Gluon made data encapsulation possible, there
> > > are still examples using mx.io.ImageRecordIter with tens of params
> > > (it feels like the Gluon examples are simply copies of old Python
> > > examples).
> > >
> > > 3. Examples again! Compared to PyTorch, there are a few kinds of
> > > examples I don't like in Gluon:
> > >     1. Runnable, but the code structure is still very complicated,
> > > such as example/image-classification/cifar10.py. It reads like
> > > consecutive pieces of code concatenated together. In fact, these are
> > > just a series of layers mixed with model.fit. This makes it very hard
> > > for users to modify/extend the model.
> > >     2. Runnable only with certain settings. If users change the model
> > > a little, crashes happen. For example, in the multi-GPU example on
> > > the Gluon website, MXNet hides the logic by which the batch size
> > > changes the learning rate in the optimizer. A lot of newbies don't
> > > know this, and they only find that the model stops converging
> > > when the batch size changes.
> > >     3. The worst scenario is that the model itself simply doesn't
> > > work. Maintainers in the MXNet community didn't run the model (there
> > > is not even an integration test) and merged the code directly. The
> > > script then stays broken until somebody raises an issue and fixes it.
> > >
> > > 4. The community problem. The core advantage of MXNet is its
> > > scalability and efficiency. However, the documentation of some tools
> > > is confusing. Here are two examples:
> > >
> > >     1. im2rec comes in 2 versions, C++ (binary) and Python. But
> > > nobody would have thought that the command-line arguments of these
> > > tools are different (meanwhile, there are no appropriate examples to
> > > compare with, so users can only guess the usage).
> > >
> > >     2. How to combine MXNet distributed training with a
> > > supercomputing tool such as Slurm? How do we do profiling and
> > > debugging? A couple of companies I know considered using MXNet for
> > > distributed training. Due to the lack of examples and poor support
> > > from the community, they had to move their models to TensorFlow and
> > > Horovod.
> > >
> > > 5. The heavy code base. Most of the MXNet examples/source
> > > code/documentation/language bindings are in a single repo. A git
> > > clone costs tens of MB. New feature PRs take longer than expected.
> > > The poor review response / rules keep new contributors away from the
> > > community. I remember there was a call for documentation improvement
> > > last year. It took one user 3 months in total to get the change
> > > merged into master. That is almost equal to a PyTorch release
> > > interval.
> > >
> > > 6. To developers. Very few people in the community discuss the
> > > improvements we can make to make MXNet more user-friendly. It is far
> > > too easy to trigger tens of stack issues during coding. Again, is it
> > > a requirement for MXNet users to be familiar with C++? The connection
> > > between Python and C lacks IDE lint support (maybe MXNet assumes
> > > every developer is a Vim master). The API/underlying implementation
> > > changes frequently. People have to release their code with an
> > > archived version of MXNet (such as TuSimple and MSRA). Compare with
> > > PyTorch, where even an API used to move a tensor to a device would
> > > prompt a thorough discussion.
> > >
> > > There will be more comments translated to English and I will keep this
> > > thread updated…
> > > Thanks,
> > > Qing
> > >
> >
>