
Posted to dev@mxnet.apache.org by Anirudh <an...@gmail.com> on 2018/03/01 07:50:03 UTC

Re: UTF-8 Support for TextParser

Hi Tianqi,

What do you think about adding a separate parser for CSV with UTF8 support
in dmlc-core? We can then just add a flag in MXNet for UTF8 and use the
UTF8 or the ASCII parser based on this flag. (This idea was suggested by
Mu).

I think this will need some small changes to the base class
"TextParserBase": the method "BackFindEndLine" would need extra logic to
check for the other code points that act as line breaks, and that logic can
be refactored out. This approach will likely retain the performance of the
existing ASCII CSV parser, while letting MXNet users choose between the
usability of the UTF-8 CSV parser and the performance of the ASCII CSV parser.
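
To make the "BackFindEndLine" idea concrete, here is a minimal sketch of a
backward scan that also recognizes a few Unicode line breaks encoded as UTF-8.
The names LineBreakLength and BackFindEndLineUtf8 are hypothetical and are not
part of dmlc-core. The set of code points checked (LF, CR, U+0085, U+2028,
U+2029) is an assumption about what "other code-points for line-breaks" would
cover, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper (not a dmlc-core function): returns the length in
// bytes of a line break starting at p, or 0 if p does not start one.
// Precondition: p < end, where end is one past the last readable byte.
inline std::size_t LineBreakLength(const char* p, const char* end) {
  const unsigned char c0 = static_cast<unsigned char>(p[0]);
  if (c0 == '\n' || c0 == '\r') return 1;  // ASCII LF / CR
  const std::size_t avail = static_cast<std::size_t>(end - p);
  // U+0085 NEXT LINE is encoded as C2 85.
  if (avail >= 2 && c0 == 0xC2 &&
      static_cast<unsigned char>(p[1]) == 0x85) {
    return 2;
  }
  // U+2028 LINE SEPARATOR is E2 80 A8, U+2029 PARAGRAPH SEPARATOR is E2 80 A9.
  if (avail >= 3 && c0 == 0xE2 &&
      static_cast<unsigned char>(p[1]) == 0x80) {
    const unsigned char c2 = static_cast<unsigned char>(p[2]);
    if (c2 == 0xA8 || c2 == 0xA9) return 3;
  }
  return 0;
}

// Hypothetical UTF-8 aware variant of a backward end-of-line search.
// Scans from bptr back toward begin and returns a pointer to the first byte
// of the last line break found, or begin if there is none.
// Precondition: begin <= bptr < end.
inline const char* BackFindEndLineUtf8(const char* bptr, const char* begin,
                                       const char* end) {
  for (; bptr != begin; --bptr) {
    if (LineBreakLength(bptr, end) != 0) return bptr;
  }
  return begin;
}
```

Because 0xC2 and 0xE2 only ever appear as lead bytes in valid UTF-8, a
byte-by-byte backward scan cannot mistake the middle of another character for
one of these line breaks. The extra comparisons on every byte are the kind of
cost that keeping a separate ASCII parser would avoid.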

Thanks,
Anirudh


On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <an...@gmail.com> wrote:

> Hi Marco,
>
> I understand that there needs to be a separate discussion on the strong
> dependency between mxnet and dmlc-core and how to fix it.
>
> Having said that, I think the goals of dmlc-core and mxnet are somewhat
> aligned. Posting to the MXNet dev list in this case is a good way to gather
> feedback from both communities, since I consider the MXNet community to be
> mostly a superset of the dmlc-core community.
>
> Anirudh
>
> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <an...@amazon.com>
> wrote:
>
>> Hi Tianqi,
>>
>> UTF-8 support would make other formats like CSV more usable.
>> Otherwise, users have to normalize their data in some way before
>> using mxnet.
>> I understand that there is a tradeoff here because of the efficiency
>> gains from the parser, but expecting users to normalize their UTF-8
>> files may turn them away.
>>
>> Anirudh
>>
>> On 2/26/18, 3:54 PM, "workcrow@gmail.com on behalf of Tianqi Chen" <
>> workcrow@gmail.com on behalf of tqchen@cs.washington.edu> wrote:
>>
>>     Since LibSVM format is only going to involve numbers and possibly
>>     ascii characters, is there any reason to add UTF-8 support? Note that
>>     generalization always comes with a cost in efficiency, and some
>>     effort has been spent on making the parser fast.
>>
>>     Tianqi
>>
>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <an...@gmail.com> wrote:
>>
>>     > Hi all,
>>     >
>>     > Currently there is no UTF-8 support in the LibSVM, LibFM or CSV
>>     > text parsers. I am working on adding UTF-8 support for the text
>>     > parsers. Since C++ doesn't have great built-in support for UTF-8,
>>     > I am looking at third-party libraries which provide Unicode
>>     > support. I am currently considering ICU. Any comments, suggestions,
>>     > past experience, or gotchas about third-party Unicode libraries, or
>>     > about adding Unicode support in general, are highly appreciated.
>>     >
>>     > I have created an issue about the same:
>>     > https://github.com/dmlc/dmlc-core/issues/372
>>     > Please feel free to reply to this email or comment on the GitHub
>>     > issue if you have any input.
>>     >
>>     > Anirudh
>>     >
>>
>>
>>
>

Re: UTF-8 Support for TextParser

Posted by Anirudh <an...@gmail.com>.

I won't run the entire text through a converter. The parser will just ignore
the BOM character during the parsing stage.
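
As a rough illustration of ignoring the BOM during parsing, here is a minimal
sketch. SkipUtf8Bom is a hypothetical helper name, not an existing dmlc-core
function, and where such a check would be wired into the parser is left open.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not a dmlc-core function): if [begin, end) starts
// with the UTF-8 byte order mark EF BB BF, return a pointer just past it;
// otherwise return begin unchanged.
inline const char* SkipUtf8Bom(const char* begin, const char* end) {
  static const char kBom[] = {'\xEF', '\xBB', '\xBF'};
  if (end - begin >= 3 && std::memcmp(begin, kBom, 3) == 0) {
    return begin + 3;
  }
  return begin;
}
```

A check like this would only apply to the very first bytes of a file. If the
input is read in chunks, later chunks should not be passed through it, since
the same three bytes in the middle of a file are not a BOM.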

Anirudh

On Fri, Mar 9, 2018 at 1:12 PM, Chris Olivier <cj...@gmail.com> wrote:

> For this, are you going to run the entire text through a converter, or just
> prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?

Re: UTF-8 Support for TextParser

Posted by Chris Olivier <cj...@gmail.com>.

For this, are you going to run the entire text through a converter, or just
prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?

On Fri, Mar 9, 2018 at 12:43 PM, Anirudh <an...@gmail.com> wrote:

> Hi,
>
> After understanding the customer requirement better, we found that the
> customer uses only ASCII data with MXNet. They just want files that start
> with a UTF-8 BOM, and files that use different control characters for
> newlines, to work. dmlc-core already supports the different newline
> control characters.
> Since a UTF-8 BOM is a common case for other MXNet users too (for example,
> when saving an Excel sheet as a UTF-8 CSV), I will add support for
> handling the UTF-8 BOM in dmlc-core.
> I won't be working on UTF8CSVParser unless a customer requirement for it
> comes up later on.
>
> Anirudh

Re: UTF-8 Support for TextParser

Posted by Anirudh <an...@gmail.com>.

Hi,

After understanding the customer requirement better, we found that the
customer uses only ASCII data with MXNet. They just want files that start
with a UTF-8 BOM, and files that use different control characters for
newlines, to work. dmlc-core already supports the different newline
control characters.
Since a UTF-8 BOM is a common case for other MXNet users too (for example,
when saving an Excel sheet as a UTF-8 CSV), I will add support for
handling the UTF-8 BOM in dmlc-core.
I won't be working on UTF8CSVParser unless a customer requirement for it
comes up later on.
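
For readers unfamiliar with the newline point above, here is a minimal sketch
of treating LF, CR and CRLF all as line terminators when splitting ASCII text
into rows. SplitLines is a hypothetical helper written for illustration. It is
not dmlc-core's actual code, and it assumes empty lines can be skipped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a dmlc-core function): split text into non-empty
// lines, treating '\n', '\r' and "\r\n" all as line terminators.
inline std::vector<std::string> SplitLines(const std::string& text) {
  std::vector<std::string> lines;
  const std::size_t n = text.size();
  std::size_t start = 0;
  while (start < n) {
    // Find the next CR or LF at or after start.
    std::size_t stop = text.find_first_of("\r\n", start);
    if (stop == std::string::npos) stop = n;
    // Skipping empty pieces means "\r\n" yields one break, not two lines.
    if (stop > start) lines.push_back(text.substr(start, stop - start));
    start = stop + 1;
  }
  return lines;
}
```

With this approach a file saved with Windows, old Mac or Unix line endings
produces the same rows, which is the behavior the customer asked for.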

Anirudh


