Posted to dev@mxnet.apache.org by Anirudh <an...@gmail.com> on 2018/03/01 07:50:03 UTC
Re: UTF-8 Support for TextParser
Hi Tianqi,
What do you think about adding a separate CSV parser with UTF-8 support
in dmlc-core? We can then add a flag in MXNet and select either the
UTF-8 or the ASCII parser based on it. (This idea was suggested by
Mu.)
I think this will require some small changes to the base class
"TextParserBase": the method "BackFindEndLine" would need more logic
to check for other line-break code points, and that can be refactored.
This approach will likely retain the performance of the existing ASCII CSV
parser, while letting MXNet users choose between the usability of a UTF-8
CSV parser and the performance of the ASCII CSV parser.
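To make the idea concrete, here is a minimal sketch of a backward line-break search that only does the extra UTF-8 checks when a flag is set. The names LineBreakLength and BackFindLineBreak and the utf8 flag are hypothetical and are not the dmlc-core API. The set of extra line breaks checked (U+0085, U+2028, U+2029) is also an assumption about what a UTF-8 parser might want to recognize.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of dmlc-core): returns the byte length of a
// line break starting at p, or 0 if there is none. `end` is one past the
// last valid byte of the buffer.
inline std::size_t LineBreakLength(const char* p, const char* end, bool utf8) {
  const unsigned char c0 = static_cast<unsigned char>(p[0]);
  if (c0 == '\n' || c0 == '\r') return 1;  // ASCII line breaks
  if (!utf8) return 0;                     // ASCII mode stops here
  const std::size_t left = static_cast<std::size_t>(end - p);
  // U+0085 NEXT LINE is encoded as C2 85.
  if (c0 == 0xC2 && left >= 2 &&
      static_cast<unsigned char>(p[1]) == 0x85) {
    return 2;
  }
  // U+2028 LINE SEPARATOR is E2 80 A8, U+2029 PARAGRAPH SEPARATOR is E2 80 A9.
  if (c0 == 0xE2 && left >= 3 &&
      static_cast<unsigned char>(p[1]) == 0x80) {
    const unsigned char c2 = static_cast<unsigned char>(p[2]);
    if (c2 == 0xA8 || c2 == 0xA9) return 3;
  }
  return 0;
}

// Hypothetical counterpart to a backward end-of-line search: scans from the
// end of [begin, end) toward the front and returns a pointer to the first
// byte of the last line break found, or `begin` if there is none.
inline const char* BackFindLineBreak(const char* begin, const char* end,
                                     bool utf8) {
  for (const char* p = end; p != begin;) {
    --p;
    if (LineBreakLength(p, end, utf8) != 0) return p;
  }
  return begin;
}
```

With utf8 set to false the loop only performs the two byte comparisons per position, which is the reason for keeping the ASCII path separate.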
Thanks,
Anirudh
On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <an...@gmail.com> wrote:
> Hi Marco,
>
> I understand that there needs to be a separate discussion about the strong
> dependency between mxnet and dmlc-core and how to fix it.
>
> Having said that, I think the goals of dmlc-core and mxnet are somewhat
> aligned. Posting to the MXNet dev list in this case
> is a good way to gather feedback from both communities, since I
> consider the MXNet community to be mostly a superset of the dmlc-core
> community.
>
> Anirudh
>
> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <an...@amazon.com>
> wrote:
>
>> Hi Tianqi,
>>
>> UTF-8 support would make other formats like CSV more usable.
>> Otherwise, users have to normalize their data in some way before
>> using mxnet.
>> I understand that there is a tradeoff here because of the efficiency
>> gains from the parser, but the expectation of having to normalize their UTF-8
>> files may turn users away.
>>
>> Anirudh
>>
>> On 2/26/18, 3:54 PM, "workcrow@gmail.com on behalf of Tianqi Chen" <
>> workcrow@gmail.com on behalf of tqchen@cs.washington.edu> wrote:
>>
>> Since LibSVM format is only going to involve numbers and possibly ASCII
>> characters, is there any reason to add UTF-8 support? Note that
>> generalization always comes at a cost in efficiency, and some
>> effort has been spent on making the parser fast.
>>
>> Tianqi
>>
>> On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <an...@gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > Currently there is no UTF-8 support for the LibSVM, LibFM or CSV text parsers.
>> > I am currently working on adding UTF-8 support for the text parsers. Since C++
>> > doesn't have great built-in support for UTF-8, I am looking at
>> > third-party libraries which provide Unicode support. I am considering ICU
>> > currently. Any comments, suggestions, past experience, or gotchas about
>> > Unicode third-party libraries or adding Unicode support in general are
>> > highly appreciated.
>> >
>> > I have created an issue about the same:
>> > https://github.com/dmlc/dmlc-core/issues/372
>> > Please feel free to reply to this email or comment on the GitHub issue if
>> > you have any input.
>> >
>> > Anirudh
>> >
>>
>>
>>
>
Re: UTF-8 Support for TextParser
Posted by Anirudh <an...@gmail.com>.
I won't run the entire text through a converter; I will just skip the BOM
bytes during the parsing stage.
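A minimal sketch of what skipping the BOM could look like, assuming the parser sees the start of the file as a [begin, end) byte range. The helper name SkipUtf8Bom is hypothetical and is not the function added to dmlc-core; the three bytes are the UTF-8 BOM mentioned in the question (0xEF, 0xBB, 0xBF).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not the dmlc-core implementation): returns a pointer
// just past a leading UTF-8 BOM if the buffer starts with one, otherwise
// returns `begin` unchanged. No bytes are copied or converted.
inline const char* SkipUtf8Bom(const char* begin, const char* end) {
  static const unsigned char kBom[3] = {0xEF, 0xBB, 0xBF};
  if (end - begin >= 3 && std::memcmp(begin, kBom, 3) == 0) {
    return begin + 3;
  }
  return begin;
}
```

This check only needs to run once, on the first chunk of a file, so the per-line parsing cost stays the same.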
Anirudh
On Fri, Mar 9, 2018 at 1:12 PM, Chris Olivier <cj...@gmail.com> wrote:
> For this, are you going to run the entire text through a converter, or just
> prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?
Re: UTF-8 Support for TextParser
Posted by Chris Olivier <cj...@gmail.com>.
For this, are you going to run the entire text through a converter, or just
prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?
On Fri, Mar 9, 2018 at 12:43 PM, Anirudh <an...@gmail.com> wrote:
> Hi,
>
> After a deeper look at the customer requirement, we found that the
> customer uses only ASCII data with MXNet; they just want files that
> start with a UTF-8 BOM, and files that use different control
> characters for newlines, to work well. dmlc-core already supports the
> different newline control characters.
> Since a UTF-8 BOM in files is a common use case for other MXNet users
> too (for example, saving an Excel sheet as a UTF-8 CSV), I will add support for
> handling the UTF-8 BOM in dmlc-core.
> I won't be working on UTF8CSVParser unless a customer requirement
> for it comes up later on.
>
> Anirudh
Re: UTF-8 Support for TextParser
Posted by Anirudh <an...@gmail.com>.
Hi,
After a deeper look at the customer requirement, we found that the
customer uses only ASCII data with MXNet; they just want files that
start with a UTF-8 BOM, and files that use different control
characters for newlines, to work well. dmlc-core already supports the
different newline control characters.
Since a UTF-8 BOM in files is a common use case for other MXNet users
too (for example, saving an Excel sheet as a UTF-8 CSV), I will add support for
handling the UTF-8 BOM in dmlc-core.
I won't be working on UTF8CSVParser unless a customer requirement
for it comes up later on.
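As an illustration of the newline point above, here is a small sketch that splits ASCII text into lines while accepting "\n", "\r" and "\r\n" as line breaks. SplitLines is a hypothetical helper written for this example and does not reflect how dmlc-core implements it. It treats any run of '\r' and '\n' as a single separator, so empty lines are dropped; that is an assumption made to keep the sketch short.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into non-empty lines, treating any run
// of '\r' and '\n' characters as one separator.
inline std::vector<std::string> SplitLines(const std::string& text) {
  std::vector<std::string> lines;
  std::size_t start = 0;
  while (start < text.size()) {
    // Find the next line-break character, or the end of the text.
    std::size_t stop = text.find_first_of("\r\n", start);
    if (stop == std::string::npos) stop = text.size();
    if (stop > start) lines.push_back(text.substr(start, stop - start));
    // Skip the whole run of line-break characters.
    start = text.find_first_not_of("\r\n", stop);
    if (start == std::string::npos) break;
  }
  return lines;
}
```

For example, the input "1,2\r\n3,4\r5,6\n" yields the three lines "1,2", "3,4" and "5,6" regardless of which newline convention each row uses.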
Anirudh