Posted to issues@lucy.apache.org by "Marvin Humphrey (Commented) (JIRA)" <ji...@apache.org> on 2011/11/21 23:45:39 UTC

[lucy-issues] [jira] [Commented] (LUCY-191) Unicode normalization

    [ https://issues.apache.org/jira/browse/LUCY-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154693#comment-13154693 ] 

Marvin Humphrey commented on LUCY-191:
--------------------------------------

Great job, Nick!  You really nailed it with this contribution.

  * Public API and rough implementation design match what you outlined and
    built consensus for on the dev list.
  * Proper documentation.
  * Builds clean.
  * Accompanied by tests and passes them.
  * Passes test_valgrind.
  * Looks portable.

I have a handful of minor suggestions, but as none of them are crucial, +1 to
commit verbatim.

Here are two things that I'd like to discuss on the dev list:

  * Memory allocated with malloc() within utf8proc() should not necessarily be
    freed with FREEMEM (which is an alias for lucy_Memory_wrapped_free).
    This happens to be safe right now, but only because of an implementation
    detail of Lucy::Util::Memory.
  * The fact that utf8proc forces us to reallocate with each operation rather
    than copying when possible as in SnowballStemmer is probably not optimal
    from a performance standpoint.
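
To make the allocator concern concrete, here is a minimal sketch of the
pairing rule: a buffer obtained from plain malloc() inside a third-party
routine is released with free(), and only memory that came from the wrapped
allocator goes back through the wrapped free.  All names below
(wrapped_malloc, wrapped_free, thirdparty_upcase, upcase_owned) are
hypothetical stand-ins, not Lucy's or utf8proc's actual API, and the
allocation counter is only an assumption about what a wrapper might someday
track.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapped allocator.  The counter stands in for any
 * bookkeeping a wrapper might add later; it is not Lucy's real code. */
static size_t live_wrapped = 0;

static void *wrapped_malloc(size_t n) {
    void *p = malloc(n);
    if (p == NULL) { abort(); }
    live_wrapped++;
    return p;
}

static void wrapped_free(void *p) {
    if (p != NULL) {
        live_wrapped--;
        free(p);
    }
}

/* Hypothetical third-party routine that hands back malloc'd memory,
 * standing in for a library call that allocates its own result. */
static char *thirdparty_upcase(const char *src, size_t len) {
    char *out = malloc(len + 1);
    if (out == NULL) { return NULL; }
    for (size_t i = 0; i < len; i++) {
        out[i] = (char)toupper((unsigned char)src[i]);
    }
    out[len] = '\0';
    return out;
}

/* Copy the result into wrapper-owned memory, then release the library's
 * buffer with free(), the counterpart of the malloc() that produced it. */
static char *upcase_owned(const char *src, size_t len) {
    char *tmp = thirdparty_upcase(src, len);
    if (tmp == NULL) { return NULL; }
    char *owned = wrapped_malloc(len + 1);
    memcpy(owned, tmp, len + 1);
    free(tmp);
    return owned;
}
```

Passing tmp straight to wrapped_free() would work only as long as the wrapper
is a thin shim over free(); with the counter above it would already corrupt
the bookkeeping.  The extra copy is the price of keeping the pairs matched,
which ties into the performance concern in the second point.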

Here are two tiny details:

  * We can simplify the dump/load routines if we cache "form" within the
    object as a member variable.  
  * The keywords "true" and "false" are available in Clownfish, and I think we
    should use those as the defaults in the method signature for the boolean
    args.

That's all I got right now!  Nice work figuring out this patch with minimal
help.
                
> Unicode normalization
> ---------------------
>
>                 Key: LUCY-191
>                 URL: https://issues.apache.org/jira/browse/LUCY-191
>             Project: Lucy
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Nick Wellnhofer
>            Assignee: Marvin Humphrey
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.3.0 (incubating)
>
>         Attachments: LUCY-191-normalizer.patch
>
>
> As discussed on the mailing list, it would be nice to have Unicode normalization, Unicode case folding and stripping of accents as part of the analyzer chain. With the help of utf8proc this can be done in one pass. So I proposed a new analyzer Lucy::Analyzer::Normalizer with an interface described here:
> http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/%3C4EC43816.1070107%40aevum.de%3E
