You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@devicemap.apache.org by "Reza Naghibi (JIRA)" <ji...@apache.org> on 2015/07/13 18:10:05 UTC

[jira] [Commented] (DMAP-107) Performance optimizations for DeviceMapClient.classify()

    [ https://issues.apache.org/jira/browse/DMAP-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624864#comment-14624864 ] 

Reza Naghibi commented on DMAP-107:
-----------------------------------

Regarding removing the 2 test strings, looks like "HTC One X" and "HTC One X+" match the same since 1.0 normalizes out the regex. Otherwise, im guessing the + is an error because its likely meant to be a literal \+ and not a regex. Since the patterns are identical, the device choosen is the first one reached during iteration. The patch changes this because the data structure changes from a HashMap to a HashSet, so iteration order is different.

I could add logic to the ranking function to fix this, but at this point there is no use. Getting matching to properly work on the ODDR data will never be perfect because the data has many errors like the one above. So as always, the solution here is the fix the bad pattern.

Also, some problems with your split() method. It shouldn't be static and you can remove the reference to Apache commons by using String.isEmpty(). Not sure we need the null check, but null is allowed in normalize(), so its best to err on the side of safety. Below is the corrected version:

{code:java}
private List<String> split(String text) {
        List<String> nonemptyParts = new ArrayList<String>();

        String[] parts = TEXT_SPLIT_PATTERN.split(text);

        for (String part : parts) {
            String normalizedPart = Pattern.normalize(part);

            if (normalizedPart != null && !normalizedPart.isEmpty()) {
                nonemptyParts.add(normalizedPart);
            }
        }
        
        return nonemptyParts;
    }
{code}

Also, the style of the 1.0 Java client is to be explicit with imports and not use the wildcard. Just a small style nitpick. So if you can correct the above split() function (and fix the imports), your patch should be good to go with the HTC One X tests removed.

> Performance optimizations for DeviceMapClient.classify()
> --------------------------------------------------------
>
>                 Key: DMAP-107
>                 URL: https://issues.apache.org/jira/browse/DMAP-107
>             Project: DeviceMap
>          Issue Type: Improvement
>          Components: Java Client
>    Affects Versions: 1.1.0 Java
>            Reporter: Volkan Yazıcı
>             Fix For: 1.1.1 Java
>
>         Attachments: classify.diff
>
>
> This patch removes redundant {{DeviceType}} checks and User-Agent string normalization calls. Performance gain is more than 2x. Check out the devicemap-client-benchmark (introduced in issue DMAP-106) output:
> {code}
> $ export userAgentFile=/path/to/user-agents.txt
> $ wc -l $userAgentFile
> 195325
> $ java \
>     -jar devicemap/java/classifier-benchmark/target/devicemap-client-benchmark.jar \
>     -jvmArgsAppend "-server -XX:+TieredCompilation -XX:+AggressiveOpts -Xms1024m -Xmx4096m -DuserAgentFile=$userAgentFile" \
>     -wi 5 -i 5 -bm avgt -tu ms -f 3 \
>     ".*DeviceMapClientBenchmark.*"
> # Using the most recent trunk.
> Result: 12079.408 ±(99.9%) 1240.628 ms/op [Average]
>   Statistics: (min, avg, max) = (11232.424, 12079.408, 16011.000), stdev = 1160.484
>   Confidence interval (99.9%): [10838.781, 13320.036]
> # Using the enhanced classify().
> Result: 5505.355 ±(99.9%) 441.748 ms/op [Average]
>   Statistics: (min, avg, max) = (5060.269, 5505.355, 6508.699), stdev = 413.211
>   Confidence interval (99.9%): [5063.607, 5947.103]
> {code}
> JVM version: 1.8.0_25
> OS: OS X 10.9.5
> Processor: 2.6 GHz Intel Core i5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)