Posted to dev@metron.apache.org by cestella <gi...@git.apache.org> on 2017/09/29 16:46:58 UTC

[GitHub] metron pull request #781: METRON-1052:

GitHub user cestella opened a pull request:

    https://github.com/apache/metron/pull/781

    METRON-1052: 

    ## Contributor Comments
    This is a follow-on to METRON-539. Stellar currently has functions for cryptographic hashing. It would be useful to extend these to forensic similarity hash functions so that we can compare how similar two inputs are. This PR adds support for [TLSH](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf).
    
    This fits well within the existing `HASH` abstractions.  I have expanded and generalized it in a few places, but I think it's all within the spirit of the thing.
    
    I still owe a use-case-driven walk-through of a demo that I designed but have not yet written up.
    
    ## Pull Request Checklist
    
    Thank you for submitting a contribution to Apache Metron.  
    Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.  
    Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.  
    
    
    In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). 
    - [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    
    ### For code changes:
    - [x] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
    - [x] Have you included steps or a guide to how the change may be verified and tested manually?
    - [x] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
      ```
      mvn -q clean integration-test install && build_utils/verify_licenses.sh 
      ```
    
    - [x] Have you written or updated unit tests and/or integration tests to verify your changes?
    - [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [x] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
    
    ### For documentation related changes:
    - [x] Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and then verify the changes via `site-book/target/site/index.html`:
    
      ```
      cd site-book
      mvn site
      ```
    
    #### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
    It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron METRON-1052-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #781
    
----
commit 1852d1089c401db8c735fc7b1bff5f385ba612f8
Author: cstella <ce...@gmail.com>
Date:   2017-09-29T16:41:25Z

    Shored up documentation a bit.

commit 2cb04cfaf360dae1041a51f649a51269b033df3e
Author: cstella <ce...@gmail.com>
Date:   2017-09-29T16:41:32Z

    Whoops, forgot one bit.

----


---

[GitHub] metron issue #781: METRON-1052: Add forensic similarity hash functions to St...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/781
  
    +1 by inspection, really nice.


---

[GitHub] metron pull request #781: METRON-1052: Add forensic similarity hash function...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/metron/pull/781


---

[GitHub] metron issue #781: METRON-1052: Add forensic similarity hash functions to St...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/781
  
    One day we will have a handbook, the Tao of Stellar, that maps use cases to functions and includes examples.



---

[GitHub] metron pull request #781: METRON-1052: Add forensic similarity hash function...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/781#discussion_r141931953
  
    --- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/TLSHHasher.java ---
    @@ -0,0 +1,203 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.stellar.common.utils.hashing;
    +
    +import com.fasterxml.jackson.core.type.TypeReference;
    +import com.google.common.base.Joiner;
    +import com.google.common.collect.ImmutableList;
    +import com.trendmicro.tlsh.BucketOption;
    +import com.trendmicro.tlsh.ChecksumOption;
    +import com.trendmicro.tlsh.Tlsh;
    +import com.trendmicro.tlsh.TlshCreator;
    +import org.apache.commons.codec.DecoderException;
    +import org.apache.commons.codec.EncoderException;
    +import org.apache.commons.codec.binary.Hex;
    +import org.apache.metron.stellar.common.utils.ConversionUtils;
    +import org.apache.metron.stellar.common.utils.JSONUtils;
    +import org.apache.metron.stellar.common.utils.SerDeUtils;
    +
    +import java.io.File;
    +import java.io.IOException;
    +import java.nio.file.Files;
    +import java.security.NoSuchAlgorithmException;
    +import java.util.*;
    +import java.util.function.Function;
    +
    +public class TLSHHasher implements Hasher {
    +  public static final String TLSH_KEY = "tlsh";
    +  public static final String TLSH_BIN_KEY = "tlsh_bin";
    +  public enum Config implements EnumConfigurable {
    +    BUCKET_SIZE("bucketSize"),
    +    CHECKSUM("checksumBytes"),
    +    HASHES("hashes"),
    +    FORCE("force")
    +    ;
    +    final public String key;
    +    Config(String key) {
    +      this.key = key;
    +    }
    +
    +    @Override
    +    public String getKey() {
    +      return key;
    +    }
    +  }
    +
    +  BucketOption bucketOption = BucketOption.BUCKETS_128;
    +  ChecksumOption checksumOption = ChecksumOption.CHECKSUM_1B;
    +  Boolean force = true;
    +  List<Integer> hashes = new ArrayList<>();
    +
    +  /**
    +   * Returns an encoded string representation of the hash value of the input. It is expected that
    +   * this implementation does throw exceptions when the input is null.
    +   *
    +   * @param o The value to hash.
    +   * @return A hash of {@code toHash} that has been encoded.
    +   * @throws EncoderException         If unable to encode the hash then this exception occurs.
    +   * @throws NoSuchAlgorithmException If the supplied algorithm is not known.
    +   */
    +  @Override
    +  public Object getHash(Object o) throws EncoderException, NoSuchAlgorithmException {
    +    TlshCreator creator = new TlshCreator(bucketOption, checksumOption);
    --- End diff --
    
    yeah, actually, that's a damned fine suggestion.


---

[GitHub] metron issue #781: METRON-1052: Add forensic similarity hash functions to St...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/781
  
    Yeah, these are a bit different.  Actually, TLSH is byte-level (which is why it's part of the HASH function family).  I was thinking that FUZZY_SCORE might be modified in the future to take an algorithm and fold TLSH_DIST in.  The problem is that TLSH_DIST operates on two *hashes*, not on two pieces of data, so it made more sense to me to keep it separate here.
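
The difference in input shape mentioned above can be illustrated with a small sketch. This is not the TLSH distance, which is defined by the TLSH algorithm and is not reproduced here; `DigestDistanceDemo` and `hexHammingDistance` are hypothetical names, and a plain bitwise Hamming distance is swapped in purely to show a function that compares two hex-encoded digests rather than two raw inputs.

```java
class DigestDistanceDemo {
  /**
   * Counts the differing bits between two equal-length hex digests.
   * Illustrative stand-in only; it is NOT the TLSH distance.
   */
  static int hexHammingDistance(String hexA, String hexB) {
    if (hexA.length() != hexB.length()) {
      throw new IllegalArgumentException("digests must have the same length");
    }
    int distance = 0;
    for (int i = 0; i < hexA.length(); i++) {
      int a = Character.digit(hexA.charAt(i), 16);
      int b = Character.digit(hexB.charAt(i), 16);
      if (a < 0 || b < 0) {
        throw new IllegalArgumentException("not a hex digit at index " + i);
      }
      // XOR leaves a 1 bit wherever the two nibbles differ.
      distance += Integer.bitCount(a ^ b);
    }
    return distance;
  }

  public static void main(String[] args) {
    System.out.println(hexHammingDistance("a1f0", "a1f0")); // prints 0
    System.out.println(hexHammingDistance("a1f0", "a1f3")); // prints 2
  }
}
```

A function of this shape only needs the digests, so under that assumption the inputs could be hashed once, stored, and compared later without access to the original data.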


---

[GitHub] metron pull request #781: METRON-1052: Add forensic similarity hash function...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/781#discussion_r141944809
  
    --- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/TLSHHasher.java ---
    @@ -0,0 +1,203 @@
    +    TlshCreator creator = new TlshCreator(bucketOption, checksumOption);
    --- End diff --
    
    Ok, I set up a cache to create and reuse the creators.  I also added a test case ensuring that the functions are all threadsafe with the use of the cache.
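
One way such a cache might look is sketched below. This is not the code from the PR: `CreatorCache`, `HasherConfig`, and `StubCreator` are hypothetical names, and the stub stands in for the TLSH library's creator class, whose API is not reproduced here. The sketch assumes the cached object is stateful and not safe to share across threads, so it keeps one instance per thread per configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

class CreatorCacheDemo {
  public static void main(String[] args) throws Exception {
    CreatorCache<HasherConfig, StubCreator> cache = new CreatorCache<>(StubCreator::new);

    StubCreator a = cache.get(new HasherConfig(128, 1));
    StubCreator b = cache.get(new HasherConfig(128, 1));
    System.out.println(a == b); // prints true: same thread, equal config, instance reused

    StubCreator[] other = new StubCreator[1];
    Thread t = new Thread(() -> other[0] = cache.get(new HasherConfig(128, 1)));
    t.start();
    t.join();
    System.out.println(a == other[0]); // prints false: another thread gets its own instance
  }
}

/** Hypothetical per-thread cache of non-thread-safe objects, keyed by configuration. */
final class CreatorCache<K, V> {
  // Each thread owns its own map, so cached values are never shared across threads.
  private final ThreadLocal<Map<K, V>> perThread = ThreadLocal.withInitial(HashMap::new);
  private final Function<K, V> factory;

  CreatorCache(Function<K, V> factory) {
    this.factory = Objects.requireNonNull(factory);
  }

  /** Returns the calling thread's instance for this key, creating it on first use. */
  V get(K key) {
    return perThread.get().computeIfAbsent(key, factory);
  }
}

/** Hypothetical cache key standing in for the bucket and checksum options. */
final class HasherConfig {
  final int buckets;
  final int checksumBytes;

  HasherConfig(int buckets, int checksumBytes) {
    this.buckets = buckets;
    this.checksumBytes = checksumBytes;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof HasherConfig)) {
      return false;
    }
    HasherConfig that = (HasherConfig) o;
    return buckets == that.buckets && checksumBytes == that.checksumBytes;
  }

  @Override
  public int hashCode() {
    return Objects.hash(buckets, checksumBytes);
  }
}

/** Stub standing in for a real, expensive-to-build hash creator. */
final class StubCreator {
  final HasherConfig config;

  StubCreator(HasherConfig config) {
    this.config = config;
  }
}
```

Keying on a value type with `equals` and `hashCode` means two hashers configured identically share a cached instance within a thread, while per-thread storage avoids locking. Any state a reused creator carries between inputs would still need to be cleared before each use.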


---

[GitHub] metron issue #781: METRON-1052: Add forensic similarity hash functions to St...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on the issue:

    https://github.com/apache/metron/pull/781
  
    FUZZY_SCORE and BLOOM_*.  Is it worth trying to unify these similarity-recognizers with similarity hash, or are they too far apart in terms of expected use patterns?


---

[GitHub] metron pull request #781: METRON-1052: Add forensic similarity hash function...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/781#discussion_r141920133
  
    --- Diff: metron-stellar/stellar-common/pom.xml ---
    @@ -51,6 +52,11 @@
                 </exclusions>
             </dependency>
             <dependency>
    --- End diff --
    
    Does there need to be a dependencies csv entry for this?


---

[GitHub] metron issue #781: METRON-1052: Add forensic similarity hash functions to St...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on the issue:

    https://github.com/apache/metron/pull/781
  
    Although I suppose the "Locality-Sensitive" part of TLSH means it operates at word level instead of byte or character level?


---

[GitHub] metron pull request #781: METRON-1052: Add forensic similarity hash function...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/781#discussion_r141933534
  
    --- Diff: metron-stellar/stellar-common/pom.xml ---
    @@ -51,6 +52,11 @@
                 </exclusions>
             </dependency>
             <dependency>
    --- End diff --
    
    Yep, added and dealt with merge conflicts.


---

[GitHub] metron issue #781: METRON-1052: Add forensic similarity hash functions to St...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/781
  
    Ok, I have not worked through the doc, but it looks good to me.  I can stop thinking about how this relates to FUZZY_SCORE.


---

[GitHub] metron issue #781: METRON-1052: Add forensic similarity hash functions to St...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/781
  
    Let me write up a step-by-step use-case doc and I'll call this one done.


---

[GitHub] metron pull request #781: METRON-1052: Add forensic similarity hash function...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/781#discussion_r141919312
  
    --- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/TLSHHasher.java ---
    @@ -0,0 +1,203 @@
    +    TlshCreator creator = new TlshCreator(bucketOption, checksumOption);
    --- End diff --
    
    Can these be cached and reused?  Similar to how we cache regex patterns?


---