Posted to issues@metron.apache.org by GitBox <gi...@apache.org> on 2019/10/03 22:15:53 UTC
[GitHub] [metron] mmiklavc commented on issue #1525: METRON-2274 Flatfile loader and summarizer mapreduce mode broken

mmiklavc commented on issue #1525: METRON-2274 Flatfile loader and summarizer mapreduce mode broken
URL: https://github.com/apache/metron/pull/1525#issuecomment-538150156
 
 
   ## Test Plan
   
   Taken from:
   1. Flatfile loader - https://github.com/apache/metron/pull/432#issuecomment-276733075
   2. Flatfile summarizer - https://github.com/apache/metron/tree/master/use-cases/typosquat_detection#summarize
   
   ### Preliminaries
   
   * Spin up the dev environment for CentOS 6 or 7
   * Running as root is fine
   * The root user needs a home directory in HDFS. You can create it as follows:
   ```
   sudo -u hdfs hdfs dfs -mkdir /user/root
   sudo -u hdfs hdfs dfs -chown root:root /user/root
   ```
   * Download the Alexa top 1m data set
   ```
   wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
   unzip top-1m.csv.zip
   ```
   
   * Stage the import file (the first 10k rows) in HDFS
   ```
   head -n 10000 top-1m.csv > top-10k.csv
   hdfs dfs -put top-10k.csv /tmp
   ```
   
   * Truncate the `enrichment` table in HBase
   ```
   echo "truncate 'enrichment'" | hbase shell
   ```
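
   The loader test later expects exactly 10k rows in HBase, so it can help to confirm that the staged file really has 10,000 lines. A minimal sketch; `has_line_count` is a hypothetical helper name, not something Metron provides:

   ```shell
   # Hypothetical helper: succeed only if FILE ($1) has exactly EXPECTED ($2) lines.
   has_line_count() {
     # wc may pad its output with spaces on some platforms; strip them.
     actual=$(wc -l < "$1" | tr -d '[:space:]')
     [ "$actual" -eq "$2" ]
   }
   ```

   For example, `has_line_count top-10k.csv 10000 || echo "unexpected line count"` run against the local copy of the file.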
   
   ### Test the flatfile loader in MR mode
   
   * Create an `extractor.json` for the CSV data with these contents:
   ```
   {
     "config" : {
       "columns" : {
         "domain" : 1,
         "rank" : 0
       },
       "indicator_column" : "domain",
       "type" : "alexa",
       "separator" : ","
     },
     "extractor" : "CSV"
   }
   ```
   
   * Import from HDFS via MR
   ```
   # import data into hbase 
   $METRON_HOME/bin/flatfile_loader.sh -i /tmp/top-10k.csv -t enrichment -c t -e ./extractor.json -m MR
   # count data written and verify it's 10k
   echo "count 'enrichment'" | hbase shell
   ```
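
   For scripted runs, the `extractor.json` from the first step of this section can be generated without opening an editor. This is only a sketch that writes the same contents with `printf`; it creates the file in the current directory, which is where the `-e ./extractor.json` argument above looks for it:

   ```shell
   # Write the extractor config for the Alexa CSV (column 0 = rank, column 1 = domain).
   printf '%s\n' \
     '{' \
     '  "config" : {' \
     '    "columns" : {' \
     '      "domain" : 1,' \
     '      "rank" : 0' \
     '    },' \
     '    "indicator_column" : "domain",' \
     '    "type" : "alexa",' \
     '    "separator" : ","' \
     '  },' \
     '  "extractor" : "CSV"' \
     '}' > extractor.json
   ```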
   
   ### Test the flatfile summarizer in MR mode
   
   * Create an `extractor_count.json` file in your home directory and paste the following:
   ```
   {
     "config" : {
       "columns" : {
          "rank" : 0,
          "domain" : 1
       },
       "value_transform" : {
          "domain" : "DOMAIN_REMOVE_TLD(domain)"
       },
       "value_filter" : "LENGTH(domain) > 0",
       "state_init" : "0L",
       "state_update" : {
          "state" : "state + LENGTH( DOMAIN_TYPOSQUAT( domain ))"
       },
       "state_merge" : "REDUCE(states, (s, x) -> s + x, 0)",
       "separator" : ","
     },
     "extractor" : "CSV"
   }
   ```
   
   * Create the summary from HDFS via MR
   ```
   $METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e ~/extractor_count.json -p 5 -om CONSOLE -m MR
   ```
   * Verify you see a count in the output similar to the following:
   ```
   Processing /root/top-10k.csv
   19/10/03 21:19:56 WARN resolver.BaseFunctionResolver: Using System classloader
   Processed 9999 - \
   3478276
   ```
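
   To pick the final count out of the surrounding log noise, something like the following could be used. It is a sketch that assumes the summary is printed on a line of its own, as in the sample output above; `last_numeric_line` is a hypothetical helper name, not part of Metron:

   ```shell
   # Hypothetical helper: print the last line of stdin that consists only of digits.
   last_numeric_line() {
     # Assumption: the progress indicator may use carriage returns, so treat
     # them as line breaks before matching.
     tr '\r' '\n' | awk '
       /^[ \t]*[0-9]+[ \t]*$/ { n = $1 }
       END { if (n == "") exit 1; print n }
     '
   }
   ```

   Piping the summarizer command above through it (with `2>&1 | last_numeric_line` appended) should then print only the count, or exit non-zero if no such line appears.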
   
