Posted to commits@spot.apache.org by na...@apache.org on 2018/03/19 19:28:10 UTC

[01/42] incubator-spot git commit: Adding DISCLAIMER to master

Repository: incubator-spot
Updated Branches:
  refs/heads/SPOT-181_ODM 0e3ef34a0 -> ee4e17d7e


Adding DISCLAIMER to master


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/81c371c3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/81c371c3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/81c371c3

Branch: refs/heads/SPOT-181_ODM
Commit: 81c371c3589772051e4c8e8d04e8ad39df477dd9
Parents: 1e9833e
Author: Everardo Lopez Sandoval (Intel) <el...@elopezsa-mac02.zpn.intel.com>
Authored: Fri Aug 4 17:01:55 2017 -0500
Committer: Everardo Lopez Sandoval (Intel) <el...@elopezsa-mac02.zpn.intel.com>
Committed: Fri Aug 4 17:01:55 2017 -0500

----------------------------------------------------------------------
 DISCLAIMER | 11 +++++++++++
 1 file changed, 11 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/81c371c3/DISCLAIMER
----------------------------------------------------------------------
diff --git a/DISCLAIMER b/DISCLAIMER
new file mode 100644
index 0000000..907de92
--- /dev/null
+++ b/DISCLAIMER
@@ -0,0 +1,11 @@
+DISCLAIMER
+
+Apache SPOT (incubating) is an effort undergoing incubation at the Apache
+Software Foundation (ASF), sponsored by the Apache Incubator PMC.
+Incubation is required of all newly accepted projects until a further review
+indicates that the infrastructure, communications, and decision making process
+have stabilized in a manner consistent with other successful ASF projects.
+
+While incubation status is not necessarily a reflection of the completeness or
+stability of the code, it does indicate that the project has yet to be fully
+endorsed by the ASF.


[21/42] incubator-spot git commit: [SPOT-213][SPOT-216] [setup] updated scripts, documentation and spot.conf to support multiple DB engines

Posted by na...@apache.org.
[SPOT-213][SPOT-216] [setup] updated scripts, documentation and spot.conf to support multiple DB engines


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/49f4934c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/49f4934c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/49f4934c

Branch: refs/heads/SPOT-181_ODM
Commit: 49f4934c47e32ccda80111025ececf9e53780f11
Parents: 3383c07
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 12:32:09 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 12:32:09 2018 -0800

----------------------------------------------------------------------
 spot-setup/README.md     |   7 +++
 spot-setup/hdfs_setup.sh | 120 +++++++++++++++++++++++++++++++++++++-----
 spot-setup/spot.conf     |  28 +++++++++-
 3 files changed, 139 insertions(+), 16 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/49f4934c/spot-setup/README.md
----------------------------------------------------------------------
diff --git a/spot-setup/README.md b/spot-setup/README.md
index 1d486a6..c5d245a 100644
--- a/spot-setup/README.md
+++ b/spot-setup/README.md
@@ -21,6 +21,11 @@ To collaborate and run spot-setup, it is required the following prerequisites:
 
 The main script in the repository is **hdfs_setup.sh**, which is responsible for loading environment variables, creating folders in Hadoop for the different use cases (flow, DNS or proxy), creating the Impala database, and finally executing the Impala query scripts that create the tables needed to access netflow, DNS and proxy data.
 
+Options:
+--no-sudo     will execute commands as the existing user while setting `HADOOP_USER_NAME=hdfs`
+-c            specify a custom location for the spot.conf, defaults to /etc/spot.conf
+-d            specify which database client to use, e.g. `-d beeline` NOTE: Impala supports Kerberos
+
 ## Environment Variables
 
 **spot.conf** is the file storing the variables needed during the installation process including node assignment, User interface, Machine Learning and Ingest gateway nodes.
@@ -33,6 +38,8 @@ To read more about these variables, please review the [documentation](http://spo
 
 spot-setup contains a script per use case; as of today, there is a table creation script for each of the DNS, flow and proxy data sets.
 
+The HQL scripts are separated by the underlying database engine into subfolders of ./spot-setup/.
+
 These HQL scripts are intended to be executed as an Impala statement and must comply with HQL standards.
 
 We create tables using the Parquet format to get faster query performance. This format is an industry standard and you can find more information about it on:

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/49f4934c/spot-setup/hdfs_setup.sh
----------------------------------------------------------------------
diff --git a/spot-setup/hdfs_setup.sh b/spot-setup/hdfs_setup.sh
index df898c8..6e73a20 100755
--- a/spot-setup/hdfs_setup.sh
+++ b/spot-setup/hdfs_setup.sh
@@ -17,6 +17,27 @@
 # limitations under the License.
 #
 
+set -e
+
+function log() {
+printf "hdfs_setup.sh:\n $1\n"
+}
+
+function safe_mkdir() {
+        # takes the hdfs command options and a directory
+        # checks for the directory before trying to create it
+        # keeps the script from exiting on existing folders
+        local hdfs_cmd=$1
+        local dir=$2
+        if hdfs dfs -test -d ${dir}; then
+            log "${dir} already exists"
+        else
+            log "running mkdir on ${dir}"
+            ${hdfs_cmd} dfs -mkdir ${dir}
+        fi
+}
+
+SPOTCONF="/etc/spot.conf"
 DSOURCES=('flow' 'dns' 'proxy')
 DFOLDERS=('binary' 
 'stage'
@@ -33,37 +54,108 @@ DFOLDERS=('binary'
 'hive/oa/threat_dendro'
 )
 
+
+# input options
+for arg in "$@"; do
+    case $arg in
+        "--no-sudo")
+            log "not using sudo"
+            no_sudo=true
+            shift
+            ;;
+        "-c")
+            shift
+            SPOTCONF=$1
+            log "Spot Configuration file: ${SPOTCONF}"
+            shift
+            ;;
+        "-d")
+            shift
+            db_override=$1
+            shift
+            ;;
+    esac
+done
+
 # Sourcing spot configuration variables
-source /etc/spot.conf
+log "Sourcing ${SPOTCONF}\n"
+source $SPOTCONF
+
+if [[ ${no_sudo} == "true" ]]; then
+    hdfs_cmd="hdfs"
+
+    if [[ ! -z "${HADOOP_USER_NAME}" ]]; then
+        log "HADOOP_USER_NAME: ${HADOOP_USER_NAME}"
+    else
+        log "setting HADOOP_USER_NAME to hdfs"
+        export HADOOP_USER_NAME=hdfs
+    fi
+else
+    hdfs_cmd="sudo -u hdfs hdfs"
+fi
+
+if [[ -z "${db_override}" ]]; then
+        DBENGINE=$(echo ${DBENGINE} | tr '[:upper:]' '[:lower:]')
+        log "setting database engine to ${DBENGINE}"
+else
+        DBENGINE=$(echo ${db_override} | tr '[:upper:]' '[:lower:]')
+        log "setting database engine to $db_override"
+fi
+
+case ${DBENGINE} in
+    impala)
+        db_shell="impala-shell -i ${IMPALA_DEM}"
+        if [[ ${KERBEROS} == "true" ]]; then
+            db_shell="${db_shell} -k"
+        fi
+        db_query="${db_shell} -q"
+        db_script="${db_shell} --var=huser=${HUSER} --var=dbname=${DBNAME} -c -f"
+        ;;
+    hive)
+        db_shell="hive"
+        db_query="${db_shell} -e"
+        db_script="${db_shell} -hiveconf huser=${HUSER} -hiveconf dbname=${DBNAME} -f"
+        ;;
+    beeline)
+        db_shell="beeline -u jdbc:${JDBC_URL}"
+        db_query="${db_shell} -e"
+        db_script="${db_shell} --hivevar huser=${HUSER} --hivevar dbname=${DBNAME} -f"
+        ;;
+    *)
+        log "DBENGINE not compatible or not set in spot.conf: DBENGINE--> ${DBENGINE:-empty}"
+        exit 1
+        ;;
+esac
 
 # Creating HDFS user's folder
-sudo -u hdfs hdfs dfs -mkdir ${HUSER}
-sudo -u hdfs hdfs dfs -chown ${USER}:supergroup ${HUSER}
-sudo -u hdfs hdfs dfs -chmod 775 ${HUSER}
+safe_mkdir ${hdfs_cmd} ${HUSER}
+${hdfs_cmd} dfs -chown ${USER}:supergroup ${HUSER}
+${hdfs_cmd} dfs -chmod 775 ${HUSER}
 
 # Creating HDFS paths for each use case
 for d in "${DSOURCES[@]}" 
-do 
+do
 	echo "creating /$d"
-	hdfs dfs -mkdir ${HUSER}/$d 
+	safe_mkdir hdfs ${HUSER}/$d
 	for f in "${DFOLDERS[@]}" 
 	do 
 		echo "creating $d/$f"
-		hdfs dfs -mkdir ${HUSER}/$d/$f
+		safe_mkdir ${hdfs_cmd} ${HUSER}/$d/$f
 	done
 
 	# Modifying permission on HDFS folders to allow Impala to read/write
 	hdfs dfs -chmod -R 775 ${HUSER}/$d
-	sudo -u hdfs hdfs dfs -setfacl -R -m user:impala:rwx ${HUSER}/$d
-	sudo -u hdfs hdfs dfs -setfacl -R -m user:${USER}:rwx ${HUSER}/$d
+	${hdfs_cmd} dfs -setfacl -R -m user:${db_override}:rwx ${HUSER}/$d
+	${hdfs_cmd} dfs -setfacl -R -m user:${USER}:rwx ${HUSER}/$d
 done
 
+
 # Creating Spot Database
-impala-shell -i ${IMPALA_DEM} -q "CREATE DATABASE IF NOT EXISTS ${DBNAME};"
+${db_query} "CREATE DATABASE IF NOT EXISTS ${DBNAME};"
+
 
-# Creating Impala tables
+# Creating tables
 for d in "${DSOURCES[@]}" 
-do 
-	impala-shell -i ${IMPALA_DEM} --var=huser=${HUSER} --var=dbname=${DBNAME} -c -f create_${d}_parquet.hql
+do
+	${db_script} "./${DBENGINE}/create_${d}_parquet.hql"
 done
-

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/49f4934c/spot-setup/spot.conf
----------------------------------------------------------------------
diff --git a/spot-setup/spot.conf b/spot-setup/spot.conf
index a0cba3d..aa08ea7 100755
--- a/spot-setup/spot.conf
+++ b/spot-setup/spot.conf
@@ -19,7 +19,6 @@
 UINODE='node03'
 MLNODE='node04'
 GWNODE='node16'
-DBNAME='spot'
 
 #hdfs - base user and data source config
 HUSER='/user/spot'
@@ -30,10 +29,35 @@ PROXY_PATH=${HUSER}/${DSOURCE}/hive/y=${YR}/m=${MH}/d=${DY}/
 FLOW_PATH=${HUSER}/${DSOURCE}/hive/y=${YR}/m=${MH}/d=${DY}/
 HPATH=${HUSER}/${DSOURCE}/scored_results/${FDATE}
 
-#impala config
+# Database
+DBNAME='spot'
+DBENGINE="" # hive,impala and beeline supported
+JDBC_URL="" # example hive2://node01:10000/default;principal=hive/node01@REALM.COM
+
+# impala config
 IMPALA_DEM=node04
 IMPALA_PORT=21050
 
+# Hive Server2
+HS2_HOST=''
+HS2_PORT=''
+
+#kerberos config
+KERBEROS='false'
+KINIT=/usr/bin/kinit
+PRINCIPAL='user'
+KEYTAB='/opt/security/user.keytab'
+SASL_MECH='GSSAPI'
+SECURITY_PROTO='sasl_plaintext'
+KAFKA_SERVICE_NAME=''
+
+#ssl config
+SSL='false'
+SSL_VERIFY='true'
+CA_LOCATION=''
+CERT=''
+KEY=''
+
 #local fs base user and data source config
 LUSER='/home/spot'
 LPATH=${LUSER}/ml/${DSOURCE}/${FDATE}


[17/42] incubator-spot git commit: [SPOT-213][SPOT-250][OA][DATA] temp fix for impala calls, add TODO for impyla conversion

Posted by na...@apache.org.
[SPOT-213][SPOT-250][OA][DATA] temp fix for impala calls, add TODO for impyla conversion


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/0e749191
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/0e749191
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/0e749191

Branch: refs/heads/SPOT-181_ODM
Commit: 0e749191311a3c1695cd40322c1b5788cc56e50c
Parents: d1f5a67
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 11:06:50 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 11:06:50 2018 -0800

----------------------------------------------------------------------
 spot-oa/oa/components/data/hive.py   |  1 +
 spot-oa/oa/components/data/impala.py | 25 +++++++++++++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/0e749191/spot-oa/oa/components/data/hive.py
----------------------------------------------------------------------
diff --git a/spot-oa/oa/components/data/hive.py b/spot-oa/oa/components/data/hive.py
index a7c1d4b..7d2eaa2 100644
--- a/spot-oa/oa/components/data/hive.py
+++ b/spot-oa/oa/components/data/hive.py
@@ -24,6 +24,7 @@ class Engine(object):
         self._pipeline = pipeline
 
     def query(self,query,output_file=None, delimiter=','):
+        # TODO: fix kerberos compatibility, use impyla
         hive_config = "set mapred.max.split.size=1073741824;set hive.exec.reducers.max=10;set hive.cli.print.header=true;"
         
         del_format = "| sed 's/[\t]/{0}/g'".format(delimiter)

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/0e749191/spot-oa/oa/components/data/impala.py
----------------------------------------------------------------------
diff --git a/spot-oa/oa/components/data/impala.py b/spot-oa/oa/components/data/impala.py
index bfc1c5a..10d1f5b 100644
--- a/spot-oa/oa/components/data/impala.py
+++ b/spot-oa/oa/components/data/impala.py
@@ -16,6 +16,8 @@
 #
 
 from subprocess import check_output
+from common import configurator
+
 
 class Engine(object):
 
@@ -24,17 +26,32 @@ class Engine(object):
         self._daemon_node = conf['impala_daemon']
         self._db = db
         self._pipeline = pipeline
-        impala_cmd = "impala-shell -i {0} --quiet -q 'INVALIDATE METADATA {1}.{2}'".format(self._daemon_node,self._db, self._pipeline)
+
+        if configurator.kerberos_enabled():
+            self._impala_shell = "impala-shell -k -i {0} --quiet".format(self._daemon_node)
+        else:
+            self._impala_shell = "impala-shell -i {0} --quiet".format(self._daemon_node)
+
+        impala_cmd = "{0} -q 'INVALIDATE METADATA {1}.{2}'".format(self._impala_shell, self._db, self._pipeline)
         check_output(impala_cmd,shell=True)
     
-        impala_cmd = "impala-shell -i {0} --quiet -q 'REFRESH {1}.{2}'".format(self._daemon_node,self._db, self._pipeline)
+        impala_cmd = "{0} -q 'REFRESH {1}.{2}'".format(self._impala_shell, self._db, self._pipeline)
         check_output(impala_cmd,shell=True)
 
     def query(self,query,output_file=None,delimiter=","):
 
         if output_file:
-            impala_cmd = "impala-shell -i {0} --quiet --print_header -B --output_delimiter='{1}' -q \"{2}\" -o {3}".format(self._daemon_node,delimiter,query,output_file)
+            impala_cmd = "{0} --print_header -B --output_delimiter='{1}' -q \"{2}\" -o {3}".format(
+                self._impala_shell,
+                delimiter,
+                query,
+                output_file
+            )
         else:
-            impala_cmd = "impala-shell -i {0} --quiet --print_header -B --output_delimiter='{1}' -q \"{2}\"".format(self._daemon_node,delimiter,query)
+            impala_cmd = "{0} --print_header -B --output_delimiter='{1}' -q \"{2}\"".format(
+                self._impala_shell,
+                delimiter,
+                query
+            )
 
         check_output(impala_cmd,shell=True)

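For reference, a minimal sketch of the command construction this change introduces; it assumes only that a boolean Kerberos flag is available (as `configurator.kerberos_enabled()` provides in the companion commit), and the daemon host below is a placeholder.

def build_impala_shell_prefix(daemon_node, kerberos_enabled):
    # Mirrors the Engine constructor above: append -k only when Kerberos
    # is enabled in /etc/spot.conf.
    prefix = "impala-shell -i {0} --quiet".format(daemon_node)
    if kerberos_enabled:
        prefix += " -k"
    return prefix

# build_impala_shell_prefix("node04", True) -> "impala-shell -i node04 --quiet -k"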

[16/42] incubator-spot git commit: [SPOT-213][SPOT-250][OA][API] add kerberos support

Posted by na...@apache.org.
[SPOT-213][SPOT-250][OA][API] add kerberos support


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/d1f5a67f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/d1f5a67f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/d1f5a67f

Branch: refs/heads/SPOT-181_ODM
Commit: d1f5a67f929090e2bad865d53d4389c69b176fc5
Parents: 7376c5e
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 11:02:39 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 11:02:39 2018 -0800

----------------------------------------------------------------------
 spot-oa/api/resources/configurator.py  |  69 +++++++++-
 spot-oa/api/resources/hdfs_client.py   | 201 ++++++++++++++++++++++++----
 spot-oa/api/resources/impala_engine.py |  29 +++-
 3 files changed, 262 insertions(+), 37 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/d1f5a67f/spot-oa/api/resources/configurator.py
----------------------------------------------------------------------
diff --git a/spot-oa/api/resources/configurator.py b/spot-oa/api/resources/configurator.py
index 5bda045..017732d 100644
--- a/spot-oa/api/resources/configurator.py
+++ b/spot-oa/api/resources/configurator.py
@@ -14,35 +14,90 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+
 import ConfigParser
-import os
+from io import open
+
 
 def configuration():
 
-    conf_file = "/etc/spot.conf"
     config = ConfigParser.ConfigParser()
-    config.readfp(SecHead(open(conf_file)))
+
+    try:
+        conf = open("/etc/spot.conf", "r")
+    except (OSError, IOError) as e:
+        print("Error opening: spot.conf, error: {0}".format(e))
+        raise e
+
+    config.readfp(SecHead(conf))
     return config
 
+
 def db():
     conf = configuration()
-    return conf.get('conf', 'DBNAME').replace("'","").replace('"','')
+    return conf.get('conf', 'DBNAME').replace("'", "").replace('"', '')
+
 
 def impala():
     conf = configuration()
-    return conf.get('conf', 'IMPALA_DEM'),conf.get('conf', 'IMPALA_PORT')
+    return conf.get('conf', 'IMPALA_DEM'), conf.get('conf', 'IMPALA_PORT')
+
 
 def hdfs():
     conf = configuration()
     name_node = conf.get('conf',"NAME_NODE")
     web_port = conf.get('conf',"WEB_PORT")
     hdfs_user = conf.get('conf',"HUSER")
-    hdfs_user = hdfs_user.split("/")[-1].replace("'","").replace('"','')
+    hdfs_user = hdfs_user.split("/")[-1].replace("'", "").replace('"', '')
     return name_node,web_port,hdfs_user
 
+
 def spot():
     conf = configuration()
-    return conf.get('conf',"HUSER").replace("'","").replace('"','')
+    return conf.get('conf',"HUSER").replace("'", "").replace('"', '')
+
+
+def kerberos_enabled():
+    conf = configuration()
+    enabled = conf.get('conf', 'KERBEROS').replace("'", "").replace('"', '')
+    if enabled.lower() == 'true':
+        return True
+    else:
+        return False
+
+
+def kerberos():
+    conf = configuration()
+    if kerberos_enabled():
+        principal = conf.get('conf', 'PRINCIPAL')
+        keytab = conf.get('conf', 'KEYTAB')
+        sasl_mech = conf.get('conf', 'SASL_MECH')
+        security_proto = conf.get('conf', 'SECURITY_PROTO')
+        return principal, keytab, sasl_mech, security_proto
+    else:
+        raise KeyError
+
+
+def ssl_enabled():
+    conf = configuration()
+    enabled = conf.get('conf', 'SSL')
+    if enabled.lower() == 'true':
+        return True
+    else:
+        return False
+
+
+def ssl():
+    conf = configuration()
+    if ssl_enabled():
+        ssl_verify = conf.get('conf', 'SSL_VERIFY')
+        ca_location = conf.get('conf', 'CA_LOCATION')
+        cert = conf.get('conf', 'CERT')
+        key = conf.get('conf', 'KEY')
+        return ssl_verify, ca_location, cert, key
+    else:
+        raise KeyError
+
 
 class SecHead(object):
     def __init__(self, fp):

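A short usage sketch of the helpers added above, relying only on the signatures visible in this diff; the unpacked names mirror the tuples the functions return.

import api.resources.configurator as config

if config.kerberos_enabled():
    principal, keytab, sasl_mech, security_proto = config.kerberos()

if config.ssl_enabled():
    ssl_verify, ca_location, cert, key = config.ssl()
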
http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/d1f5a67f/spot-oa/api/resources/hdfs_client.py
----------------------------------------------------------------------
diff --git a/spot-oa/api/resources/hdfs_client.py b/spot-oa/api/resources/hdfs_client.py
index 31c5eba..e7f6bec 100644
--- a/spot-oa/api/resources/hdfs_client.py
+++ b/spot-oa/api/resources/hdfs_client.py
@@ -14,63 +14,216 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-from hdfs import InsecureClient
+
 from hdfs.util import HdfsError
+from hdfs import Client
+from hdfs.ext.kerberos import KerberosClient
+from requests import Session
 from json import dump
-import api.resources.configurator as Config
+from threading import Lock
+import logging
+import configurator as Config
+from sys import stderr
+
+
+class Progress(object):
+
+    """Basic progress tracker callback."""
+
+    def __init__(self, hdfs_path, nbytes):
+        self._data = {}
+        self._lock = Lock()
+        self._hpath = hdfs_path
+        self._nbytes = nbytes
+
+    def __call__(self):
+        with self._lock:
+            if self._nbytes >= 0:
+                self._data[self._hpath] = self._nbytes
+            else:
+                stderr.write('%s\n' % (sum(self._data.values()), ))
+
+
+class SecureKerberosClient(KerberosClient):
+
+    """A new client subclass for handling HTTPS connections with Kerberos.
+
+    :param url: URL to namenode.
+    :param cert: Local certificate. See `requests` documentation for details
+      on how to use this.
+    :param verify: Whether to check the host's certificate. WARNING: disabling this is for non-production use only
+    :param \*\*kwargs: Keyword arguments passed to the default `Client`
+      constructor.
+
+    """
+
+    def __init__(self, url, mutual_auth, cert=None, verify='true', **kwargs):
+
+        self._logger = logging.getLogger("SPOT.INGEST.HDFS_client")
+        session = Session()
+
+        if verify == 'true':
+            self._logger.info('SSL verification enabled')
+            session.verify = True
+            if cert is not None:
+                self._logger.info('SSL Cert: ' + cert)
+                if ',' in cert:
+                    session.cert = [path.strip() for path in cert.split(',')]
+                else:
+                    session.cert = cert
+        elif verify == 'false':
+            session.verify = False
+
+        super(SecureKerberosClient, self).__init__(url, mutual_auth, session=session, **kwargs)
+
 
+class HdfsException(HdfsError):
+    def __init__(self, message):
+        super(HdfsException, self).__init__(message)
+        self.message = message
+
+
+def get_client(user=None):
+    # type: (object) -> Client
+
+    logger = logging.getLogger('SPOT.INGEST.HDFS.get_client')
+    hdfs_nm, hdfs_port, hdfs_user = Config.hdfs()
+    conf = {'url': '{0}:{1}'.format(hdfs_nm, hdfs_port)}
+
+    if Config.ssl_enabled():
+        ssl_verify, ca_location, cert, key = Config.ssl()
+        conf.update({'verify': ssl_verify.lower()})
+        if cert:
+            conf.update({'cert': cert})
+
+    if Config.kerberos_enabled():
+        krb_conf = {'mutual_auth': 'OPTIONAL'}
+        conf.update(krb_conf)
+
+    # TODO: possible user parameter
+    logger.info('Client conf:')
+    for k,v in conf.iteritems():
+        logger.info(k + ': ' + v)
+
+    client = SecureKerberosClient(**conf)
 
-def _get_client(user=None):
-    hdfs_nm,hdfs_port,hdfs_user = Config.hdfs()
-    client = InsecureClient('http://{0}:{1}'.format(hdfs_nm,hdfs_port), user= user if user else hdfs_user)
     return client
 
-def get_file(hdfs_file):
-    client = _get_client()
+
+def get_file(hdfs_file, client=None):
+    if not client:
+        client = get_client()
+
     with client.read(hdfs_file) as reader:
         results = reader.read()
         return results
 
-def put_file_csv(hdfs_file_content,hdfs_path,hdfs_file_name,append_file=False,overwrite_file=False):
-    
+
+def upload_file(hdfs_fp, local_fp, overwrite=False, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        result = client.upload(hdfs_fp, local_fp, overwrite=overwrite, progress=Progress)
+        return result
+    except HdfsError as err:
+        return err
+
+
+def download_file(hdfs_path, local_path, overwrite=False, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        client.download(hdfs_path, local_path, overwrite=overwrite)
+        return True
+    except HdfsError:
+        return False
+
+
+def mkdir(hdfs_path, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        client.makedirs(hdfs_path)
+        return True
+    except HdfsError:
+        return False
+
+
+def put_file_csv(hdfs_file_content,hdfs_path,hdfs_file_name,append_file=False,overwrite_file=False, client=None):
+    if not client:
+        client = get_client()
+
     try:
-        client = _get_client()
         hdfs_full_name = "{0}/{1}".format(hdfs_path,hdfs_file_name)
         with client.write(hdfs_full_name,append=append_file,overwrite=overwrite_file) as writer:
             for item in hdfs_file_content:
                 data = ','.join(str(d) for d in item)
                 writer.write("{0}\n".format(data))
         return True
-        
+
     except HdfsError:
         return False
 
-def put_file_json(hdfs_file_content,hdfs_path,hdfs_file_name,append_file=False,overwrite_file=False):
-    
+
+def put_file_json(hdfs_file_content,hdfs_path,hdfs_file_name,append_file=False,overwrite_file=False, client=None):
+    if not client:
+        client = get_client()
+
     try:
-        client = _get_client()
         hdfs_full_name = "{0}/{1}".format(hdfs_path,hdfs_file_name)
         with client.write(hdfs_full_name,append=append_file,overwrite=overwrite_file,encoding='utf-8') as writer:
-	        dump(hdfs_file_content, writer)
+            dump(hdfs_file_content, writer)
         return True
     except HdfsError:
         return False
-    
 
-def delete_folder(hdfs_file,user=None):
-    client = _get_client(user)
-    client.delete(hdfs_file,recursive=True)
 
-def list_dir(hdfs_path):
+def delete_folder(hdfs_file, user=None, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        client.delete(hdfs_file,recursive=True)
+    except HdfsError:
+        return False
+
+
+def check_dir(hdfs_path, client=None):
+    """
+    Returns True if directory exists
+    Returns False if directory does not exist
+    : param hdfs_path: path to check
+    : object client: hdfs client object for persistent connection
+    """
+    if not client:
+        client = get_client()
+
+    try:
+        client.list(hdfs_path)
+        return True
+    except HdfsError:
+        return False
+
+
+def list_dir(hdfs_path, client=None):
+    if not client:
+        client = get_client()
+
     try:
-        client = _get_client()
         return client.list(hdfs_path)
     except HdfsError:
         return {}
 
-def file_exists(hdfs_path,file_name):
-    files = list_dir(hdfs_path)
+
+def file_exists(hdfs_path, file_name, client=None):
+    if not client:
+        client = get_client()
+
+    files = list_dir(hdfs_path, client=client)
     if str(file_name) in files:
-	    return True
+        return True
     else:
         return False

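A hedged usage sketch of the reworked HDFS helpers: each function now takes an optional `client` argument so a single connection can be reused across calls. The module path and HDFS paths below are illustrative assumptions.

import api.resources.hdfs_client as hdfs

client = hdfs.get_client()   # builds the client from spot.conf (SSL/Kerberos aware)
if not hdfs.check_dir('/user/spot/flow', client=client):
    hdfs.mkdir('/user/spot/flow', client=client)
hdfs.upload_file('/user/spot/flow/sample.csv', '/tmp/sample.csv',
                 overwrite=True, client=client)
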
http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/d1f5a67f/spot-oa/api/resources/impala_engine.py
----------------------------------------------------------------------
diff --git a/spot-oa/api/resources/impala_engine.py b/spot-oa/api/resources/impala_engine.py
index b7d0148..542bbd0 100644
--- a/spot-oa/api/resources/impala_engine.py
+++ b/spot-oa/api/resources/impala_engine.py
@@ -15,15 +15,33 @@
 # limitations under the License.
 #
 from impala.dbapi import connect
-import api.resources.configurator as Config
+import api.resources.configurator as config
+
 
 def create_connection():
 
-    impala_host, impala_port =  Config.impala()
-    db = Config.db()
-    conn = connect(host=impala_host, port=int(impala_port),database=db)
+    impala_host, impala_port = config.impala()
+    conf = {}
+
+    # TODO: if using hive, kerberos service name must be changed, impyla sets 'impala' as default
+    service_name = {'kerberos_service_name': 'impala'}
+
+    if config.kerberos_enabled():
+        principal, keytab, sasl_mech, security_proto = config.kerberos()
+        conf.update({'auth_mechanism': 'GSSAPI',
+                     })
+
+    if config.ssl_enabled():
+        ssl_verify, ca_location, cert, key = config.ssl()
+        conf.update({'ca_cert': cert,
+                     'use_ssl': ssl_verify
+                     })
+
+    db = config.db()
+    conn = connect(host=impala_host, port=int(impala_port), database=db, **conf)
     return conn.cursor()
 
+
 def execute_query(query,fetch=False):
 
     impala_cursor = create_connection()
@@ -31,6 +49,7 @@ def execute_query(query,fetch=False):
 
     return impala_cursor if not fetch else impala_cursor.fetchall()
 
+
 def execute_query_as_list(query):
 
     query_results = execute_query(query)
@@ -46,5 +65,3 @@ def execute_query_as_list(query):
         row_result = {}
 
     return results
-
-

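For context, a minimal usage sketch of the two engine entry points touched above; the query text is illustrative only.

from api.resources.impala_engine import execute_query, execute_query_as_list

# Returns a cursor, or the fetched rows when fetch=True.
rows = execute_query("SELECT COUNT(*) FROM dns_scores", fetch=True)

# Returns a list of dicts keyed by column name.
tables = execute_query_as_list("SHOW TABLES")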

[38/42] incubator-spot git commit: whitespace fix

Posted by na...@apache.org.
whitespace fix


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/4ebf4be1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/4ebf4be1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/4ebf4be1

Branch: refs/heads/SPOT-181_ODM
Commit: 4ebf4be1190da91bb23aab73db93dae0bc7f17ba
Parents: 6b79abb
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 12:42:57 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 12:42:57 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/4ebf4be1/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 2f5da0d..541abb5 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -197,5 +197,5 @@ def bluecoat_parse(zk, topic, db, db_table, num_of_workers, batch_size):
     ssc.awaitTermination()
 
 
-if __name__ =='__main__':
+if __name__ == '__main__':
     main()


[07/42] incubator-spot git commit: Fix dns_oa.py to stop crashing FBThreatExchange

Posted by na...@apache.org.
Fix dns_oa.py to stop crashing FBThreatExchange

Stop the start_oa.py script from crashing while performing the reputation check for the dns_results.csv domains.
https://issues.apache.org/jira/browse/SPOT-238

Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/3e7b628c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/3e7b628c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/3e7b628c

Branch: refs/heads/SPOT-181_ODM
Commit: 3e7b628cccdefbf021d9b335bd113108880a32ba
Parents: dbf6f51
Author: lighteternal <dp...@gmail.com>
Authored: Tue Oct 31 12:57:29 2017 +0200
Committer: GitHub <no...@github.com>
Committed: Tue Oct 31 12:57:29 2017 +0200

----------------------------------------------------------------------
 spot-oa/oa/dns/dns_oa.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3e7b628c/spot-oa/oa/dns/dns_oa.py
----------------------------------------------------------------------
diff --git a/spot-oa/oa/dns/dns_oa.py b/spot-oa/oa/dns/dns_oa.py
index 5982e8b..5023d7f 100644
--- a/spot-oa/oa/dns/dns_oa.py
+++ b/spot-oa/oa/dns/dns_oa.py
@@ -232,7 +232,7 @@ class OA(object):
                     rep_results = {k: "{0}::{1}".format(rep_results.get(k, ""), result.get(k, "")).strip('::') for k in set(rep_results) | set(result)}
 
                 if rep_results:
-                    self._dns_scores = [ conn + [ rep_results[conn[key]] ]   for conn in self._dns_scores  ]
+                    self._dns_scores = [ conn + [ rep_results.get(key) ]    for conn in self._dns_scores  ]
                 else:
                     self._dns_scores = [ conn + [""]   for conn in self._dns_scores  ]
         else:
@@ -418,4 +418,4 @@ class OA(object):
             query_to_insert=("""
                 INSERT INTO {0}.dns_ingest_summary PARTITION (y={1}, m={2}, d={3}) VALUES {4};
             """).format(self._db, yr, mn, dy, tuple(df_final))
-            impala.execute_query(query_to_insert)
\ No newline at end of file
+            impala.execute_query(query_to_insert)

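The fix replaces a direct dictionary lookup with `dict.get()`, which is what stops the crash on missing keys. A tiny illustration (values are made up):

rep_results = {"bad.example.com": "fb::SUSPICIOUS"}

rep_results["unknown.example.org"]       # raises KeyError and aborts the OA run
rep_results.get("unknown.example.org")   # returns None, so scoring can continue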

[13/42] incubator-spot git commit: Merge branch 'pr/128' to close apache/incubator-spot#128

Posted by na...@apache.org.
Merge branch 'pr/128' to close apache/incubator-spot#128


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/2294bf43
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/2294bf43
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/2294bf43

Branch: refs/heads/SPOT-181_ODM
Commit: 2294bf436222eec2f84d2dbe2ea283f8e3a608c3
Parents: b5299b5 f935f1e
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 9 19:21:59 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 9 19:21:59 2018 -0800

----------------------------------------------------------------------
 spot-ingest/common/kafka_topic.sh      | 2 +-
 spot-ingest/start_ingest_standalone.sh | 2 +-
 spot-ml/ml_ops.sh                      | 2 +-
 spot-ml/ml_test.sh                     | 4 ++--
 spot-oa/runIpython.sh                  | 2 +-
 spot-setup/hdfs_setup.sh               | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
----------------------------------------------------------------------



[37/42] incubator-spot git commit: fixed bluecoat_parse()

Posted by na...@apache.org.
fixed bluecoat_parse()


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/6b79abbb
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/6b79abbb
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/6b79abbb

Branch: refs/heads/SPOT-181_ODM
Commit: 6b79abbb079d99d283664382fba131864049f1fa
Parents: 2ea6b4e
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 12:40:38 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 12:42:22 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/6b79abbb/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 597c13c..2f5da0d 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -170,21 +170,30 @@ def save_data(rdd, sqc, db, db_table, topic):
         print("------------------------LISTENING KAFKA TOPIC:{0}------------------------".format(topic))
 
 
-def bluecoat_parse(zk,topic,db,db_table,num_of_workers,batch_size):
-    
+def bluecoat_parse(zk, topic, db, db_table, num_of_workers, batch_size):
+    """
+    Parse and save bluecoat logs.
+
+    :param zk: Apache ZooKeeper quorum
+    :param topic: Apache Kafka topic (application name)
+    :param db: Apache Hive database to save into
+    :param db_table: table of `db` to save into
+    :param num_of_workers: number of Apache Kafka workers
+    :param batch_size: batch size for Apache Spark streaming context
+    """
     app_name = topic
     wrks = int(num_of_workers)
 
     # create spark context
     sc = SparkContext(appName=app_name)
-    ssc = StreamingContext(sc,int(batch_size))
+    ssc = StreamingContext(sc, int(batch_size))
     sqc = HiveContext(sc)
 
     tp_stream = KafkaUtils.createStream(ssc, zk, app_name, {topic: wrks}, keyDecoder=spot_decoder, valueDecoder=spot_decoder)
 
-    proxy_data = tp_stream.map(lambda row: row[1]).flatMap(lambda row: row.split("\n")).filter(lambda row: rex_date.match(row)).map(lambda row: row.strip("\n").strip("\r").replace("\t", " ").replace("  ", " ")).map(lambda row:  split_log_entry(row)).map(lambda row: proxy_parser(row))
-    saved_data = proxy_data.foreachRDD(lambda row: save_data(row,sqc,db,db_table,topic))
-    ssc.start();
+    proxy_data = tp_stream.map(lambda row: row[1]).flatMap(lambda row: row.split("\n")).filter(lambda row: rex_date.match(row)).map(lambda row: row.strip("\n").strip("\r").replace("\t", " ").replace("  ", " ")).map(lambda row: split_log_entry(row)).map(lambda row: proxy_parser(row))
+    saved_data = proxy_data.foreachRDD(lambda row: save_data(row, sqc, db, db_table, topic))
+    ssc.start()
     ssc.awaitTermination()
 
 

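An illustrative call matching the docstring added above; every value is a placeholder, not a tested configuration.

bluecoat_parse(
    zk="zookeeper01:2181",   # Apache ZooKeeper quorum
    topic="spot-proxy",      # Kafka topic / application name
    db="spot",               # Hive database
    db_table="proxy",        # destination table
    num_of_workers=4,        # Kafka consumer workers
    batch_size=30,           # Spark Streaming batch interval, in seconds
)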

[02/42] incubator-spot git commit: Remove hardcoded gti reputation

Posted by na...@apache.org.
Remove hardcoded gti reputation

Use the reputation services defined by the user in the reputation JSON file instead.

Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/671dfd77
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/671dfd77
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/671dfd77

Branch: refs/heads/SPOT-181_ODM
Commit: 671dfd773bee3d59c1a0c0127d9c98bfa1da0de0
Parents: 2ebe572
Author: castleguarders <ca...@users.noreply.github.com>
Authored: Tue Sep 26 10:55:51 2017 -0700
Committer: GitHub <no...@github.com>
Committed: Tue Sep 26 10:55:51 2017 -0700

----------------------------------------------------------------------
 spot-oa/oa/flow/flow_oa.py | 43 +++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 19 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/671dfd77/spot-oa/oa/flow/flow_oa.py
----------------------------------------------------------------------
diff --git a/spot-oa/oa/flow/flow_oa.py b/spot-oa/oa/flow/flow_oa.py
index 53cec6b..000d9d0 100644
--- a/spot-oa/oa/flow/flow_oa.py
+++ b/spot-oa/oa/flow/flow_oa.py
@@ -34,7 +34,6 @@ from multiprocessing import Process
 from utils import Util, ProgressBar
 from components.data.data import Data
 from components.geoloc.geoloc import GeoLocalization
-from components.reputation.gti import gti
 from impala.util import as_pandas
 import time
 
@@ -267,37 +266,49 @@ class OA(object):
         # read configuration.
         self._logger.info("Reading reputation configuration file: {0}".format(reputation_conf_file))
         rep_conf = json.loads(open(reputation_conf_file).read())
- 
-        if "gti" in rep_conf and os.path.isfile(rep_conf['gti']['refclient']):
-            rep_conf = rep_conf['gti']
-            # initialize gti module.
-            self._logger.info("Initializing GTI component")
-            flow_gti = gti.Reputation(rep_conf,self._logger)
 
-            # get all src ips.
+        # initialize reputation services.
+        self._rep_services = []
+        self._logger.info("Initializing reputation services.")
+        for service in rep_conf:
+            config = rep_conf[service]
+            module = __import__("components.reputation.{0}.{0}".format(service), fromlist=['Reputation'])
+            self._rep_services.append(module.Reputation(config,self._logger))
+
+        if self._rep_services:
+
+            # get all src ips.
             src_ip_index = self._conf["flow_score_fields"]["srcIP"]
             dst_ip_index = self._conf["flow_score_fields"]["dstIP"]
 
-            self._logger.info("Getting GTI reputation for src IPs")
             flow_scores_src = iter(self._flow_scores)
 
             # getting reputation for src IPs
             src_ips = [ conn[src_ip_index] for conn in flow_scores_src ]            
-            src_rep_results = flow_gti.check(src_ips)
+            self._logger.info("Getting reputation for each service in config")
+            src_rep_results = {}
+            for rep_service in self._rep_services:
+                # if more than one reputation service is defined, the last ip match remains after merge
+                # Example fb: returns an entry for every ip, including unknown ones
+                # which overwrites other services that have previously returned a match. Same for dstip
+                # In future should consider a weighted merge, or UX should support multiple reps per IP
+                src_rep_results = dict(rep_service.check(src_ips).items() + src_rep_results.items())
 
-            self._logger.info("Getting GTI reputation for dst IPs")
             flow_scores_dst = iter(self._flow_scores)
 
             # getting reputation for dst IPs            
             dst_ips = [  conn[dst_ip_index] for conn in flow_scores_dst ]
-            dst_rep_results = flow_gti.check(dst_ips)
+            dst_rep_results = {}
+            for rep_service in self._rep_services:
+                dst_rep_results = dict(rep_service.check(dst_ips).items() + dst_rep_results.items())
 
+
             flow_scores_final = iter(self._flow_scores)
 
             self._flow_scores = []
             flow_scores = [conn + [src_rep_results[conn[src_ip_index]]] + [dst_rep_results[conn[dst_ip_index]]] for conn in flow_scores_final ]
             self._flow_scores = flow_scores           
-            
+
         else:
             # add values to gtiSrcRep and gtiDstRep.
             flow_scores = iter(self._flow_scores)
@@ -460,9 +471,3 @@ class OA(object):
                 
         else:
             self._logger.info("No data found for the ingest summary")
-
-
-
- 
-
-        

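A reduced sketch of the dynamic-import pattern this change introduces, with a hypothetical config entry and a stand-in logger; the package layout comes from the import string above.

import logging

logger = logging.getLogger("OA.Flow")
rep_conf = {"gti": {"refclient": "/usr/bin/tie_client"}}   # hypothetical entry

rep_services = []
for service in rep_conf:
    # components/reputation/<service>/<service>.py must expose a Reputation class
    module = __import__("components.reputation.{0}.{0}".format(service),
                        fromlist=['Reputation'])
    rep_services.append(module.Reputation(rep_conf[service], logger))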

[28/42] incubator-spot git commit: Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spot into pr/122

Posted by na...@apache.org.
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spot into pr/122


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/e15d38c6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/e15d38c6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/e15d38c6

Branch: refs/heads/SPOT-181_ODM
Commit: e15d38c6df997729ee0e51b02a00366ef2c3a941
Parents: 671dfd7 935dfb4
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 23 17:06:41 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 23 17:06:41 2018 -0800

----------------------------------------------------------------------
 spot-ingest/common/kafka_topic.sh               |   2 +-
 spot-ingest/master_collector.py                 |  60 ++---
 spot-ingest/start_ingest_standalone.sh          |   2 +-
 spot-ingest/worker.py                           |  56 +++--
 spot-ml/ml_ops.sh                               |   2 +-
 spot-ml/ml_test.sh                              |   4 +-
 .../dns/model/DNSSuspiciousConnectsModel.scala  |  43 ++--
 .../org/apache/spot/lda/SpotLDAHelper.scala     | 173 ++++++++++++++
 .../org/apache/spot/lda/SpotLDAModel.scala      | 139 +++++++++++
 .../org/apache/spot/lda/SpotLDAResult.scala     |  43 ++++
 .../org/apache/spot/lda/SpotLDAWrapper.scala    | 226 +++---------------
 .../model/FlowSuspiciousConnectsModel.scala     |  27 +--
 .../proxy/ProxySuspiciousConnectsModel.scala    |  25 +-
 .../org/apache/spot/utilities/TopDomains.scala  |   1 -
 .../org/apache/spot/lda/SpotLDAHelperTest.scala | 133 +++++++++++
 .../apache/spot/lda/SpotLDAWrapperTest.scala    | 236 ++++++-------------
 spot-oa/api/resources/flow.py                   |   6 +-
 spot-oa/oa/dns/dns_oa.py                        |   4 +-
 spot-oa/requirements.txt                        |   2 +-
 spot-oa/runIpython.sh                           |   2 +-
 spot-setup/hdfs_setup.sh                        |   2 +-
 21 files changed, 712 insertions(+), 476 deletions(-)
----------------------------------------------------------------------



[22/42] incubator-spot git commit: [SPOT-213][SPOT-77] updated requirements and documentation to support Kerberos for ingest

Posted by na...@apache.org.
[SPOT-213][SPOT-77] updated requirements and documentation to support Kerberos for ingest


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/13e35fc1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/13e35fc1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/13e35fc1

Branch: refs/heads/SPOT-181_ODM
Commit: 13e35fc1fa0a48b6df882341ab3c8e1e98324203
Parents: 49f4934
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 15:24:23 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 15:24:23 2018 -0800

----------------------------------------------------------------------
 spot-ingest/KERBEROS.md               | 50 ++++++++++++++++++++++++++++++
 spot-ingest/README.md                 |  6 ++++
 spot-ingest/kerberos-requirements.txt |  4 +++
 spot-ingest/requirements.txt          |  5 ++-
 4 files changed, 64 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/13e35fc1/spot-ingest/KERBEROS.md
----------------------------------------------------------------------
diff --git a/spot-ingest/KERBEROS.md b/spot-ingest/KERBEROS.md
new file mode 100644
index 0000000..2c4c034
--- /dev/null
+++ b/spot-ingest/KERBEROS.md
@@ -0,0 +1,50 @@
+## Kerberos support installation
+
+Run the following in addition to the typical installation instructions:
+
+### Spot-Ingest
+
+`pip install -r ./spot-ingest/kerberos-requirements.txt`
+
+### Spot-OA
+
+`pip install -r ./spot-ingest/kerberos-requirements.txt`
+
+
+## spot.conf
+
+KERBEROS       =  set `KERBEROS='true'` in /etc/spot.conf to enable kerberos
+KEYTAB         =  should be generated using `ktutil` or another approved method
+SASL_MECH      =  should be set to `GSSAPI` for Kerberos; set SECURITY_PROTO to `sasl_plaintext` unless using SSL
+KAFKA_SERVICE_NAME  =  if not set, defaults will be used
+
+SSL            =  enable ssl by setting to true
+SSL_VERIFY     =  setting this to `false` disables host certificate checking; **important:** recommended only in non-production environments
+CA_LOCATION    =  location of certificate authority file
+CERT           =  host certificate
+KEY            =  key required for host certificate
+
+sample below:
+
+```
+#kerberos config
+KERBEROS='true'
+KINIT=/usr/bin/kinit
+PRINCIPAL='spot'
+KEYTAB='/opt/security/spot.keytab'
+SASL_MECH='GSSAPI'
+SECURITY_PROTO='sasl_plaintext'
+KAFKA_SERVICE_NAME=''
+
+#ssl config
+SSL='false'
+SSL_VERIFY='true'
+CA_LOCATION=''
+CERT=''
+KEY=''
+
+```
+
+Please see [LIBRDKAFKA Configurations](https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md)
+for reference to additional settings that can be set by modifying `spot-ingest/common/kafka_client.py`
+

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/13e35fc1/spot-ingest/README.md
----------------------------------------------------------------------
diff --git a/spot-ingest/README.md b/spot-ingest/README.md
index acfb382..ce4f4cc 100644
--- a/spot-ingest/README.md
+++ b/spot-ingest/README.md
@@ -20,6 +20,12 @@ Ingest data is captured or transferred into the Hadoop cluster, where they are t
 ### Install
 1. Install Python dependencies `pip install -r requirements.txt` 
 
+Optional:
+2. The sasl Python package requires the following:
+   * Centos: `yum install cyrus-sasl-devel`
+   * Debian/Ubuntu: `apt-get install libsasl2-dev`
+3. Install Python dependencies for Kerberos: `pip install -r kerberos-requirements.txt`
+
 ### Configure Kafka
 **Adding Kafka Service:**
 

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/13e35fc1/spot-ingest/kerberos-requirements.txt
----------------------------------------------------------------------
diff --git a/spot-ingest/kerberos-requirements.txt b/spot-ingest/kerberos-requirements.txt
new file mode 100644
index 0000000..ae5ea26
--- /dev/null
+++ b/spot-ingest/kerberos-requirements.txt
@@ -0,0 +1,4 @@
+thrift_sasl==0.2.1
+sasl
+hdfs[kerberos]
+requests-kerberos
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/13e35fc1/spot-ingest/requirements.txt
----------------------------------------------------------------------
diff --git a/spot-ingest/requirements.txt b/spot-ingest/requirements.txt
index 7d04054..71661bc 100644
--- a/spot-ingest/requirements.txt
+++ b/spot-ingest/requirements.txt
@@ -1,2 +1,5 @@
 watchdog
-kafka-python
+confluent-kafka
+impyla
+hdfs
+six >= 1.5


[11/42] incubator-spot git commit: 'master' into pr/131, SPOT-244 to close apache/incubator-spot#131

Posted by na...@apache.org.
'master' into pr/131, SPOT-244 to close apache/incubator-spot#131


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/a07b3ebf
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/a07b3ebf
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/a07b3ebf

Branch: refs/heads/SPOT-181_ODM
Commit: a07b3ebf7ecc15f6cb7e0fe848541699fe083c40
Parents: 10256f4 dbf6f51
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 9 13:37:08 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 9 13:39:01 2018 -0800

----------------------------------------------------------------------
 dev/release/README.md                           | 474 +++++++++++++++++++
 .../dns/model/DNSSuspiciousConnectsModel.scala  |  43 +-
 .../org/apache/spot/lda/SpotLDAHelper.scala     | 173 +++++++
 .../org/apache/spot/lda/SpotLDAModel.scala      | 139 ++++++
 .../org/apache/spot/lda/SpotLDAResult.scala     |  43 ++
 .../org/apache/spot/lda/SpotLDAWrapper.scala    | 226 ++-------
 .../model/FlowSuspiciousConnectsModel.scala     |  27 +-
 .../proxy/ProxySuspiciousConnectsModel.scala    |  25 +-
 .../org/apache/spot/utilities/TopDomains.scala  |   1 -
 .../org/apache/spot/lda/SpotLDAHelperTest.scala | 133 ++++++
 .../apache/spot/lda/SpotLDAWrapperTest.scala    | 236 +++------
 11 files changed, 1109 insertions(+), 411 deletions(-)
----------------------------------------------------------------------



[05/42] incubator-spot git commit: Spot-196: Changes: - SpotLDAWrapper: reverted leftover changes on ldaAlpha and ldaBeta - SpotLDAHelper: changed to private some attributes that are already exposed in SpotLDAResults

Posted by na...@apache.org.
Spot-196: Changes:
- SpotLDAWrapper: reverted leftover changes on ldaAlpha and ldaBeta
- SpotLDAHelper: changed to private some attributes that are already exposed in SpotLDAResults


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/dbdcbafe
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/dbdcbafe
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/dbdcbafe

Branch: refs/heads/SPOT-181_ODM
Commit: dbdcbafe0aee4c1140a4046775109b1c61019fe6
Parents: 45c03ab
Author: Ricardo Barona <ri...@intel.com>
Authored: Fri Aug 4 12:48:06 2017 -0500
Committer: Ricardo Barona <ri...@intel.com>
Committed: Fri Oct 6 15:25:58 2017 -0500

----------------------------------------------------------------------
 .../src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala    | 7 ++++---
 .../test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala   | 4 ++--
 2 files changed, 6 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/dbdcbafe/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala
index 8e771cb..e9f0b66 100644
--- a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala
+++ b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala
@@ -31,9 +31,10 @@ import scala.collection.immutable.Map
   */
 class SpotLDAHelper(private final val sparkSession: SparkSession,
                     final val docWordCount: RDD[SpotLDAInput],
-                    final val documentDictionary: DataFrame,
-                    final val wordDictionary: Map[String, Int],
-                    final val precisionUtility: FloatPointPrecisionUtility = FloatPointPrecisionUtility64) extends Serializable {
+                    private final val documentDictionary: DataFrame,
+                    private final val wordDictionary: Map[String, Int],
+                    private final val precisionUtility: FloatPointPrecisionUtility = FloatPointPrecisionUtility64)
+  extends Serializable {
 
   /**
     * Format document word count as RDD[(Long, Vector)] - input data for LDA algorithm

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/dbdcbafe/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala b/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
index 7007ba1..ae25d89 100644
--- a/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
+++ b/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
@@ -62,8 +62,8 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
     logger.setLevel(Level.WARN)
 
-    val ldaAlpha = 1.002
-    val ldaBeta = 1.0001
+    val ldaAlpha = 1.02
+    val ldaBeta = 1.001
     val ldaMaxIterations = 100
 
     val optimizer = "em"


[18/42] incubator-spot git commit: [OA][PEP 8] reformat utils.py

Posted by na...@apache.org.
[OA][PEP 8] reformat utils.py


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/afeb0994
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/afeb0994
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/afeb0994

Branch: refs/heads/SPOT-181_ODM
Commit: afeb0994fe8b46bb510a6947e367391d12930e9e
Parents: 0e74919
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 11:08:56 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 11:08:56 2018 -0800

----------------------------------------------------------------------
 spot-oa/oa/utils.py | 282 +++++++++++++++++++++++------------------------
 1 file changed, 136 insertions(+), 146 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/afeb0994/spot-oa/oa/utils.py
----------------------------------------------------------------------
diff --git a/spot-oa/oa/utils.py b/spot-oa/oa/utils.py
index 2bed10e..8ac6555 100644
--- a/spot-oa/oa/utils.py
+++ b/spot-oa/oa/utils.py
@@ -22,115 +22,114 @@ import csv
 import sys
 import ConfigParser
 
+
 class Util(object):
-	
-	@classmethod
-	def get_logger(cls,logger_name,create_file=False):
-		
-
-		# create logger for prd_ci
-		log = logging.getLogger(logger_name)
-		log.setLevel(level=logging.INFO)
-		
-		# create formatter and add it to the handlers
-		formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-		
-		if create_file:
-				# create file handler for logger.
-				fh = logging.FileHandler('oa.log')
-				fh.setLevel(level=logging.DEBUG)
-				fh.setFormatter(formatter)
-		# reate console handler for logger.
-		ch = logging.StreamHandler()
-		ch.setLevel(level=logging.DEBUG)
-		ch.setFormatter(formatter)
-
-		# add handlers to logger.
-		if create_file:
-			log.addHandler(fh)
-
-		log.addHandler(ch)
-		return  log
-
-	@classmethod
-	def get_spot_conf(cls):
-		
-		conf_file = "/etc/spot.conf"
-		config = ConfigParser.ConfigParser()
-		config.readfp(SecHead(open(conf_file)))	
-
-		return config
-	
-	@classmethod
-	def create_oa_folders(cls,type,date):		
-
-		# create date and ingest summary folder structure if they don't' exist.
-		root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-		data_type_folder = "{0}/data/{1}/{2}"
-		if not os.path.isdir(data_type_folder.format(root_path,type,date)): os.makedirs(data_type_folder.format(root_path,type,date))
-		if not os.path.isdir(data_type_folder.format(root_path,type,"ingest_summary")): os.makedirs(data_type_folder.format(root_path,type,"ingest_summary"))
-
-		# create ipynb folders.
-		ipynb_folder = "{0}/ipynb/{1}/{2}".format(root_path,type,date)
-		if not os.path.isdir(ipynb_folder): os.makedirs(ipynb_folder)
-
-		# retun path to folders.
-		data_path = data_type_folder.format(root_path,type,date)
-		ingest_path = data_type_folder.format(root_path,type,"ingest_summary")		
-		return data_path,ingest_path,ipynb_folder
-	
-	@classmethod
-	def get_ml_results_form_hdfs(cls,hdfs_file_path,local_path):
-
-		# get results from hdfs.
-		get_results_cmd = "hadoop fs -get {0} {1}/.".format(hdfs_file_path,local_path)
-		subprocess.call(get_results_cmd,shell=True)
-		return get_results_cmd
-
-	@classmethod
-	def read_results(cls,file,limit, delimiter=','):
-		
-		# read csv results.
-		result_rows = []
-		with open(file, 'rb') as results_file:
-			csv_reader = csv.reader(results_file, delimiter = delimiter)
-			for i in range(0, int(limit)):
-				try:
-					row = csv_reader.next()
-				except StopIteration:
-					return result_rows
-				result_rows.append(row)
-		return result_rows
-
-	@classmethod
-	def ip_to_int(self,ip):
-		
-		try:
-			o = map(int, ip.split('.'))
-			res = (16777216 * o[0]) + (65536 * o[1]) + (256 * o[2]) + o[3]
-			return res    
-
-		except ValueError:
-			return None
-	
-	
-	@classmethod
-	def create_csv_file(cls,full_path_file,content,delimiter=','):   
-		with open(full_path_file, 'w+') as u_file:
-			writer = csv.writer(u_file, quoting=csv.QUOTE_NONE, delimiter=delimiter)
-			writer.writerows(content)
-
-
-	@classmethod
-    	def cast_val(self,value):
-       	    try: 
-            	val = int(value) 
+    @classmethod
+    def get_logger(cls, logger_name, create_file=False):
+
+        # create logger for prd_ci
+        log = logging.getLogger(logger_name)
+        log.setLevel(level=logging.INFO)
+
+        # create formatter and add it to the handlers
+        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+
+        if create_file:
+            # create file handler for logger.
+            fh = logging.FileHandler('oa.log')
+            fh.setLevel(level=logging.DEBUG)
+            fh.setFormatter(formatter)
+        # create console handler for logger.
+        ch = logging.StreamHandler()
+        ch.setLevel(level=logging.DEBUG)
+        ch.setFormatter(formatter)
+
+        # add handlers to logger.
+        if create_file:
+            log.addHandler(fh)
+
+        log.addHandler(ch)
+        return log
+
+    @classmethod
+    def get_spot_conf(cls):
+
+        conf_file = "/etc/spot.conf"
+        config = ConfigParser.ConfigParser()
+        config.readfp(SecHead(open(conf_file)))
+
+        return config
+
+    @classmethod
+    def create_oa_folders(cls, type, date):
+
+        # create date and ingest summary folder structure if they don't exist.
+        root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+        data_type_folder = "{0}/data/{1}/{2}"
+        if not os.path.isdir(data_type_folder.format(root_path, type, date)): os.makedirs(
+            data_type_folder.format(root_path, type, date))
+        if not os.path.isdir(data_type_folder.format(root_path, type, "ingest_summary")): os.makedirs(
+            data_type_folder.format(root_path, type, "ingest_summary"))
+
+        # create ipynb folders.
+        ipynb_folder = "{0}/ipynb/{1}/{2}".format(root_path, type, date)
+        if not os.path.isdir(ipynb_folder): os.makedirs(ipynb_folder)
+
+        # return path to folders.
+        data_path = data_type_folder.format(root_path, type, date)
+        ingest_path = data_type_folder.format(root_path, type, "ingest_summary")
+        return data_path, ingest_path, ipynb_folder
+
+    @classmethod
+    def get_ml_results_form_hdfs(cls, hdfs_file_path, local_path):
+
+        # get results from hdfs.
+        get_results_cmd = "hadoop fs -get {0} {1}/.".format(hdfs_file_path, local_path)
+        subprocess.call(get_results_cmd, shell=True)
+        return get_results_cmd
+
+    @classmethod
+    def read_results(cls, file, limit, delimiter=','):
+
+        # read csv results.
+        result_rows = []
+        with open(file, 'rb') as results_file:
+            csv_reader = csv.reader(results_file, delimiter=delimiter)
+            for i in range(0, int(limit)):
+                try:
+                    row = csv_reader.next()
+                except StopIteration:
+                    return result_rows
+                result_rows.append(row)
+        return result_rows
+
+    @classmethod
+    def ip_to_int(self, ip):
+
+        try:
+            o = map(int, ip.split('.'))
+            res = (16777216 * o[0]) + (65536 * o[1]) + (256 * o[2]) + o[3]
+            return res
+
+        except ValueError:
+            return None
+
+    @classmethod
+    def create_csv_file(cls, full_path_file, content, delimiter=','):
+        with open(full_path_file, 'w+') as u_file:
+            writer = csv.writer(u_file, quoting=csv.QUOTE_NONE, delimiter=delimiter)
+            writer.writerows(content)
+
+    @classmethod
+    def cast_val(self, value):
+        try:
+            val = int(value)
+        except:
+            try:
+                val = float(value)
             except:
-            	try:
-                    val = float(value) 
-            	except:
-                    val = str(value) 
-            return val    
+                val = str(value)
+        return val
 
 
 class SecHead(object):
@@ -140,47 +139,38 @@ class SecHead(object):
 
     def readline(self):
         if self.sechead:
-            try: 
+            try:
                 return self.sechead
-            finally: 
+            finally:
                 self.sechead = None
-        else: 
+        else:
             return self.fp.readline()
 
-class ProgressBar(object):
-
-	def __init__(self,total,prefix='',sufix='',decimals=2,barlength=60):
-
-		self._total = total
-		self._prefix = prefix
-		self._sufix = sufix
-		self._decimals = decimals
-		self._bar_length = barlength
-		self._auto_iteration_status = 0
-
-	def start(self):
-
-		self._move_progress_bar(0)
-	
-	def update(self,iterator):
-		
-		self._move_progress_bar(iterator)
-
-	def auto_update(self):
-
-		self._auto_iteration_status += 1		
-		self._move_progress_bar(self._auto_iteration_status)
-	
-	def _move_progress_bar(self,iteration):
-
-		filledLength    = int(round(self._bar_length * iteration / float(self._total)))
-		percents        = round(100.00 * (iteration / float(self._total)), self._decimals)
-		bar             = '#' * filledLength + '-' * (self._bar_length - filledLength)	
-		sys.stdout.write("{0} [{1}] {2}% {3}\r".format(self._prefix, bar, percents, self._sufix))		
-		sys.stdout.flush()
-		
-		if iteration == self._total:print("\n")
-
-		
-	
 
+class ProgressBar(object):
+    def __init__(self, total, prefix='', sufix='', decimals=2, barlength=60):
+        self._total = total
+        self._prefix = prefix
+        self._sufix = sufix
+        self._decimals = decimals
+        self._bar_length = barlength
+        self._auto_iteration_status = 0
+
+    def start(self):
+        self._move_progress_bar(0)
+
+    def update(self, iterator):
+        self._move_progress_bar(iterator)
+
+    def auto_update(self):
+        self._auto_iteration_status += 1
+        self._move_progress_bar(self._auto_iteration_status)
+
+    def _move_progress_bar(self, iteration):
+        filledLength = int(round(self._bar_length * iteration / float(self._total)))
+        percents = round(100.00 * (iteration / float(self._total)), self._decimals)
+        bar = '#' * filledLength + '-' * (self._bar_length - filledLength)
+        sys.stdout.write("{0} [{1}] {2}% {3}\r".format(self._prefix, bar, percents, self._sufix))
+        sys.stdout.flush()
+
+        if iteration == self._total: print("\n")
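
For reference, a minimal sketch of how the reformatted helpers are typically driven from an OA module. The import path assumes spot-oa is on PYTHONPATH; the logger name and loop are illustrative only (note the parameter really is spelled `sufix` in utils.py):

```python
# Illustrative usage of Util.get_logger() and ProgressBar from spot-oa/oa/utils.py.
from oa.utils import Util, ProgressBar

log = Util.get_logger('SPOT.OA.EXAMPLE', create_file=False)  # console handler only
log.info('starting example run')

bar = ProgressBar(total=5, prefix='processing', sufix='rows')
bar.start()
for _ in range(5):
    bar.auto_update()  # advance the bar one step per iteration; prints a newline at 100%
```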


[42/42] incubator-spot git commit: fixing merge conflicts master-->SPOT-181_ODM

Posted by na...@apache.org.
fixing merge conflicts master-->SPOT-181_ODM


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/ee4e17d7
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/ee4e17d7
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/ee4e17d7

Branch: refs/heads/SPOT-181_ODM
Commit: ee4e17d7e6961e8df18dada10d154bdb9f8bf259
Parents: 14dbd51 0e3ef34
Author: natedogs911 <na...@gmail.com>
Authored: Mon Mar 19 12:26:29 2018 -0700
Committer: natedogs911 <na...@gmail.com>
Committed: Mon Mar 19 12:26:29 2018 -0700

----------------------------------------------------------------------
 docs/open-data-model.md                         | 3310 ++++++++++++++++++
 spot-gen/README.md                              |   66 +
 spot-gen/conf/asa.yaml                          |   33 +
 spot-gen/conf/asa/asa.sample                    |   13 +
 .../conf/asa/not-supported-by-parser.sample     |   40 +
 spot-gen/conf/common/files.txt                  |    2 +
 spot-gen/conf/common/hosts.txt                  |    5 +
 spot-gen/conf/common/subjects.txt               |   14 +
 spot-gen/conf/common/users.txt                  |    5 +
 spot-gen/conf/common/users_info.txt             |    5 +
 spot-gen/conf/common/utils.py                   |   36 +
 spot-gen/conf/example.yaml                      |   35 +
 spot-gen/conf/example/domains.txt               |    2 +
 spot-gen/conf/example/events1.txt               |    2 +
 spot-gen/conf/example/utils.py                  |   19 +
 spot-gen/conf/unix.yaml                         |   14 +
 spot-gen/conf/unix/unix_events.sample           |    4 +
 spot-gen/conf/windows_nxlog.yaml                |   42 +
 .../conf/windows_nxlog/windows_nxlog.sample     |   25 +
 spot-gen/datagen.py                             |  227 ++
 spot-ingest/streamsets/README.md                |   27 +
 .../ODMCentrifyIdentityPlatformEventTCP.json    | 1096 ++++++
 spot-ingest/streamsets/images/ImportContext.png |  Bin 0 -> 61789 bytes
 .../streamsets/images/ImportPipeline.png        |  Bin 0 -> 65915 bytes
 .../streamsets/netflow/NetFlowODMandLegacy.json | 1463 ++++++++
 .../qualys/ODMQualysVulnerabilityContext.json   | 1276 +++++++
 .../qualys/ODMQualysVulnerabilityEvents.json    | 1245 +++++++
 .../streamsets/windows/ODMWindowsEventLogs.json |  943 +++++
 .../streamsets/windows/WindowsHTTPEdge.json     |  603 ++++
 spot-ml/ml_ops.sh                               |    9 +-
 .../org/apache/spot/SuspiciousConnects.scala    |   20 +-
 .../spot/SuspiciousConnectsArgumentParser.scala |   36 +-
 .../utilities/data/InputOutputDataHandler.scala |   10 +-
 .../SuspiciousConnectsArgumentParserTest.scala  |  121 +-
 spot-oa/api/graphql/webapp.py                   |    5 +
 spot-oa/arcadia/README.md                       |   84 +
 spot-oa/arcadia/spot_app.json                   |    1 +
 spot-oa/requirements.txt                        |    1 +
 spot-setup/README.md                            |    5 +
 spot-setup/create_email_parquet.hql             |   31 +
 spot-setup/create_wgdhcp_parquet.hql            |   24 +
 spot-setup/create_wgtraffic_parquet.hql         |   51 +
 spot-setup/create_windows_parquet.hql           |   45 +
 spot-setup/odm/README.md                        |   68 +
 spot-setup/odm/create_endpoint_context_avro.sql |   58 +
 spot-setup/odm/create_endpoint_context_pqt.sql  |   57 +
 spot-setup/odm/create_event_avro.sql            |  302 ++
 spot-setup/odm/create_event_pqt.sql             |  301 ++
 spot-setup/odm/create_network_context_avro.sql  |   48 +
 spot-setup/odm/create_network_context_pqt.sql   |   47 +
 .../create_threat_intelligence_context_avro.sql |   76 +
 .../create_threat_intelligence_context_pqt.sql  |   75 +
 spot-setup/odm/create_user_context_avro.sql     |   51 +
 spot-setup/odm/create_user_context_pqt.sql      |   50 +
 .../odm/create_vulnerability_context_avro.sql   |   32 +
 .../odm/create_vulnerability_context_pqt.sql    |   31 +
 spot-setup/odm/endpoint_context.avsc            |   44 +
 spot-setup/odm/event.avsc                       |  266 ++
 spot-setup/odm/network_context.avsc             |   34 +
 spot-setup/odm/odm_setup.sh                     |  197 ++
 spot-setup/odm/threat_intelligence_context.avsc |   62 +
 spot-setup/odm/user_context.avsc                |   37 +
 spot-setup/odm/vulnerability_context.avsc       |   18 +
 spot-setup/spot.conf                            |   36 +-
 .../views/hive/AdministrationActivity.sql       |  131 +
 .../views/hive/FileObjectAccessedOrChanged.sql  |  118 +
 spot-setup/views/hive/MessageEvent.sql          |   77 +
 spot-setup/views/hive/NetworkConnection.sql     |   86 +
 spot-setup/views/hive/PasswordChangeOrReset.sql |   45 +
 .../views/hive/ProcessStartupOrShutdown.sql     |   87 +
 .../hive/SecurityObjectAccessedOrChanged.sql    |  119 +
 spot-setup/views/hive/UseOfPrivilegeCommand.sql |   89 +
 .../views/hive/UserAccountAddedOrRemoved.sql    |  102 +
 spot-setup/views/hive/UserLogin.sql             |   89 +
 74 files changed, 13802 insertions(+), 26 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/ee4e17d7/spot-ml/ml_ops.sh
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/ee4e17d7/spot-oa/requirements.txt
----------------------------------------------------------------------
diff --cc spot-oa/requirements.txt
index 2596e64,5461aae..2339c05
--- a/spot-oa/requirements.txt
+++ b/spot-oa/requirements.txt
@@@ -16,7 -16,8 +16,8 @@@ ipython == 3.2.
  # GraphQL API dependencies
  flask
  flask-graphql
+ flask-cors
 -graphql-core
 +graphql-core == 1.1.0
  urllib3
  
  # API Resources

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/ee4e17d7/spot-setup/README.md
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/ee4e17d7/spot-setup/spot.conf
----------------------------------------------------------------------
diff --cc spot-setup/spot.conf
index aa08ea7,407e38f..6b3df85
--- a/spot-setup/spot.conf
+++ b/spot-setup/spot.conf
@@@ -84,3 -80,15 +80,15 @@@ PRECISION='64
  TOL='1e-6'
  TOPIC_COUNT=20
  DUPFACTOR=1000
+ 
+ # API CORS Options
+ #
+ #   ACCESS_CONTROL_ALLOW_ORIGIN:
+ #       Configuration type: string or comma separated list
+ #
+ #   Examples:
+ #   '*' = Allow any origin (Default)
+ #   'http://trustedresource.com' = Allow specific origin
+ #   'http://trustedresource.com,http://anothertrustedresource.com' = Allow multiple origins
+ #
 -ACCESS_CONTROL_ALLOW_ORIGIN='*'
++ACCESS_CONTROL_ALLOW_ORIGIN='*'
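
As a hedged illustration of how the new setting could be consumed by the GraphQL webapp through flask-cors (added to requirements.txt above): the hard-coded value and app wiring below are placeholders, not the actual webapp.py change, which would read ACCESS_CONTROL_ALLOW_ORIGIN from /etc/spot.conf.

```python
# Sketch only: map the spot.conf CORS value onto flask-cors.
from flask import Flask
from flask_cors import CORS

allow_origin = '*'  # assumed to come from /etc/spot.conf (ACCESS_CONTROL_ALLOW_ORIGIN)

app = Flask(__name__)
if allow_origin == '*':
    CORS(app)                                   # allow any origin (default)
else:
    CORS(app, origins=allow_origin.split(','))  # one or more trusted origins
```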


[36/42] incubator-spot git commit: fixes for save_data()

Posted by na...@apache.org.
fixes for save_data()


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/2ea6b4ea
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/2ea6b4ea
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/2ea6b4ea

Branch: refs/heads/SPOT-181_ODM
Commit: 2ea6b4eac7c11ef6084955179eb211b696737e9e
Parents: b9befd7
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 12:01:31 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 12:01:31 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/2ea6b4ea/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index c2ddb04..597c13c 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -148,21 +148,28 @@ def proxy_parser(proxy_fields):
     return proxy_parsed_data
 
 
-def save_data(rdd,sqc,db,db_table,topic):
+def save_data(rdd, sqc, db, db_table, topic):
     """
     Create and save a data frame with the given data.
+
+    :param rdd: collection of objects (Resilient Distributed Dataset) to store
+    :param sqc: Apache Hive context
+    :param db: Apache Hive database to save into
+    :param db_table: table of `db` to save into
+    :param topic: Apache Kafka topic to listen for (if `rdd` is empty)
     """
     if not rdd.isEmpty():
 
-        df = sqc.createDataFrame(rdd,proxy_schema)        
+        df = sqc.createDataFrame(rdd, proxy_schema)
         sqc.setConf("hive.exec.dynamic.partition", "true")
         sqc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
-        hive_table = "{0}.{1}".format(db,db_table)
+        hive_table = "{0}.{1}".format(db, db_table)
         df.write.format("parquet").mode("append").insertInto(hive_table)
 
     else:
         print("------------------------LISTENING KAFKA TOPIC:{0}------------------------".format(topic))
 
+
 def bluecoat_parse(zk,topic,db,db_table,num_of_workers,batch_size):
     
     app_name = topic
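
To make the documented contract concrete, here is a hedged sketch of calling save_data() directly; the import path, database and topic names are placeholders. An empty RDD exercises the else-branch above and only prints the "LISTENING KAFKA TOPIC" message instead of writing to Hive.

```python
# Sketch only; in the real pipeline bluecoat_parse() wires this up per micro-batch.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pipelines.proxy.bluecoat import save_data  # assumes spot-ingest on PYTHONPATH

sc = SparkContext(appName='bluecoat-example')
sqc = HiveContext(sc)

# Nothing to persist, so this only reports that the worker keeps listening on the topic.
save_data(sc.emptyRDD(), sqc, 'spotdb', 'proxy', 'spot-proxy-topic')
```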


[26/42] incubator-spot git commit: [SPOT-213] fix readme location and typo

Posted by na...@apache.org.
[SPOT-213] fix readme location and typo


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/f594956e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/f594956e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/f594956e

Branch: refs/heads/SPOT-181_ODM
Commit: f594956e2b7fa3cd5a09fe2ad2fa5bc697cf347a
Parents: 41e51b8
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 23 12:10:54 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 23 12:10:54 2018 -0800

----------------------------------------------------------------------
 spot-ingest/KERBEROS.md | 50 --------------------------------------------
 spot-setup/KERBEROS.md  | 50 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 50 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f594956e/spot-ingest/KERBEROS.md
----------------------------------------------------------------------
diff --git a/spot-ingest/KERBEROS.md b/spot-ingest/KERBEROS.md
deleted file mode 100644
index 2c4c034..0000000
--- a/spot-ingest/KERBEROS.md
+++ /dev/null
@@ -1,50 +0,0 @@
-## Kerberos support installation
-
-run the following in addition to the typical installation instructions
-
-### Spot-Ingest
-
-`pip install -r ./spot-ingest/kerberos-requirements.txt`
-
-### Spot-OA
-
-`pip install -r ./spot-ingest/kerberos-requirements.txt`
-
-
-## spot.conf
-
-KERBEROS       =  set `KERBEROS='true'` in /etc/spot.conf to enable kerberos
-KEYTAB         =  should be generated using `ktutil` or another approved method
-SASL_MECH      =  should be set to `sasl_plaintext` unless using ssl
-KAFKA_SERVICE  =  if not set defaults will be used
-
-SSL            =  enable ssl by setting to true
-SSL_VERIFY     =  by setting to `false` disables host checking **important** only recommended in non production environments
-CA_LOCATION    =  location of certificate authority file
-CERT           =  host certificate
-KEY            =  key required for host certificate
-
-sample below:
-
-```
-#kerberos config
-KERBEROS='true'
-KINIT=/usr/bin/kinit
-PRINCIPAL='spot'
-KEYTAB='/opt/security/spot.keytab'
-SASL_MECH='GSSAPI'
-SECURITY_PROTO='sasl_plaintext'
-KAFKA_SERVICE_NAME=''
-
-#ssl config
-SSL='false'
-SSL_VERIFY='true'
-CA_LOCATION=''
-CERT=''
-KEY=''
-
-```
-
-Please see [LIBRDKAFKA Configurations](https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md)
-for reference to additional settings that can be set by modifying `spot-ingest/common/kafka_client.py`
-

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f594956e/spot-setup/KERBEROS.md
----------------------------------------------------------------------
diff --git a/spot-setup/KERBEROS.md b/spot-setup/KERBEROS.md
new file mode 100644
index 0000000..a980d1c
--- /dev/null
+++ b/spot-setup/KERBEROS.md
@@ -0,0 +1,50 @@
+## Kerberos support installation
+
+run the following in addition to the typical installation instructions
+
+### Spot-Ingest
+
+`pip install -r ./spot-ingest/kerberos-requirements.txt`
+
+### Spot-OA
+
+`pip install -r ./spot-oa/kerberos-requirements.txt`
+
+
+## spot.conf
+
+KERBEROS       =  set `KERBEROS='true'` in /etc/spot.conf to enable kerberos
+KEYTAB         =  should be generated using `ktutil` or another approved method
+SASL_MECH      =  should be set to `sasl_plaintext` unless using ssl
+KAFKA_SERVICE_NAME =  if not set, defaults will be used
+
+SSL            =  enable ssl by setting to true
+SSL_VERIFY     =  setting to `false` disables host checking; **important**: only recommended in non-production environments
+CA_LOCATION    =  location of certificate authority file
+CERT           =  host certificate
+KEY            =  key required for host certificate
+
+sample below:
+
+```
+#kerberos config
+KERBEROS='true'
+KINIT=/usr/bin/kinit
+PRINCIPAL='spot'
+KEYTAB='/opt/security/spot.keytab'
+SASL_MECH='GSSAPI'
+SECURITY_PROTO='sasl_plaintext'
+KAFKA_SERVICE_NAME=''
+
+#ssl config
+SSL='false'
+SSL_VERIFY='true'
+CA_LOCATION=''
+CERT=''
+KEY=''
+
+```
+
+Please see [LIBRDKAFKA Configurations](https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md)
+for reference to additional settings that can be set by modifying `spot-ingest/common/kafka_client.py`
+
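
For context, a hedged sketch of how these spot.conf values map onto librdkafka properties in a confluent-kafka consumer (the kind of client configured in spot-ingest/common/kafka_client.py); the broker list and group id are assumptions, and property names follow the librdkafka CONFIGURATION.md linked above.

```python
# Sketch only: values mirror the sample spot.conf block above.
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'broker1:9092',                   # assumption: your Kafka brokers
    'group.id': 'spot-ingest',                             # assumption: any group id works
    'security.protocol': 'sasl_plaintext',                 # SECURITY_PROTO
    'sasl.mechanisms': 'GSSAPI',                           # SASL_MECH
    'sasl.kerberos.service.name': 'kafka',                 # KAFKA_SERVICE_NAME (broker default)
    'sasl.kerberos.principal': 'spot',                     # PRINCIPAL
    'sasl.kerberos.keytab': '/opt/security/spot.keytab',   # KEYTAB
})
```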


[27/42] incubator-spot git commit: Merge branch 'master' into pr/123 to close SPOT-254 and apache/incubator-spot#123

Posted by na...@apache.org.
Merge branch 'master' into pr/123 to close SPOT-254 and apache/incubator-spot#123


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/935dfb49
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/935dfb49
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/935dfb49

Branch: refs/heads/SPOT-181_ODM
Commit: 935dfb4992ead12b45e2ea7cb8e580d02678e08d
Parents: 3baa75a 6deaae3
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 23 16:59:04 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 23 16:59:04 2018 -0800

----------------------------------------------------------------------
 spot-ingest/common/kafka_topic.sh               |   2 +-
 spot-ingest/master_collector.py                 |  60 ++---
 spot-ingest/start_ingest_standalone.sh          |   2 +-
 spot-ingest/worker.py                           |  56 +++--
 spot-ml/ml_ops.sh                               |   2 +-
 spot-ml/ml_test.sh                              |   4 +-
 .../dns/model/DNSSuspiciousConnectsModel.scala  |  43 ++--
 .../org/apache/spot/lda/SpotLDAHelper.scala     | 173 ++++++++++++++
 .../org/apache/spot/lda/SpotLDAModel.scala      | 139 +++++++++++
 .../org/apache/spot/lda/SpotLDAResult.scala     |  43 ++++
 .../org/apache/spot/lda/SpotLDAWrapper.scala    | 226 +++---------------
 .../model/FlowSuspiciousConnectsModel.scala     |  27 +--
 .../proxy/ProxySuspiciousConnectsModel.scala    |  25 +-
 .../org/apache/spot/utilities/TopDomains.scala  |   1 -
 .../org/apache/spot/lda/SpotLDAHelperTest.scala | 133 +++++++++++
 .../apache/spot/lda/SpotLDAWrapperTest.scala    | 236 ++++++-------------
 spot-oa/oa/dns/dns_oa.py                        |   4 +-
 spot-oa/requirements.txt                        |   2 +-
 spot-oa/runIpython.sh                           |   2 +-
 spot-setup/hdfs_setup.sh                        |   2 +-
 20 files changed, 709 insertions(+), 473 deletions(-)
----------------------------------------------------------------------



[20/42] incubator-spot git commit: [SPOT-213][SPOT-216] [setup] moved script files to support additional engines such as beeline, impala

Posted by na...@apache.org.
[SPOT-213][SPOT-216] [setup] moved script files to support additional engines such as beeline, impala


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/3383c07c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/3383c07c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/3383c07c

Branch: refs/heads/SPOT-181_ODM
Commit: 3383c07cbaf695953facdc3c269c01af992abaae
Parents: 8b600c8
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 12:23:24 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 12:23:24 2018 -0800

----------------------------------------------------------------------
 spot-setup/beeline/create_dns_parquet.hql   | 162 +++++++++++++++++++
 spot-setup/beeline/create_flow_parquet.hql  | 194 ++++++++++++++++++++++
 spot-setup/beeline/create_proxy_parquet.hql | 179 ++++++++++++++++++++
 spot-setup/create_dns_parquet.hql           | 163 -------------------
 spot-setup/create_flow_parquet.hql          | 195 ----------------------
 spot-setup/create_proxy_parquet.hql         | 177 --------------------
 spot-setup/hive/create_dns_parquet.hql      | 165 +++++++++++++++++++
 spot-setup/hive/create_flow_parquet.hql     | 197 +++++++++++++++++++++++
 spot-setup/hive/create_proxy_parquet.hql    | 179 ++++++++++++++++++++
 spot-setup/impala/create_dns_parquet.hql    | 163 +++++++++++++++++++
 spot-setup/impala/create_flow_parquet.hql   | 195 ++++++++++++++++++++++
 spot-setup/impala/create_proxy_parquet.hql  | 177 ++++++++++++++++++++
 12 files changed, 1611 insertions(+), 535 deletions(-)
----------------------------------------------------------------------
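
The three copies of each DDL file differ mainly in variable-substitution syntax: the hive scripts use ${hiveconf:...}, the beeline scripts use plain ${...} hivevars, and the impala scripts appear to keep the original ${var:...} style. Below is a hedged sketch of driving the same table creation per engine; the JDBC URL, paths and dbname/huser values are assumptions.

```python
# Sketch only: pass dbname/huser into each engine's create_dns_parquet.hql.
import subprocess

DBNAME, HUSER = 'spotdb', '/user/spot'

ENGINE_CMDS = {
    'hive': ['hive', '-hiveconf', 'dbname=%s' % DBNAME, '-hiveconf', 'huser=%s' % HUSER,
             '-f', 'spot-setup/hive/create_dns_parquet.hql'],
    'beeline': ['beeline', '-u', 'jdbc:hive2://localhost:10000',      # assumed HiveServer2 URL
                '--hivevar', 'dbname=%s' % DBNAME, '--hivevar', 'huser=%s' % HUSER,
                '-f', 'spot-setup/beeline/create_dns_parquet.hql'],
    'impala': ['impala-shell', '--var=dbname=%s' % DBNAME, '--var=huser=%s' % HUSER,
               '-f', 'spot-setup/impala/create_dns_parquet.hql'],
}

subprocess.check_call(ENGINE_CMDS['hive'])
```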


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/beeline/create_dns_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/beeline/create_dns_parquet.hql b/spot-setup/beeline/create_dns_parquet.hql
new file mode 100755
index 0000000..b9be108
--- /dev/null
+++ b/spot-setup/beeline/create_dns_parquet.hql
@@ -0,0 +1,162 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns (
+frame_time STRING, 
+unix_tstamp BIGINT,
+frame_len INT,
+ip_dst STRING,
+ip_src STRING,
+dns_qry_name STRING,
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT,
+dns_a STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT,
+h TINYINT
+)
+STORED AS PARQUET 
+LOCATION '${huser}/dns/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns_dendro (
+unix_tstamp BIGINT,
+dns_a STRING,
+dns_qry_name STRING,
+ip_dst STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/dns/hive/oa/dendro';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns_edge (
+unix_tstamp BIGINT,
+frame_len BIGINT,
+ip_dst STRING,
+ip_src STRING,
+dns_qry_name STRING,
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT,
+dns_a STRING,
+hh INT,
+dns_qry_class_name STRING,
+dns_qry_type_name STRING,
+dns_qry_rcode_name STRING,
+network_context STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/dns/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns_ingest_summary (
+tdate STRING,
+total BIGINT
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/dns/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns_scores (
+frame_time STRING, 
+unix_tstamp BIGINT,
+frame_len BIGINT,
+ip_dst STRING, 
+dns_qry_name STRING, 
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT, 
+ml_score FLOAT,
+tld STRING,
+query_rep STRING,
+hh INT,
+dns_qry_class_name STRING, 
+dns_qry_type_name STRING,
+dns_qry_rcode_name STRING, 
+network_context STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/dns/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns_storyboard (
+ip_threat STRING,
+dns_threat STRING, 
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/dns/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns_threat_dendro (
+anchor STRING, 
+total BIGINT,
+dns_qry_name STRING, 
+ip_dst STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/dns/hive/oa/threat_dendro';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.dns_threat_investigation (
+unix_tstamp BIGINT,
+ip_dst STRING, 
+dns_qry_name STRING, 
+ip_sev INT,
+dns_sev INT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/dns/hive/oa/threat_investigation';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/beeline/create_flow_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/beeline/create_flow_parquet.hql b/spot-setup/beeline/create_flow_parquet.hql
new file mode 100755
index 0000000..25e860a
--- /dev/null
+++ b/spot-setup/beeline/create_flow_parquet.hql
@@ -0,0 +1,194 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow (
+treceived STRING,
+unix_tstamp BIGINT,
+tryear INT,
+trmonth INT,
+trday INT,
+trhour INT,
+trminute INT,
+trsec INT,
+tdur FLOAT,
+sip STRING,
+dip STRING,
+sport INT,
+dport INT,
+proto STRING,
+flag STRING,
+fwd INT,
+stos INT,
+ipkt BIGINT,
+ibyt BIGINT,
+opkt BIGINT, 
+obyt BIGINT,
+input INT,
+output INT,
+sas INT,
+das INT,
+dtos INT,
+dir INT,
+rip STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT,
+h TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/flow/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow_chords (
+ip_threat STRING,
+srcip STRING,
+dstip STRING,
+ibyt BIGINT, 
+ipkt BIGINT
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/flow/hive/oa/chords';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow_edge (
+tstart STRING, 
+srcip STRING,
+dstip STRING,
+sport INT, 
+dport INT, 
+proto STRING,
+flags STRING,
+tos INT, 
+ibyt BIGINT, 
+ipkt BIGINT, 
+input BIGINT,
+output BIGINT, 
+rip STRING,
+obyt BIGINT, 
+opkt BIGINT, 
+hh INT,
+mn INT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/flow/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow_ingest_summary (
+tdate STRING,
+total BIGINT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/flow/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow_scores (
+tstart STRING, 
+srcip STRING,
+dstip STRING,
+sport INT, 
+dport INT, 
+proto STRING,
+ipkt INT,
+ibyt INT,
+opkt INT,
+obyt INT,
+ml_score FLOAT,
+rank INT,
+srcip_INTernal INT,
+dstip_INTernal INT,
+src_geoloc STRING, 
+dst_geoloc STRING, 
+src_domain STRING, 
+dst_domain STRING, 
+src_rep STRING,
+dst_rep STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/flow/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow_storyboard (
+ip_threat STRING,
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/flow/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow_threat_investigation (
+tstart STRING,
+srcip STRING, 
+dstip STRING, 
+srcport INT,
+dstport INT,
+score INT 
+) 
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+) 
+STORED AS PARQUET 
+LOCATION '${huser}/flow/hive/oa/threat_investigation';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.flow_timeline (
+ip_threat STRING,
+tstart STRING, 
+tend STRING, 
+srcip STRING,
+dstip STRING,
+proto STRING,
+sport INT, 
+dport INT, 
+ipkt BIGINT, 
+ibyt BIGINT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/flow/hive/oa/timeline';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/beeline/create_proxy_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/beeline/create_proxy_parquet.hql b/spot-setup/beeline/create_proxy_parquet.hql
new file mode 100755
index 0000000..d9cd79f
--- /dev/null
+++ b/spot-setup/beeline/create_proxy_parquet.hql
@@ -0,0 +1,179 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+SET huser;
+SET dbname;
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.proxy (
+p_date STRING,
+p_time STRING,
+clientip STRING,
+host STRING,
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+duration INT,
+username STRING,
+authgroup STRING,
+exceptionid STRING,
+filterresult STRING,
+webcat STRING,
+referer STRING,
+respcode STRING,
+action STRING,
+urischeme STRING,
+uriport STRING,
+uripath STRING,
+uriquery STRING,
+uriextension STRING,
+serverip STRING,
+scbytes INT,
+csbytes INT,
+virusid STRING,
+bcappname STRING,
+bcappoper STRING,
+fulluri STRING
+)
+PARTITIONED BY (
+y STRING,
+m STRING,
+d STRING,
+h STRING
+)
+STORED AS PARQUET
+LOCATION '${huser}/proxy/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.proxy_edge (
+tdate STRING,
+time STRING, 
+clientip STRING, 
+host STRING, 
+webcat STRING, 
+respcode STRING, 
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+referer STRING,
+uriport STRING,
+serverip STRING, 
+scbytes INT, 
+csbytes INT, 
+fulluri STRING,
+hh INT,
+respcode_name STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/proxy/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.proxy_ingest_summary (
+tdate STRING,
+total BIGINT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/proxy/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.proxy_scores (
+tdate STRING,
+time STRING, 
+clientip STRING, 
+host STRING, 
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+duration INT,
+username STRING, 
+webcat STRING, 
+referer STRING,
+respcode INT,
+uriport INT, 
+uripath STRING,
+uriquery STRING, 
+serverip STRING, 
+scbytes INT, 
+csbytes INT, 
+fulluri STRING,
+word STRING, 
+ml_score FLOAT,
+uri_rep STRING,
+respcode_name STRING,
+network_context STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/proxy/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.proxy_storyboard (
+p_threat STRING, 
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/proxy/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.proxy_threat_investigation (
+tdate STRING,
+fulluri STRING,
+uri_sev INT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/proxy/hive/oa/threat_investigation';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${dbname}.proxy_timeline (
+p_threat STRING, 
+tstart STRING, 
+tend STRING, 
+duration BIGINT, 
+clientip STRING, 
+respcode STRING, 
+respcodename STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${huser}/proxy/hive/oa/timeline';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/create_dns_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/create_dns_parquet.hql b/spot-setup/create_dns_parquet.hql
deleted file mode 100755
index 38025c6..0000000
--- a/spot-setup/create_dns_parquet.hql
+++ /dev/null
@@ -1,163 +0,0 @@
-
--- Licensed to the Apache Software Foundation (ASF) under one or more
--- contributor license agreements.  See the NOTICE file distributed with
--- this work for additional information regarding copyright ownership.
--- The ASF licenses this file to You under the Apache License, Version 2.0
--- (the "License"); you may not use this file except in compliance with
--- the License.  You may obtain a copy of the License at
-
---    http://www.apache.org/licenses/LICENSE-2.0
-
--- Unless required by applicable law or agreed to in writing, software
--- distributed under the License is distributed on an "AS IS" BASIS,
--- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
--- See the License for the specific language governing permissions and
--- limitations under the License.
-
-
-CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns (
-frame_time STRING, 
-unix_tstamp BIGINT,
-frame_len INT,
-ip_dst STRING,
-ip_src STRING,
-dns_qry_name STRING,
-dns_qry_class STRING,
-dns_qry_type INT,
-dns_qry_rcode INT,
-dns_a STRING
-)
-PARTITIONED BY (
-y SMALLINT,
-m TINYINT,
-d TINYINT,
-h TINYINT
-)
-STORED AS PARQUET 
-LOCATION '${var:huser}/dns/hive';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.dns_dendro (
-unix_tstamp BIGINT,
-dns_a STRING,
-dns_qry_name STRING,
-ip_dst STRING
-)
-PARTITIONED BY (
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/dns/hive/oa/dendro';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.dns_edge ( 
-unix_tstamp BIGINT,
-frame_len BIGINT,
-ip_dst STRING,
-ip_src STRING,
-dns_qry_name STRING,
-dns_qry_class STRING,
-dns_qry_type INT,
-dns_qry_rcode INT,
-dns_a STRING,
-hh INT,
-dns_qry_class_name STRING,
-dns_qry_type_name STRING,
-dns_qry_rcode_name STRING,
-network_context STRING
-)
-PARTITIONED BY (
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/dns/hive/oa/edge';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.dns_ingest_summary ( 
-tdate STRING,
-total BIGINT
-)
-PARTITIONED BY (
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/dns/hive/oa/summary';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.dns_scores ( 
-frame_time STRING, 
-unix_tstamp BIGINT,
-frame_len BIGINT,
-ip_dst STRING, 
-dns_qry_name STRING, 
-dns_qry_class STRING,
-dns_qry_type INT,
-dns_qry_rcode INT, 
-ml_score FLOAT,
-tld STRING,
-query_rep STRING,
-hh INT,
-dns_qry_class_name STRING, 
-dns_qry_type_name STRING,
-dns_qry_rcode_name STRING, 
-network_context STRING 
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/dns/hive/oa/suspicious';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.dns_storyboard ( 
-ip_threat STRING,
-dns_threat STRING, 
-title STRING,
-text STRING
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/dns/hive/oa/storyboard';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.dns_threat_dendro (
-anchor STRING, 
-total BIGINT,
-dns_qry_name STRING, 
-ip_dst STRING
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/dns/hive/oa/threat_dendro';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.dns_threat_investigation ( 
-unix_tstamp BIGINT,
-ip_dst STRING, 
-dns_qry_name STRING, 
-ip_sev INT,
-dns_sev INT
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/dns/hive/oa/threat_investigation';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/create_flow_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/create_flow_parquet.hql b/spot-setup/create_flow_parquet.hql
deleted file mode 100755
index 41c4819..0000000
--- a/spot-setup/create_flow_parquet.hql
+++ /dev/null
@@ -1,195 +0,0 @@
-
--- Licensed to the Apache Software Foundation (ASF) under one or more
--- contributor license agreements.  See the NOTICE file distributed with
--- this work for additional information regarding copyright ownership.
--- The ASF licenses this file to You under the Apache License, Version 2.0
--- (the "License"); you may not use this file except in compliance with
--- the License.  You may obtain a copy of the License at
-
---    http://www.apache.org/licenses/LICENSE-2.0
-
--- Unless required by applicable law or agreed to in writing, software
--- distributed under the License is distributed on an "AS IS" BASIS,
--- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
--- See the License for the specific language governing permissions and
--- limitations under the License.
-
-
-CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow (
-treceived STRING,
-unix_tstamp BIGINT,
-tryear INT,
-trmonth INT,
-trday INT,
-trhour INT,
-trminute INT,
-trsec INT,
-tdur FLOAT,
-sip STRING,
-dip STRING,
-sport INT,
-dport INT,
-proto STRING,
-flag STRING,
-fwd INT,
-stos INT,
-ipkt BIGINT,
-ibyt BIGINT,
-opkt BIGINT, 
-obyt BIGINT,
-input INT,
-output INT,
-sas INT,
-das INT,
-dtos INT,
-dir INT,
-rip STRING
-)
-PARTITIONED BY (
-y SMALLINT,
-m TINYINT,
-d TINYINT,
-h TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/flow/hive';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.flow_chords (
-ip_threat STRING,
-srcip STRING,
-dstip STRING,
-ibyt BIGINT, 
-ipkt BIGINT
-)
-PARTITIONED BY (
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/flow/hive/oa/chords';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.flow_edge (
-tstart STRING, 
-srcip STRING,
-dstip STRING,
-sport INT, 
-dport INT, 
-proto STRING,
-flags STRING,
-tos INT, 
-ibyt BIGINT, 
-ipkt BIGINT, 
-input BIGINT,
-output BIGINT, 
-rip STRING,
-obyt BIGINT, 
-opkt BIGINT, 
-hh INT,
-mn INT 
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/flow/hive/oa/edge';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.flow_ingest_summary (
-tdate STRING,
-total BIGINT 
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/flow/hive/oa/summary';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.flow_scores (
-tstart STRING, 
-srcip STRING,
-dstip STRING,
-sport INT, 
-dport INT, 
-proto STRING,
-ipkt INT,
-ibyt INT,
-opkt INT,
-obyt INT,
-ml_score FLOAT,
-rank INT,
-srcip_INTernal INT,
-dstip_INTernal INT,
-src_geoloc STRING, 
-dst_geoloc STRING, 
-src_domain STRING, 
-dst_domain STRING, 
-src_rep STRING,
-dst_rep STRING 
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/flow/hive/oa/suspicious';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.flow_storyboard (
-ip_threat STRING,
-title STRING,
-text STRING
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/flow/hive/oa/storyboard';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.flow_threat_investigation ( 
-tstart STRING,
-srcip STRING, 
-dstip STRING, 
-srcport INT,
-dstport INT,
-score INT 
-) 
-PARTITIONED BY (
-y SMALLINT,
-m TINYINT,
-d TINYINT
-) 
-STORED AS PARQUET 
-LOCATION '${var:huser}/flow/hive/oa/threat_investigation';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.flow_timeline (
-ip_threat STRING,
-tstart STRING, 
-tend STRING, 
-srcip STRING,
-dstip STRING,
-proto STRING,
-sport INT, 
-dport INT, 
-ipkt BIGINT, 
-ibyt BIGINT
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/flow/hive/oa/timeline';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/create_proxy_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/create_proxy_parquet.hql b/spot-setup/create_proxy_parquet.hql
deleted file mode 100755
index f665dc2..0000000
--- a/spot-setup/create_proxy_parquet.hql
+++ /dev/null
@@ -1,177 +0,0 @@
-
--- Licensed to the Apache Software Foundation (ASF) under one or more
--- contributor license agreements.  See the NOTICE file distributed with
--- this work for additional information regarding copyright ownership.
--- The ASF licenses this file to You under the Apache License, Version 2.0
--- (the "License"); you may not use this file except in compliance with
--- the License.  You may obtain a copy of the License at
-
---    http://www.apache.org/licenses/LICENSE-2.0
-
--- Unless required by applicable law or agreed to in writing, software
--- distributed under the License is distributed on an "AS IS" BASIS,
--- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
--- See the License for the specific language governing permissions and
--- limitations under the License.
-
-
-CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy (
-p_date STRING,
-p_time STRING,
-clientip STRING,
-host STRING,
-reqmethod STRING,
-useragent STRING,
-resconttype STRING,
-duration INT,
-username STRING,
-authgroup STRING,
-exceptionid STRING,
-filterresult STRING,
-webcat STRING,
-referer STRING,
-respcode STRING,
-action STRING,
-urischeme STRING,
-uriport STRING,
-uripath STRING,
-uriquery STRING,
-uriextension STRING,
-serverip STRING,
-scbytes INT,
-csbytes INT,
-virusid STRING,
-bcappname STRING,
-bcappoper STRING,
-fulluri STRING
-)
-PARTITIONED BY (
-y STRING,
-m STRING,
-d STRING,
-h STRING
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/proxy/hive';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.proxy_edge ( 
-tdate STRING,
-time STRING, 
-clientip STRING, 
-host STRING, 
-webcat STRING, 
-respcode STRING, 
-reqmethod STRING,
-useragent STRING,
-resconttype STRING,
-referer STRING,
-uriport STRING,
-serverip STRING, 
-scbytes INT, 
-csbytes INT, 
-fulluri STRING,
-hh INT,
-respcode_name STRING 
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/proxy/hive/oa/edge';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.proxy_ingest_summary ( 
-tdate STRING,
-total BIGINT 
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/proxy/hive/oa/summary';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.proxy_scores ( 
-tdate STRING,
-time STRING, 
-clientip STRING, 
-host STRING, 
-reqmethod STRING,
-useragent STRING,
-resconttype STRING,
-duration INT,
-username STRING, 
-webcat STRING, 
-referer STRING,
-respcode INT,
-uriport INT, 
-uripath STRING,
-uriquery STRING, 
-serverip STRING, 
-scbytes INT, 
-csbytes INT, 
-fulluri STRING,
-word STRING, 
-ml_score FLOAT,
-uri_rep STRING,
-respcode_name STRING,
-network_context STRING 
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/proxy/hive/oa/suspicious';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.proxy_storyboard ( 
-p_threat STRING, 
-title STRING,
-text STRING
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/proxy/hive/oa/storyboard';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.proxy_threat_investigation ( 
-tdate STRING,
-fulluri STRING,
-uri_sev INT
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/proxy/hive/oa/threat_investigation';
-
-
-CREATE EXTERNAL TABLE ${var:dbname}.proxy_timeline ( 
-p_threat STRING, 
-tstart STRING, 
-tend STRING, 
-duration BIGINT, 
-clientip STRING, 
-respcode STRING, 
-respcodename STRING
-)
-PARTITIONED BY ( 
-y SMALLINT,
-m TINYINT,
-d TINYINT
-)
-STORED AS PARQUET
-LOCATION '${var:huser}/proxy/hive/oa/timeline';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/hive/create_dns_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/hive/create_dns_parquet.hql b/spot-setup/hive/create_dns_parquet.hql
new file mode 100755
index 0000000..8e31ed3
--- /dev/null
+++ b/spot-setup/hive/create_dns_parquet.hql
@@ -0,0 +1,165 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+SET hiveconf:huser;
+SET hiveconf:dbname;
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns (
+frame_time STRING, 
+unix_tstamp BIGINT,
+frame_len INT,
+ip_dst STRING,
+ip_src STRING,
+dns_qry_name STRING,
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT,
+dns_a STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT,
+h TINYINT
+)
+STORED AS PARQUET 
+LOCATION '${hiveconf:huser}/dns/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns_dendro (
+unix_tstamp BIGINT,
+dns_a STRING,
+dns_qry_name STRING,
+ip_dst STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/dns/hive/oa/dendro';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns_edge (
+unix_tstamp BIGINT,
+frame_len BIGINT,
+ip_dst STRING,
+ip_src STRING,
+dns_qry_name STRING,
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT,
+dns_a STRING,
+hh INT,
+dns_qry_class_name STRING,
+dns_qry_type_name STRING,
+dns_qry_rcode_name STRING,
+network_context STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/dns/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns_ingest_summary (
+tdate STRING,
+total BIGINT
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/dns/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns_scores (
+frame_time STRING, 
+unix_tstamp BIGINT,
+frame_len BIGINT,
+ip_dst STRING, 
+dns_qry_name STRING, 
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT, 
+ml_score FLOAT,
+tld STRING,
+query_rep STRING,
+hh INT,
+dns_qry_class_name STRING, 
+dns_qry_type_name STRING,
+dns_qry_rcode_name STRING, 
+network_context STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/dns/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns_storyboard (
+ip_threat STRING,
+dns_threat STRING, 
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/dns/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns_threat_dendro (
+anchor STRING, 
+total BIGINT,
+dns_qry_name STRING, 
+ip_dst STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/dns/hive/oa/threat_dendro';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.dns_threat_investigation (
+unix_tstamp BIGINT,
+ip_dst STRING, 
+dns_qry_name STRING, 
+ip_sev INT,
+dns_sev INT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/dns/hive/oa/threat_investigation';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/hive/create_flow_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/hive/create_flow_parquet.hql b/spot-setup/hive/create_flow_parquet.hql
new file mode 100755
index 0000000..034e194
--- /dev/null
+++ b/spot-setup/hive/create_flow_parquet.hql
@@ -0,0 +1,197 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+SET hiveconf:huser;
+SET hiveconf:dbname;
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow (
+treceived STRING,
+unix_tstamp BIGINT,
+tryear INT,
+trmonth INT,
+trday INT,
+trhour INT,
+trminute INT,
+trsec INT,
+tdur FLOAT,
+sip STRING,
+dip STRING,
+sport INT,
+dport INT,
+proto STRING,
+flag STRING,
+fwd INT,
+stos INT,
+ipkt BIGINT,
+ibyt BIGINT,
+opkt BIGINT, 
+obyt BIGINT,
+input INT,
+output INT,
+sas INT,
+das INT,
+dtos INT,
+dir INT,
+rip STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT,
+h TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/flow/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow_chords (
+ip_threat STRING,
+srcip STRING,
+dstip STRING,
+ibyt BIGINT, 
+ipkt BIGINT
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/flow/hive/oa/chords';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow_edge (
+tstart STRING, 
+srcip STRING,
+dstip STRING,
+sport INT, 
+dport INT, 
+proto STRING,
+flags STRING,
+tos INT, 
+ibyt BIGINT, 
+ipkt BIGINT, 
+input BIGINT,
+output BIGINT, 
+rip STRING,
+obyt BIGINT, 
+opkt BIGINT, 
+hh INT,
+mn INT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/flow/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow_ingest_summary (
+tdate STRING,
+total BIGINT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/flow/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow_scores (
+tstart STRING, 
+srcip STRING,
+dstip STRING,
+sport INT, 
+dport INT, 
+proto STRING,
+ipkt INT,
+ibyt INT,
+opkt INT,
+obyt INT,
+ml_score FLOAT,
+rank INT,
+srcip_INTernal INT,
+dstip_INTernal INT,
+src_geoloc STRING, 
+dst_geoloc STRING, 
+src_domain STRING, 
+dst_domain STRING, 
+src_rep STRING,
+dst_rep STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/flow/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow_storyboard (
+ip_threat STRING,
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/flow/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow_threat_investigation (
+tstart STRING,
+srcip STRING, 
+dstip STRING, 
+srcport INT,
+dstport INT,
+score INT 
+) 
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+) 
+STORED AS PARQUET 
+LOCATION '${hiveconf:huser}/flow/hive/oa/threat_investigation';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.flow_timeline (
+ip_threat STRING,
+tstart STRING, 
+tend STRING, 
+srcip STRING,
+dstip STRING,
+proto STRING,
+sport INT, 
+dport INT, 
+ipkt BIGINT, 
+ibyt BIGINT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/flow/hive/oa/timeline';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/hive/create_proxy_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/hive/create_proxy_parquet.hql b/spot-setup/hive/create_proxy_parquet.hql
new file mode 100755
index 0000000..16d90c0
--- /dev/null
+++ b/spot-setup/hive/create_proxy_parquet.hql
@@ -0,0 +1,179 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+SET hiveconf:huser;
+SET hiveconf:dbname;
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.proxy (
+p_date STRING,
+p_time STRING,
+clientip STRING,
+host STRING,
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+duration INT,
+username STRING,
+authgroup STRING,
+exceptionid STRING,
+filterresult STRING,
+webcat STRING,
+referer STRING,
+respcode STRING,
+action STRING,
+urischeme STRING,
+uriport STRING,
+uripath STRING,
+uriquery STRING,
+uriextension STRING,
+serverip STRING,
+scbytes INT,
+csbytes INT,
+virusid STRING,
+bcappname STRING,
+bcappoper STRING,
+fulluri STRING
+)
+PARTITIONED BY (
+y STRING,
+m STRING,
+d STRING,
+h STRING
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/proxy/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.proxy_edge (
+tdate STRING,
+time STRING, 
+clientip STRING, 
+host STRING, 
+webcat STRING, 
+respcode STRING, 
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+referer STRING,
+uriport STRING,
+serverip STRING, 
+scbytes INT, 
+csbytes INT, 
+fulluri STRING,
+hh INT,
+respcode_name STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/proxy/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.proxy_ingest_summary (
+tdate STRING,
+total BIGINT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/proxy/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.proxy_scores (
+tdate STRING,
+time STRING, 
+clientip STRING, 
+host STRING, 
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+duration INT,
+username STRING, 
+webcat STRING, 
+referer STRING,
+respcode INT,
+uriport INT, 
+uripath STRING,
+uriquery STRING, 
+serverip STRING, 
+scbytes INT, 
+csbytes INT, 
+fulluri STRING,
+word STRING, 
+ml_score FLOAT,
+uri_rep STRING,
+respcode_name STRING,
+network_context STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/proxy/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.proxy_storyboard (
+p_threat STRING, 
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/proxy/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.proxy_threat_investigation (
+tdate STRING,
+fulluri STRING,
+uri_sev INT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/proxy/hive/oa/threat_investigation';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:dbname}.proxy_timeline (
+p_threat STRING, 
+tstart STRING, 
+tend STRING, 
+duration BIGINT, 
+clientip STRING, 
+respcode STRING, 
+respcodename STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${hiveconf:huser}/proxy/hive/oa/timeline';
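
The Hive variants above are parameterized with hiveconf variables (huser for the HDFS application path, dbname for the target database), so the same DDL can be pointed at any deployment by a setup script. Below is a minimal sketch of driving one of them from Python; the huser/dbname values and the script path are illustrative assumptions only:

    # Hedged sketch: run a parameterized Hive DDL script with hiveconf values.
    import subprocess

    subprocess.check_call([
        "hive",
        "-hiveconf", "huser=/user/spot",                  # assumed HDFS app path
        "-hiveconf", "dbname=spotdb",                     # assumed database name
        "-f", "spot-setup/hive/create_flow_parquet.hql",
    ])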

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/impala/create_dns_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/impala/create_dns_parquet.hql b/spot-setup/impala/create_dns_parquet.hql
new file mode 100755
index 0000000..274ea9d
--- /dev/null
+++ b/spot-setup/impala/create_dns_parquet.hql
@@ -0,0 +1,163 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns (
+frame_time STRING, 
+unix_tstamp BIGINT,
+frame_len INT,
+ip_dst STRING,
+ip_src STRING,
+dns_qry_name STRING,
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT,
+dns_a STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT,
+h TINYINT
+)
+STORED AS PARQUET 
+LOCATION '${var:huser}/dns/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns_dendro (
+unix_tstamp BIGINT,
+dns_a STRING,
+dns_qry_name STRING,
+ip_dst STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/dns/hive/oa/dendro';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns_edge (
+unix_tstamp BIGINT,
+frame_len BIGINT,
+ip_dst STRING,
+ip_src STRING,
+dns_qry_name STRING,
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT,
+dns_a STRING,
+hh INT,
+dns_qry_class_name STRING,
+dns_qry_type_name STRING,
+dns_qry_rcode_name STRING,
+network_context STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/dns/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns_ingest_summary (
+tdate STRING,
+total BIGINT
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/dns/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns_scores (
+frame_time STRING, 
+unix_tstamp BIGINT,
+frame_len BIGINT,
+ip_dst STRING, 
+dns_qry_name STRING, 
+dns_qry_class STRING,
+dns_qry_type INT,
+dns_qry_rcode INT, 
+ml_score FLOAT,
+tld STRING,
+query_rep STRING,
+hh INT,
+dns_qry_class_name STRING, 
+dns_qry_type_name STRING,
+dns_qry_rcode_name STRING, 
+network_context STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/dns/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns_storyboard (
+ip_threat STRING,
+dns_threat STRING, 
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/dns/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns_threat_dendro (
+anchor STRING, 
+total BIGINT,
+dns_qry_name STRING, 
+ip_dst STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/dns/hive/oa/threat_dendro';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.dns_threat_investigation (
+unix_tstamp BIGINT,
+ip_dst STRING, 
+dns_qry_name STRING, 
+ip_sev INT,
+dns_sev INT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/dns/hive/oa/threat_investigation';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/impala/create_flow_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/impala/create_flow_parquet.hql b/spot-setup/impala/create_flow_parquet.hql
new file mode 100755
index 0000000..c8d3481
--- /dev/null
+++ b/spot-setup/impala/create_flow_parquet.hql
@@ -0,0 +1,195 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow (
+treceived STRING,
+unix_tstamp BIGINT,
+tryear INT,
+trmonth INT,
+trday INT,
+trhour INT,
+trminute INT,
+trsec INT,
+tdur FLOAT,
+sip STRING,
+dip STRING,
+sport INT,
+dport INT,
+proto STRING,
+flag STRING,
+fwd INT,
+stos INT,
+ipkt BIGINT,
+ibyt BIGINT,
+opkt BIGINT, 
+obyt BIGINT,
+input INT,
+output INT,
+sas INT,
+das INT,
+dtos INT,
+dir INT,
+rip STRING
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT,
+h TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/flow/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow_chords (
+ip_threat STRING,
+srcip STRING,
+dstip STRING,
+ibyt BIGINT, 
+ipkt BIGINT
+)
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/flow/hive/oa/chords';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow_edge (
+tstart STRING, 
+srcip STRING,
+dstip STRING,
+sport INT, 
+dport INT, 
+proto STRING,
+flags STRING,
+tos INT, 
+ibyt BIGINT, 
+ipkt BIGINT, 
+input BIGINT,
+output BIGINT, 
+rip STRING,
+obyt BIGINT, 
+opkt BIGINT, 
+hh INT,
+mn INT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/flow/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow_ingest_summary (
+tdate STRING,
+total BIGINT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/flow/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow_scores (
+tstart STRING, 
+srcip STRING,
+dstip STRING,
+sport INT, 
+dport INT, 
+proto STRING,
+ipkt INT,
+ibyt INT,
+opkt INT,
+obyt INT,
+ml_score FLOAT,
+rank INT,
+srcip_internal INT,
+dstip_internal INT,
+src_geoloc STRING, 
+dst_geoloc STRING, 
+src_domain STRING, 
+dst_domain STRING, 
+src_rep STRING,
+dst_rep STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/flow/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow_storyboard (
+ip_threat STRING,
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/flow/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow_threat_investigation (
+tstart STRING,
+srcip STRING, 
+dstip STRING, 
+srcport INT,
+dstport INT,
+score INT 
+) 
+PARTITIONED BY (
+y SMALLINT,
+m TINYINT,
+d TINYINT
+) 
+STORED AS PARQUET 
+LOCATION '${var:huser}/flow/hive/oa/threat_investigation';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.flow_timeline (
+ip_threat STRING,
+tstart STRING, 
+tend STRING, 
+srcip STRING,
+dstip STRING,
+proto STRING,
+sport INT, 
+dport INT, 
+ipkt BIGINT, 
+ibyt BIGINT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/flow/hive/oa/timeline';

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3383c07c/spot-setup/impala/create_proxy_parquet.hql
----------------------------------------------------------------------
diff --git a/spot-setup/impala/create_proxy_parquet.hql b/spot-setup/impala/create_proxy_parquet.hql
new file mode 100755
index 0000000..ddf3283
--- /dev/null
+++ b/spot-setup/impala/create_proxy_parquet.hql
@@ -0,0 +1,177 @@
+
+-- Licensed to the Apache Software Foundation (ASF) under one or more
+-- contributor license agreements.  See the NOTICE file distributed with
+-- this work for additional information regarding copyright ownership.
+-- The ASF licenses this file to You under the Apache License, Version 2.0
+-- (the "License"); you may not use this file except in compliance with
+-- the License.  You may obtain a copy of the License at
+
+--    http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing, software
+-- distributed under the License is distributed on an "AS IS" BASIS,
+-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+-- See the License for the specific language governing permissions and
+-- limitations under the License.
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy (
+p_date STRING,
+p_time STRING,
+clientip STRING,
+host STRING,
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+duration INT,
+username STRING,
+authgroup STRING,
+exceptionid STRING,
+filterresult STRING,
+webcat STRING,
+referer STRING,
+respcode STRING,
+action STRING,
+urischeme STRING,
+uriport STRING,
+uripath STRING,
+uriquery STRING,
+uriextension STRING,
+serverip STRING,
+scbytes INT,
+csbytes INT,
+virusid STRING,
+bcappname STRING,
+bcappoper STRING,
+fulluri STRING
+)
+PARTITIONED BY (
+y STRING,
+m STRING,
+d STRING,
+h STRING
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/proxy/hive';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy_edge (
+tdate STRING,
+time STRING, 
+clientip STRING, 
+host STRING, 
+webcat STRING, 
+respcode STRING, 
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+referer STRING,
+uriport STRING,
+serverip STRING, 
+scbytes INT, 
+csbytes INT, 
+fulluri STRING,
+hh INT,
+respcode_name STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/proxy/hive/oa/edge';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy_ingest_summary (
+tdate STRING,
+total BIGINT 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/proxy/hive/oa/summary';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy_scores (
+tdate STRING,
+time STRING, 
+clientip STRING, 
+host STRING, 
+reqmethod STRING,
+useragent STRING,
+resconttype STRING,
+duration INT,
+username STRING, 
+webcat STRING, 
+referer STRING,
+respcode INT,
+uriport INT, 
+uripath STRING,
+uriquery STRING, 
+serverip STRING, 
+scbytes INT, 
+csbytes INT, 
+fulluri STRING,
+word STRING, 
+ml_score FLOAT,
+uri_rep STRING,
+respcode_name STRING,
+network_context STRING 
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/proxy/hive/oa/suspicious';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy_storyboard (
+p_threat STRING, 
+title STRING,
+text STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/proxy/hive/oa/storyboard';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy_threat_investigation (
+tdate STRING,
+fulluri STRING,
+uri_sev INT
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/proxy/hive/oa/threat_investigation';
+
+
+CREATE EXTERNAL TABLE IF NOT EXISTS ${var:dbname}.proxy_timeline (
+p_threat STRING, 
+tstart STRING, 
+tend STRING, 
+duration BIGINT, 
+clientip STRING, 
+respcode STRING, 
+respcodename STRING
+)
+PARTITIONED BY ( 
+y SMALLINT,
+m TINYINT,
+d TINYINT
+)
+STORED AS PARQUET
+LOCATION '${var:huser}/proxy/hive/oa/timeline';
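
The Impala copies are identical except for the variable syntax: ${var:huser} and ${var:dbname} instead of ${hiveconf:...}, which lines up with the --var substitution offered by impala-shell. Once created, the tables can be read from Python through impyla; a hedged sketch with assumed host, port and database name:

    # Hedged sketch: query the flow table defined above via impyla.
    from impala.dbapi import connect

    conn = connect(host="impala-daemon.example.com", port=21050)  # assumed daemon
    cursor = conn.cursor()
    cursor.execute("USE spotdb")  # assumed database name
    cursor.execute(
        "SELECT treceived, sip, dip, ibyt "
        "FROM flow WHERE y = 2018 AND m = 1 AND d = 19 LIMIT 5"
    )
    for row in cursor.fetchall():
        print(row)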


[24/42] incubator-spot git commit: [SPOT-213][SPOT-223] Attempt to fix the Kerberos authentication issue in kerberos.py

Posted by na...@apache.org.
[SPOT-213][SPOT-223] Attempt to fix the Kerberos authentication issue in kerberos.py


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/d7b1d37e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/d7b1d37e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/d7b1d37e

Branch: refs/heads/SPOT-181_ODM
Commit: d7b1d37efc8ea23c35745dbd2fc32d1d5a69854f
Parents: 1582c4c
Author: natedogs911 <na...@gmail.com>
Authored: Fri Jan 19 09:43:05 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Fri Jan 19 09:43:05 2018 -0800

----------------------------------------------------------------------
 spot-ingest/common/kerberos.py | 42 +++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 18 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/d7b1d37e/spot-ingest/common/kerberos.py
----------------------------------------------------------------------
diff --git a/spot-ingest/common/kerberos.py b/spot-ingest/common/kerberos.py
index 1cdca78..95baef9 100755
--- a/spot-ingest/common/kerberos.py
+++ b/spot-ingest/common/kerberos.py
@@ -17,31 +17,37 @@
 # limitations under the License.
 #
 
-import os
-import subprocess
 import sys
+import os
+import common.configurator as config
+from common.utils import Util
 
-class Kerberos(object):
 
+class Kerberos(object):
     def __init__(self):
 
-        self._kinit =  os.getenv('KINITPATH')
-        self._kinitopts =  os.getenv('KINITOPTS')
-        self._keytab =  os.getenv('KEYTABPATH')
-        self._krb_user =  os.getenv('KRB_USER')
+        self._logger = Util.get_logger('SPOT.COMMON.KERBEROS')
+        principal, keytab, sasl_mech, security_proto = config.kerberos()
+
+        if os.getenv('KINITPATH'):
+            self._kinit = os.getenv('KINITPATH')
+        else:
+            self._kinit = "kinit"
+
+        self._kinitopts = os.getenv('KINITOPTS')
+        self._keytab = "-kt {0}".format(keytab)
+        self._krb_user = principal
 
-        if self._kinit == None or self._kinitopts == None or self._keytab == None or self._krb_user == None:
-            print "Please verify kerberos configuration, some environment variables are missing."
+        if self._kinit == None or self._keytab == None or self._krb_user == None:
+            self._logger.error("Please verify kerberos configuration, some environment variables are missing.")
             sys.exit(1)
 
-        self._kinit_args = [self._kinit,self._kinitopts,self._keytab,self._krb_user]
+        if self._kinitopts is None:
+            self._kinit_cmd = "{0} {1} {2}".format(self._kinit, self._keytab, self._krb_user)
+        else:
+            self._kinit_cmd = "{0} {1} {2} {3}".format(self._kinit, self._kinitopts, self._keytab, self._krb_user)
 
-	def authenticate(self):
+    def authenticate(self):
 
-		kinit = subprocess.Popen(self._kinit_args, stderr = subprocess.PIPE)
-		output,error = kinit.communicate()
-		if not kinit.returncode == 0:
-			if error:
-				print error.rstrip()
-				sys.exit(kinit.returncode)
-		print "Successfully authenticated!"
+        Util.execute_cmd(self._kinit_cmd, self._logger)
+        self._logger.info("Kerberos ticket obtained")


[14/42] incubator-spot git commit: Merge 'pr/127', apache/incubator-spot#127 aims to close SPOT-238

Posted by na...@apache.org.
Merge 'pr/127', apache/incubator-spot#127 aims to close SPOT-238


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/6deaae39
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/6deaae39
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/6deaae39

Branch: refs/heads/SPOT-181_ODM
Commit: 6deaae39b07d0c8b7b10da70e0239807bb401e5a
Parents: 2294bf4 3e7b628
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 9 19:24:47 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 9 19:24:47 2018 -0800

----------------------------------------------------------------------
 spot-oa/oa/dns/dns_oa.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------



[35/42] incubator-spot git commit: PEP8 proxy_parser()

Posted by na...@apache.org.
PEP8 proxy_parser()


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/b9befd7b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/b9befd7b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/b9befd7b

Branch: refs/heads/SPOT-181_ODM
Commit: b9befd7b0a882a327c31a691a57c07a86a64ff31
Parents: 8ff0e47
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 11:45:07 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 11:45:07 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/b9befd7b/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 5667204..c2ddb04 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -118,25 +118,32 @@ def proxy_parser(proxy_fields):
     Parse and normalize data.
 
     :param proxy_fields: list with fields from log
-    :returns: list of str
+    :returns: list
     """
     proxy_parsed_data = []
 
     if len(proxy_fields) > 1:
 
         # create full URI.
-        proxy_uri_path =  proxy_fields[17] if  len(proxy_fields[17]) > 1 else ""
-        proxy_uri_qry =  proxy_fields[18] if  len(proxy_fields[18]) > 1 else ""
-        full_uri= "{0}{1}{2}".format(proxy_fields[15],proxy_uri_path,proxy_uri_qry)
+        proxy_uri_path = proxy_fields[17] if len(proxy_fields[17]) > 1 else ""
+        proxy_uri_qry = proxy_fields[18] if len(proxy_fields[18]) > 1 else ""
+        full_uri = "{0}{1}{2}".format(proxy_fields[15], proxy_uri_path, proxy_uri_qry)
         date = proxy_fields[0].split('-')
-        year =  date[0]
+        year = date[0]
         month = date[1].zfill(2)
         day = date[2].zfill(2)
         hour = proxy_fields[1].split(":")[0].zfill(2)
-        # re-order fields. 
-        proxy_parsed_data = [proxy_fields[0],proxy_fields[1],proxy_fields[3],proxy_fields[15],proxy_fields[12],proxy_fields[20],proxy_fields[13],int(proxy_fields[2]),proxy_fields[4],
-        proxy_fields[5],proxy_fields[6],proxy_fields[7],proxy_fields[8],proxy_fields[9],proxy_fields[10],proxy_fields[11],proxy_fields[14],proxy_fields[16],proxy_fields[17],proxy_fields[18],
-        proxy_fields[19],proxy_fields[21],int(proxy_fields[22]),int(proxy_fields[23]),proxy_fields[24],proxy_fields[25],proxy_fields[26],full_uri,year,month,day,hour ]
+        # re-order fields.
+        proxy_parsed_data = [proxy_fields[0], proxy_fields[1], proxy_fields[3],
+                             proxy_fields[15], proxy_fields[12], proxy_fields[20],
+                             proxy_fields[13], int(proxy_fields[2]), proxy_fields[4],
+                             proxy_fields[5], proxy_fields[6], proxy_fields[7],
+                             proxy_fields[8], proxy_fields[9], proxy_fields[10],
+                             proxy_fields[11], proxy_fields[14], proxy_fields[16],
+                             proxy_fields[17], proxy_fields[18], proxy_fields[19],
+                             proxy_fields[21], int(proxy_fields[22]), int(proxy_fields[23]),
+                             proxy_fields[24], proxy_fields[25], proxy_fields[26],
+                             full_uri, year, month, day, hour]
 
     return proxy_parsed_data
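
Beyond the cosmetic PEP8 cleanup the behaviour is unchanged: records with one field or fewer are rejected, and anything longer is reordered and extended with the full URI plus year, month, day and hour values. A small hedged check (the import path is an assumption about how spot-ingest sits on the Python path):

    # Hedged sketch: too-short records come back as an empty list.
    from pipelines.proxy.bluecoat import proxy_parser

    assert proxy_parser([]) == []
    assert proxy_parser(["lonely-field"]) == []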
 


[19/42] incubator-spot git commit: [SPOT-213][SPOT-250][OA] update requirements for kerberos changes

Posted by na...@apache.org.
[SPOT-213][SPOT-250][OA] update requirements for kerberos changes


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/8b600c8f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/8b600c8f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/8b600c8f

Branch: refs/heads/SPOT-181_ODM
Commit: 8b600c8fcc2818223904a2d37f3177ccceb88811
Parents: afeb099
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 11:10:40 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 11:10:40 2018 -0800

----------------------------------------------------------------------
 spot-oa/requirements.txt | 2 ++
 1 file changed, 2 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/8b600c8f/spot-oa/requirements.txt
----------------------------------------------------------------------
diff --git a/spot-oa/requirements.txt b/spot-oa/requirements.txt
index 1faa1b6..2596e64 100644
--- a/spot-oa/requirements.txt
+++ b/spot-oa/requirements.txt
@@ -24,3 +24,5 @@ setuptools>=3.4.4
 thrift==0.9.3
 impyla
 hdfs
+requests
+


[08/42] incubator-spot git commit: portable bash shebang

Posted by na...@apache.org.
portable bash shebang


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/f935f1e1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/f935f1e1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/f935f1e1

Branch: refs/heads/SPOT-181_ODM
Commit: f935f1e1dbdabdd4cddf93f5e9db60072e3e5311
Parents: dbf6f51
Author: tpltnt <tp...@dropcut.net>
Authored: Fri Dec 29 16:43:03 2017 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Fri Dec 29 16:43:03 2017 +0100

----------------------------------------------------------------------
 spot-ingest/common/kafka_topic.sh      | 2 +-
 spot-ingest/start_ingest_standalone.sh | 2 +-
 spot-ml/ml_ops.sh                      | 2 +-
 spot-ml/ml_test.sh                     | 4 ++--
 spot-oa/runIpython.sh                  | 2 +-
 spot-setup/hdfs_setup.sh               | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f935f1e1/spot-ingest/common/kafka_topic.sh
----------------------------------------------------------------------
diff --git a/spot-ingest/common/kafka_topic.sh b/spot-ingest/common/kafka_topic.sh
index ab95495..4c078c9 100755
--- a/spot-ingest/common/kafka_topic.sh
+++ b/spot-ingest/common/kafka_topic.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f935f1e1/spot-ingest/start_ingest_standalone.sh
----------------------------------------------------------------------
diff --git a/spot-ingest/start_ingest_standalone.sh b/spot-ingest/start_ingest_standalone.sh
index 0e3bfd5..1a16612 100755
--- a/spot-ingest/start_ingest_standalone.sh
+++ b/spot-ingest/start_ingest_standalone.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f935f1e1/spot-ml/ml_ops.sh
----------------------------------------------------------------------
diff --git a/spot-ml/ml_ops.sh b/spot-ml/ml_ops.sh
index dd00bbc..abe3d06 100755
--- a/spot-ml/ml_ops.sh
+++ b/spot-ml/ml_ops.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f935f1e1/spot-ml/ml_test.sh
----------------------------------------------------------------------
diff --git a/spot-ml/ml_test.sh b/spot-ml/ml_test.sh
index 3036c93..7a4971a 100755
--- a/spot-ml/ml_test.sh
+++ b/spot-ml/ml_test.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more
@@ -80,4 +80,4 @@ time spark-submit --class "org.apache.spot.SuspiciousConnects" \
   --ldabeta ${LDA_BETA} \
   --ldaoptimizer ${LDA_OPTIMIZER} \
   --precision ${PRECISION} \
-  $USER_DOMAIN_CMD
\ No newline at end of file
+  $USER_DOMAIN_CMD

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f935f1e1/spot-oa/runIpython.sh
----------------------------------------------------------------------
diff --git a/spot-oa/runIpython.sh b/spot-oa/runIpython.sh
index 38a4121..26eaeff 100755
--- a/spot-oa/runIpython.sh
+++ b/spot-oa/runIpython.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f935f1e1/spot-setup/hdfs_setup.sh
----------------------------------------------------------------------
diff --git a/spot-setup/hdfs_setup.sh b/spot-setup/hdfs_setup.sh
index 86a26c0..df898c8 100755
--- a/spot-setup/hdfs_setup.sh
+++ b/spot-setup/hdfs_setup.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more


[25/42] incubator-spot git commit: [SPOT-213][SPOT-77][SPOT-221] Update spot-ingest to support Kerberos; implement a Hive client and librdkafka support

Posted by na...@apache.org.
[SPOT-213][SPOT-77][SPOT-221] Update spot-ingest to support Kerberos; implement a Hive client and librdkafka support


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/41e51b8f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/41e51b8f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/41e51b8f

Branch: refs/heads/SPOT-181_ODM
Commit: 41e51b8fab0ba7ebccba10e8e3052c7131cb43dc
Parents: d7b1d37
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 23 11:49:40 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 23 11:49:40 2018 -0800

----------------------------------------------------------------------
 spot-ingest/common/kafka_client.py       | 193 ++++++++++++++++++++------
 spot-ingest/master_collector.py          |  21 +--
 spot-ingest/pipelines/dns/collector.py   | 133 +++++++++++-------
 spot-ingest/pipelines/dns/worker.py      | 141 ++++++++++++++-----
 spot-ingest/pipelines/flow/collector.py  | 111 +++++++++------
 spot-ingest/pipelines/flow/worker.py     | 193 ++++++++++++++++++++------
 spot-ingest/pipelines/proxy/collector.py |   6 +-
 spot-ingest/worker.py                    |   6 +-
 8 files changed, 588 insertions(+), 216 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/common/kafka_client.py
----------------------------------------------------------------------
diff --git a/spot-ingest/common/kafka_client.py b/spot-ingest/common/kafka_client.py
index 977cb92..15441b2 100755
--- a/spot-ingest/common/kafka_client.py
+++ b/spot-ingest/common/kafka_client.py
@@ -19,23 +19,23 @@
 
 import logging
 import os
+import sys
 from common.utils import Util
-from kafka import KafkaProducer
-from kafka import KafkaConsumer as KC
-from kafka.partitioner.roundrobin import RoundRobinPartitioner
-from kafka.common import TopicPartition
+from confluent_kafka import Producer
+from confluent_kafka import Consumer
+import common.configurator as config
 
-class KafkaTopic(object):
 
+class KafkaProducer(object):
 
-    def __init__(self,topic,server,port,zk_server,zk_port,partitions):
+    def __init__(self, topic, server, port, zk_server, zk_port, partitions):
 
-        self._initialize_members(topic,server,port,zk_server,zk_port,partitions)
+        self._initialize_members(topic, server, port, zk_server, zk_port, partitions)
 
-    def _initialize_members(self,topic,server,port,zk_server,zk_port,partitions):
+    def _initialize_members(self, topic, server, port, zk_server, zk_port, partitions):
 
         # get logger isinstance
-        self._logger = logging.getLogger("SPOT.INGEST.KAFKA")
+        self._logger = logging.getLogger("SPOT.INGEST.KafkaProducer")
 
         # kafka requirements
         self._server = server
@@ -46,42 +46,93 @@ class KafkaTopic(object):
         self._num_of_partitions = partitions
         self._partitions = []
         self._partitioner = None
+        self._kafka_brokers = '{0}:{1}'.format(self._server, self._port)
 
         # create topic with partitions
         self._create_topic()
 
-    def _create_topic(self):
-
-        self._logger.info("Creating topic: {0} with {1} parititions".format(self._topic,self._num_of_partitions))     
+        self._kafka_conf = self._producer_config(self._kafka_brokers)
+
+        self._p = Producer(**self._kafka_conf)
+
+    def _producer_config(self, server):
+        # type: (str) -> dict
+        """Returns a configuration dictionary containing optional values"""
+
+        connection_conf = {
+            'bootstrap.servers': server,
+        }
+
+        if os.environ.get('KAFKA_DEBUG'):
+            connection_conf.update({'debug': 'all'})
+
+        if config.kerberos_enabled():
+            self._logger.info('Kerberos enabled')
+            principal, keytab, sasl_mech, security_proto = config.kerberos()
+            connection_conf.update({
+                'sasl.mechanisms': sasl_mech,
+                'security.protocol': security_proto,
+                'sasl.kerberos.principal': principal,
+                'sasl.kerberos.keytab': keytab,
+                'sasl.kerberos.min.time.before.relogin': 6000
+            })
+
+            sn = os.environ.get('KAFKA_SERVICE_NAME')
+            if sn:
+                self._logger.info('Setting Kerberos service name: ' + sn)
+                connection_conf.update({'sasl.kerberos.service.name': sn})
+
+            kinit_cmd = os.environ.get('KAFKA_KINIT')
+            if kinit_cmd:
+                self._logger.info('using kinit command: ' + kinit_cmd)
+                connection_conf.update({'sasl.kerberos.kinit.cmd': kinit_cmd})
+            else:
+                # Using -S %{sasl.kerberos.service.name}/%{broker.name} causes the ticket cache to refresh
+                # resulting in authentication errors for other services
+                connection_conf.update({
+                    'sasl.kerberos.kinit.cmd': 'kinit -S "%{sasl.kerberos.service.name}/%{broker.name}" -k -t "%{sasl.kerberos.keytab}" %{sasl.kerberos.principal}'
+                })
+
+        if config.ssl_enabled():
+            self._logger.info('Using SSL connection settings')
+            ssl_verify, ca_location, cert, key = config.ssl()
+            connection_conf.update({
+                'ssl.certificate.location': cert,
+                'ssl.ca.location': ca_location,
+                'ssl.key.location': key
+            })
+
+        return connection_conf
 
-        # Create partitions for the workers.
-        self._partitions = [ TopicPartition(self._topic,p) for p in range(int(self._num_of_partitions))]        
+    def _create_topic(self):
 
-        # create partitioner
-        self._partitioner = RoundRobinPartitioner(self._partitions)
+        self._logger.info("Creating topic: {0} with {1} parititions".format(self._topic, self._num_of_partitions))
         
         # get script path 
-        zk_conf = "{0}:{1}".format(self._zk_server,self._zk_port)
-        create_topic_cmd = "{0}/kafka_topic.sh create {1} {2} {3}".format(os.path.dirname(os.path.abspath(__file__)),self._topic,zk_conf,self._num_of_partitions)
+        zk_conf = "{0}:{1}".format(self._zk_server, self._zk_port)
+        create_topic_cmd = "{0}/kafka_topic.sh create {1} {2} {3}".format(
+            os.path.dirname(os.path.abspath(__file__)),
+            self._topic,
+            zk_conf,
+            self._num_of_partitions
+        )
 
         # execute create topic cmd
-        Util.execute_cmd(create_topic_cmd,self._logger)
+        Util.execute_cmd(create_topic_cmd, self._logger)
 
-    def send_message(self,message,topic_partition):
+    def SendMessage(self, message, topic):
+        p = self._p
+        p.produce(topic, message.encode('utf-8'), callback=self._delivery_callback)
+        p.poll(0)
+        p.flush(timeout=3600000)
 
-        self._logger.info("Sending message to: Topic: {0} Partition:{1}".format(self._topic,topic_partition))
-        kafka_brokers = '{0}:{1}'.format(self._server,self._port)             
-        producer = KafkaProducer(bootstrap_servers=[kafka_brokers],api_version_auto_timeout_ms=3600000)
-        future = producer.send(self._topic,message,partition=topic_partition)
-        producer.flush(timeout=3600000)
-        producer.close()
-    
     @classmethod
-    def SendMessage(cls,message,kafka_servers,topic,partition=0):
-        producer = KafkaProducer(bootstrap_servers=kafka_servers,api_version_auto_timeout_ms=3600000)
-        future = producer.send(topic,message,partition=partition)
-        producer.flush(timeout=3600000)
-        producer.close()  
+    def _delivery_callback(cls, err, msg):
+        if err:
+            sys.stderr.write('%% Message failed delivery: %s\n' % err)
+        else:
+            sys.stderr.write('%% Message delivered to %s [%d]\n' %
+                             (msg.topic(), msg.partition()))
 
     @property
     def Topic(self):
@@ -93,22 +144,24 @@ class KafkaTopic(object):
 
     @property
     def Zookeeper(self):
-        zk = "{0}:{1}".format(self._zk_server,self._zk_port)
+        zk = "{0}:{1}".format(self._zk_server, self._zk_port)
         return zk
 
     @property
     def BootstrapServers(self):
-        servers = "{0}:{1}".format(self._server,self._port) 
+        servers = "{0}:{1}".format(self._server, self._port)
         return servers
 
 
 class KafkaConsumer(object):
     
-    def __init__(self,topic,server,port,zk_server,zk_port,partition):
+    def __init__(self, topic, server, port, zk_server, zk_port, partition):
+
+        self._initialize_members(topic, server, port, zk_server, zk_port, partition)
 
-        self._initialize_members(topic,server,port,zk_server,zk_port,partition)
+    def _initialize_members(self, topic, server, port, zk_server, zk_port, partition):
 
-    def _initialize_members(self,topic,server,port,zk_server,zk_port,partition):
+        self._logger = logging.getLogger("SPOT.INGEST.KafkaConsumer")
 
         self._topic = topic
         self._server = server
@@ -116,14 +169,64 @@ class KafkaConsumer(object):
         self._zk_server = zk_server
         self._zk_port = zk_port
         self._id = partition
+        self._kafka_brokers = '{0}:{1}'.format(self._server, self._port)
+        self._kafka_conf = self._consumer_config(self._id, self._kafka_brokers)
+
+    def _consumer_config(self, groupid, server):
+        # type: (dict) -> dict
+        """Returns a configuration dictionary containing optional values"""
+
+        connection_conf = {
+            'bootstrap.servers': server,
+            'group.id': groupid,
+        }
+
+        if config.kerberos_enabled():
+            self._logger.info('Kerberos enabled')
+            principal, keytab, sasl_mech, security_proto = config.kerberos()
+            connection_conf.update({
+                'sasl.mechanisms': sasl_mech,
+                'security.protocol': security_proto,
+                'sasl.kerberos.principal': principal,
+                'sasl.kerberos.keytab': keytab,
+                'sasl.kerberos.min.time.before.relogin': 6000,
+                'default.topic.config': {
+                    'auto.commit.enable': 'true',
+                    'auto.commit.interval.ms': '60000',
+                    'auto.offset.reset': 'smallest'}
+            })
+
+            sn = os.environ.get('KAFKA_SERVICE_NAME')
+            if sn:
+                self._logger.info('Setting Kerberos service name: ' + sn)
+                connection_conf.update({'sasl.kerberos.service.name': sn})
+
+            kinit_cmd = os.environ.get('KAFKA_KINIT')
+            if kinit_cmd:
+                self._logger.info('using kinit command: ' + kinit_cmd)
+                connection_conf.update({'sasl.kerberos.kinit.cmd': kinit_cmd})
+            else:
+                # Using -S %{sasl.kerberos.service.name}/%{broker.name} causes the ticket cache to refresh
+                # resulting in authentication errors for other services
+                connection_conf.update({
+                    'sasl.kerberos.kinit.cmd': 'kinit -k -t "%{sasl.kerberos.keytab}" %{sasl.kerberos.principal}'
+                })
+
+        if config.ssl_enabled():
+            self._logger.info('Using SSL connection settings')
+            ssl_verify, ca_location, cert, key = config.ssl()
+            connection_conf.update({
+                'ssl.certificate.location': cert,
+                'ssl.ca.location': ca_location,
+                'ssl.key.location': key
+            })
+
+        return connection_conf
 
     def start(self):
-        
-        kafka_brokers = '{0}:{1}'.format(self._server,self._port)
-        consumer =  KC(bootstrap_servers=[kafka_brokers],group_id=self._topic)
-        partition = [TopicPartition(self._topic,int(self._id))]
-        consumer.assign(partitions=partition)
-        consumer.poll()
+
+        consumer = Consumer(**self._kafka_conf)
+        consumer.subscribe([self._topic])
         return consumer
 
     @property
@@ -132,6 +235,4 @@ class KafkaConsumer(object):
 
     @property
     def ZookeperServer(self):
-        return "{0}:{1}".format(self._zk_server,self._zk_port)
-
-    
+        return "{0}:{1}".format(self._zk_server, self._zk_port)

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/master_collector.py
----------------------------------------------------------------------
diff --git a/spot-ingest/master_collector.py b/spot-ingest/master_collector.py
index 6f6ff7c..23be9f4 100755
--- a/spot-ingest/master_collector.py
+++ b/spot-ingest/master_collector.py
@@ -24,14 +24,15 @@ import sys
 import datetime
 from common.utils import Util
 from common.kerberos import Kerberos
-from common.kafka_client import KafkaTopic
-
+import common.configurator as Config
+from common.kafka_client import KafkaProducer
 
 # get master configuration.
 SCRIPT_PATH = os.path.dirname(os.path.abspath(__file__))
 CONF_FILE = "{0}/ingest_conf.json".format(SCRIPT_PATH)
 MASTER_CONF = json.loads(open(CONF_FILE).read())
 
+
 def main():
 
     # input Parameters
@@ -49,6 +50,7 @@ def main():
     # start collector based on data source type.
     start_collector(args.type, args.workers_num, args.ingest_id)
 
+
 def start_collector(type, workers_num, id=None):
 
     # generate ingest id
@@ -68,7 +70,7 @@ def start_collector(type, workers_num, id=None):
         sys.exit(1)
 
     # validate if kerberos authentication is required.
-    if os.getenv('KRB_AUTH'):
+    if Config.kerberos_enabled():
         kb = Kerberos()
         kb.authenticate()
 
@@ -80,17 +82,20 @@ def start_collector(type, workers_num, id=None):
     # required zookeeper info.
     zk_server = MASTER_CONF["kafka"]['zookeper_server']
     zk_port = MASTER_CONF["kafka"]['zookeper_port']
-
-    topic = "SPOT-INGEST-{0}_{1}".format(type, ingest_id) if not id else id
-    kafka = KafkaTopic(topic, k_server, k_port, zk_server, zk_port, workers_num)
+         
+    topic = "{0}".format(type,ingest_id) if not id else id
+    producer = KafkaProducer(topic, k_server, k_port, zk_server, zk_port, workers_num)
 
     # create a collector instance based on data source type.
     logger.info("Starting {0} ingest instance".format(topic))
-    module = __import__("pipelines.{0}.collector".format(MASTER_CONF["pipelines"][type]["type"]), fromlist=['Collector'])
+    module = __import__("pipelines.{0}.collector".
+                        format(MASTER_CONF["pipelines"][type]["type"]),
+                        fromlist=['Collector'])
 
     # start collector.
-    ingest_collector = module.Collector(MASTER_CONF['hdfs_app_path'], kafka, type)
+    ingest_collector = module.Collector(MASTER_CONF['hdfs_app_path'], producer, type)
     ingest_collector.start()
 
+
 if __name__ == '__main__':
     main()

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/pipelines/dns/collector.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/dns/collector.py b/spot-ingest/pipelines/dns/collector.py
index c421c47..97c5ed6 100755
--- a/spot-ingest/pipelines/dns/collector.py
+++ b/spot-ingest/pipelines/dns/collector.py
@@ -18,26 +18,29 @@
 #
 
 import time
+import logging
 import os
-import subprocess
 import json
-import logging
 from multiprocessing import Process
 from common.utils import Util
+from common import hdfs_client as hdfs
+from common.hdfs_client import HdfsException
 from common.file_collector import FileWatcher
 from multiprocessing import Pool
-from common.kafka_client import KafkaTopic
+
 
 class Collector(object):
 
-    def __init__(self, hdfs_app_path, kafka_topic, conf_type):
-        self._initialize_members(hdfs_app_path, kafka_topic, conf_type)
+    def __init__(self, hdfs_app_path, kafkaproducer, conf_type):
+
+        self._initialize_members(hdfs_app_path, kafkaproducer, conf_type)
+
+    def _initialize_members(self, hdfs_app_path, kafkaproducer, conf_type):
 
-    def _initialize_members(self, hdfs_app_path, kafka_topic, conf_type):
         # getting parameters.
         self._logger = logging.getLogger('SPOT.INGEST.DNS')
         self._hdfs_app_path = hdfs_app_path
-        self._kafka_topic = kafka_topic
+        self._producer = kafkaproducer
 
         # get script path
         self._script_path = os.path.dirname(os.path.abspath(__file__))
@@ -64,6 +67,8 @@ class Collector(object):
         self._processes = conf["collector_processes"]
         self._ingestion_interval = conf["ingestion_interval"]
         self._pool = Pool(processes=self._processes)
+        # TODO: review re-use of hdfs.client
+        self._hdfs_client = hdfs.get_client()
 
     def start(self):
 
@@ -74,74 +79,108 @@ class Collector(object):
             while True:
                 self._ingest_files_pool()
                 time.sleep(self._ingestion_interval)
-
         except KeyboardInterrupt:
             self._logger.info("Stopping DNS collector...")
-            Util.remove_kafka_topic(self._kafka_topic.Zookeeper, self._kafka_topic.Topic, self._logger)
+            Util.remove_kafka_topic(self._producer.Zookeeper, self._producer.Topic, self._logger)
             self._watcher.stop()
             self._pool.terminate()
             self._pool.close()
             self._pool.join()
             SystemExit("Ingest finished...")
 
-
     def _ingest_files_pool(self):
+
         if self._watcher.HasFiles:
+
             for x in range(0, self._processes):
-                file = self._watcher.GetNextFile()
-                resutl = self._pool.apply_async(ingest_file, args=(file, self._pkt_num, self._pcap_split_staging, self._kafka_topic.Partition, self._hdfs_root_path, self._kafka_topic.Topic, self._kafka_topic.BootstrapServers, ))
-                #resutl.get() # to debug add try and catch.
-                if  not self._watcher.HasFiles: break    
+                self._logger.info('processes: {0}'.format(self._processes))
+                new_file = self._watcher.GetNextFile()
+                if self._processes <= 1:
+                    _ingest_file(
+                        self._hdfs_client,
+                        new_file,
+                        self._pkt_num,
+                        self._pcap_split_staging,
+                        self._hdfs_root_path,
+                        self._producer,
+                        self._producer.Topic
+                        )
+                else:
+                    result = self._pool.apply_async(_ingest_file, args=(
+                        self._hdfs_client,
+                        new_file,
+                        self._pkt_num,
+                        self._pcap_split_staging,
+                        self._hdfs_root_path,
+                        self._producer,
+                        self._producer.Topic
+                        ))
+                # result.get()  # to debug, wrap the call in try/except.
+                if not self._watcher.HasFiles:
+                    break
         return True
 
-def ingest_file(file,pkt_num,pcap_split_staging, partition,hdfs_root_path,topic,kafka_servers):
+
+def _ingest_file(hdfs_client, new_file, pkt_num, pcap_split_staging, hdfs_root_path, producer, topic):
 
     logger = logging.getLogger('SPOT.INGEST.DNS.{0}'.format(os.getpid()))
     
     try:
         # get file name and date.
-        org_file = file
-        file_name_parts = file.split('/')
+        org_file = new_file
+        file_name_parts = new_file.split('/')
         file_name = file_name_parts[len(file_name_parts)-1]
 
         # split file.
         name = file_name.split('.')[0]
-        split_cmd = "editcap -c {0} {1} {2}/{3}_spot.pcap".format(pkt_num,file,pcap_split_staging,name)
+        split_cmd = "editcap -c {0} {1} {2}/{3}_spot.pcap".format(pkt_num,
+                                                                  new_file,
+                                                                  pcap_split_staging,
+                                                                  name)
         logger.info("Splitting file: {0}".format(split_cmd))
         Util.execute_cmd(split_cmd,logger)
 
         logger.info("Removing file: {0}".format(org_file))
         rm_big_file = "rm {0}".format(org_file)
-        Util.execute_cmd(rm_big_file,logger)    
-
-        for currdir,subdir,files in os.walk(pcap_split_staging):
-            for file in files:
-                if file.endswith(".pcap") and "{0}_spot".format(name) in file:
-
-                        # get timestamp from the file name to build hdfs path.
-                        file_date = file.split('.')[0]
-                        pcap_hour = file_date[-6:-4]
-                        pcap_date_path = file_date[-14:-6]
-
-                        # hdfs path with timestamp.
-                        hdfs_path = "{0}/binary/{1}/{2}".format(hdfs_root_path,pcap_date_path,pcap_hour)
-
-                        # create hdfs path.
-                        Util.creat_hdfs_folder(hdfs_path,logger)
-
-                        # load file to hdfs.
-                        hadoop_pcap_file = "{0}/{1}".format(hdfs_path,file)
-                        Util.load_to_hdfs(os.path.join(currdir,file),hadoop_pcap_file,logger)
-
-                        # create event for workers to process the file.
-                        logger.info( "Sending split file to worker number: {0}".format(partition))
-                        KafkaTopic.SendMessage(hadoop_pcap_file,kafka_servers,topic,partition)
-                        logger.info("File {0} has been successfully sent to Kafka Topic to: {1}".format(file,topic))
-
+        Util.execute_cmd(rm_big_file,logger)
 
-  
     except Exception as err:
-        
-        logger.error("There was a problem, please check the following error message:{0}".format(err.message))
+        logger.error("There was a problem splitting the file: {0}".format(err.message))
         logger.error("Exception: {0}".format(err))
 
+    for currdir, subdir, files in os.walk(pcap_split_staging):
+        for file in files:
+            if file.endswith(".pcap") and "{0}_spot".format(name) in file:
+                # get timestamp from the file name to build hdfs path.
+                file_date = file.split('.')[0]
+                pcap_hour = file_date[-6:-4]
+                pcap_date_path = file_date[-14:-6]
+
+                # hdfs path with timestamp.
+                hdfs_path = "{0}/binary/{1}/{2}".format(hdfs_root_path, pcap_date_path, pcap_hour)
+
+                # create hdfs path.
+                try:
+                    if len(hdfs.list_dir(hdfs_path, hdfs_client)) == 0:
+                        logger.info('creating directory: ' + hdfs_path)
+                        hdfs_client.mkdir(hdfs_path, hdfs_client)
+
+                    # load file to hdfs.
+                    hadoop_pcap_file = "{0}/{1}".format(hdfs_path,file)
+                    result = hdfs_client.upload_file(hadoop_pcap_file, os.path.join(currdir,file))
+                    if not result:
+                        logger.error('File failed to upload: ' + hadoop_pcap_file)
+                        raise HdfsException
+
+                    # create event for workers to process the file.
+                    logger.info( "Sending split file to Topic: {0}".format(topic))
+                    producer.SendMessage(hadoop_pcap_file, topic)
+                    logger.info("File {0} has been successfully sent to Kafka Topic to: {1}".format(file,topic))
+
+                except HdfsException as err:
+                    logger.error('Exception: ' + err.exception)
+                    logger.info('Check Hdfs Connection settings and server health')
+
+                except Exception as err:
+                    logger.info("File {0} failed to be sent to Kafka Topic to: {1}".format(new_file,topic))
+                    logger.error("Error: {0}".format(err))
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/pipelines/dns/worker.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/dns/worker.py b/spot-ingest/pipelines/dns/worker.py
index 6f51f45..f23fa8f 100755
--- a/spot-ingest/pipelines/dns/worker.py
+++ b/spot-ingest/pipelines/dns/worker.py
@@ -21,18 +21,22 @@ import logging
 import datetime
 import subprocess
 import json
+import sys
 import os
 from multiprocessing import Process
 from common.utils import Util
+from common import hive_engine
+from common import hdfs_client as hdfs
+from confluent_kafka import KafkaError, KafkaException
 
 
 class Worker(object):
 
-    def __init__(self,db_name,hdfs_app_path,kafka_consumer,conf_type,processes=None):
+    def __init__(self, db_name, hdfs_app_path, kafka_consumer, conf_type, processes=None):
         
-        self._initialize_members(db_name,hdfs_app_path,kafka_consumer,conf_type)
+        self._initialize_members(db_name,hdfs_app_path, kafka_consumer, conf_type)
 
-    def _initialize_members(self,db_name,hdfs_app_path,kafka_consumer,conf_type):
+    def _initialize_members(self, db_name, hdfs_app_path, kafka_consumer, conf_type):
 
         # get logger instance.
         self._logger = Util.get_logger('SPOT.INGEST.WRK.DNS')
@@ -44,32 +48,58 @@ class Worker(object):
         self._script_path = os.path.dirname(os.path.abspath(__file__))
         conf_file = "{0}/ingest_conf.json".format(os.path.dirname(os.path.dirname(self._script_path)))
         conf = json.loads(open(conf_file).read())
-        self._conf = conf["pipelines"][conf_type] 
+        self._conf = conf["pipelines"][conf_type]
+        self._id = "spot-{0}-worker".format(conf_type)
 
         self._process_opt = self._conf['process_opt']
         self._local_staging = self._conf['local_staging']
         self.kafka_consumer = kafka_consumer
 
+        self._cursor = hive_engine.create_connection()
+
     def start(self):
 
         self._logger.info("Listening topic:{0}".format(self.kafka_consumer.Topic))
-        for message in self.kafka_consumer.start():
-            self._new_file(message.value)
-
-    def _new_file(self,file):
-
-        self._logger.info("-------------------------------------- New File received --------------------------------------")
+        consumer = self.kafka_consumer.start()
+
+        try:
+            while True:
+                message = consumer.poll(timeout=1.0)
+                if message is None:
+                    continue
+                if not message.error():
+                    self._new_file(message.value().decode('utf-8'))
+                elif message.error():
+                    if message.error().code() == KafkaError._PARTITION_EOF:
+                        continue
+                    elif message.error:
+                        raise KafkaException(message.error())
+
+        except KeyboardInterrupt:
+            sys.stderr.write('%% Aborted by user\n')
+
+        consumer.close()
+
+    def _new_file(self, nf):
+
+        self._logger.info(
+            "-------------------------------------- New File received --------------------------------------"
+        )
         self._logger.info("File: {0} ".format(file))        
-        p = Process(target=self._process_new_file, args=(file,))
+        p = Process(target=self._process_new_file, args=(nf, ))
         p.start() 
         p.join()
 
-    def _process_new_file(self,file):
+    def _process_new_file(self, nf):
+
 
         # get file from hdfs
-        get_file_cmd = "hadoop fs -get {0} {1}.".format(file,self._local_staging)
-        self._logger.info("Getting file from hdfs: {0}".format(get_file_cmd))
-        Util.execute_cmd(get_file_cmd,self._logger)
+        self._logger.info("Getting file from hdfs: {0}".format(nf))
+        if hdfs.file_exists(nf):
+            hdfs.download_file(nf, self._local_staging)
+        else:
+            self._logger.info("file: {0} not found".format(nf))
+            # TODO: error handling
 
         # get file name and date
-        file_name_parts = file.split('/')
+        file_name_parts = nf.split('/')
@@ -82,37 +112,86 @@ class Worker(object):
         binary_day = binary_date_path[6:8]
 
         # build process cmd.
-        process_cmd = "tshark -r {0}{1} {2} > {0}{1}.csv".format(self._local_staging,file_name,self._process_opt)
+        process_cmd = "tshark -r {0}{1} {2} > {0}{1}.csv".format(self._local_staging, file_name, self._process_opt)
         self._logger.info("Processing file: {0}".format(process_cmd))
-        Util.execute_cmd(process_cmd,self._logger)
+        Util.execute_cmd(process_cmd, self._logger)
 
         # create hdfs staging.
         hdfs_path = "{0}/dns".format(self._hdfs_app_path)
         staging_timestamp = datetime.datetime.now().strftime('%M%S%f')[:-4]
         hdfs_staging_path =  "{0}/stage/{1}".format(hdfs_path,staging_timestamp)
-        create_staging_cmd = "hadoop fs -mkdir -p {0}".format(hdfs_staging_path)
-        self._logger.info("Creating staging: {0}".format(create_staging_cmd))
-        Util.execute_cmd(create_staging_cmd,self._logger)
+        self._logger.info("Creating staging: {0}".format(hdfs_staging_path))
+        hdfs.mkdir(hdfs_staging_path)
 
         # move to stage.
-        mv_to_staging ="hadoop fs -moveFromLocal {0}{1}.csv {2}/.".format(self._local_staging,file_name,hdfs_staging_path)
-        self._logger.info("Moving data to staging: {0}".format(mv_to_staging))
-        Util.execute_cmd(mv_to_staging,self._logger)
+        local_file = "{0}{1}.csv".format(self._local_staging, file_name)
+        self._logger.info("Moving data to staging: {0}".format(hdfs_staging_path))
+        hdfs.upload_file(hdfs_staging_path, local_file)
 
         #load to avro
-        load_to_avro_cmd = "hive -hiveconf dbname={0} -hiveconf y={1} -hiveconf m={2} -hiveconf d={3} -hiveconf h={4} -hiveconf data_location='{5}' -f pipelines/dns/load_dns_avro_parquet.hql".format(self._db_name,binary_year,binary_month,binary_day,binary_hour,hdfs_staging_path)
-
-        self._logger.info("Loading data to hive: {0}".format(load_to_avro_cmd))
-        Util.execute_cmd(load_to_avro_cmd,self._logger)
+        drop_table = 'DROP TABLE IF EXISTS {0}.dns_tmp'.format(self._db_name)
+        self._cursor.execute(drop_table)
+
+        # Create external table
+        create_external = ("\n"
+                           "CREATE EXTERNAL TABLE {0}.dns_tmp (\n"
+                           "  frame_day STRING,\n"
+                           "  frame_time STRING,\n"
+                           "  unix_tstamp BIGINT,\n"
+                           "  frame_len INT,\n"
+                           "  ip_src STRING,\n"
+                           "  ip_dst STRING,\n"
+                           "  dns_qry_name STRING,\n"
+                           "  dns_qry_type INT,\n"
+                           "  dns_qry_class STRING,\n"
+                           "  dns_qry_rcode INT,\n"
+                           "  dns_a STRING  \n"
+                           "  )\n"
+                           "  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
+                           "  STORED AS TEXTFILE\n"
+                           "  LOCATION '{1}'\n"
+                           "  TBLPROPERTIES ('avro.schema.literal'='{{\n"
+                           "  \"type\":   \"record\"\n"
+                           "  , \"name\":   \"RawDnsRecord\"\n"
+                           "  , \"namespace\" : \"com.cloudera.accelerators.dns.avro\"\n"
+                           "  , \"fields\": [\n"
+                           "      {{\"name\": \"frame_day\",        \"type\":[\"string\", \"null\"]}\n"
+                           "      , {{\"name\": \"frame_time\",     \"type\":[\"string\", \"null\"]}\n"
+                           "      , {{\"name\": \"unix_tstamp\",    \"type\":[\"bigint\", \"null\"]}\n"
+                           "      , {{\"name\": \"frame_len\",      \"type\":[\"int\",    \"null\"]}\n"
+                           "      , {{\"name\": \"ip_src\",         \"type\":[\"string\", \"null\"]}\n"
+                           "      , {{\"name\": \"ip_dst\",         \"type\":[\"string\", \"null\"]}\n"
+                           "      , {{\"name\": \"dns_qry_name\",   \"type\":[\"string\", \"null\"]}\n"
+                           "      , {{\"name\": \"dns_qry_type\",   \"type\":[\"int\",    \"null\"]}\n"
+                           "      , {{\"name\": \"dns_qry_class\",  \"type\":[\"string\", \"null\"]}\n"
+                           "      , {{\"name\": \"dns_qry_rcode\",  \"type\":[\"int\",    \"null\"]}\n"
+                           "      , {{\"name\": \"dns_a\",          \"type\":[\"string\", \"null\"]}\n"
+                           "      ]\n"
+                           "}')\n"
+                           ).format(self._db_name, hdfs_staging_path)
+        self._logger.info( "Creating external table: {0}".format(create_external))
+        self._cursor.execute(create_external)
+
+        # Insert data
+        insert_into_table = """
+            INSERT INTO TABLE {0}.dns
+            PARTITION (y={1}, m={2}, d={3}, h={4})
+            SELECT   CONCAT(frame_day , frame_time) as treceived, unix_tstamp, frame_len, ip_dst, ip_src, dns_qry_name,
+            dns_qry_class,dns_qry_type, dns_qry_rcode, dns_a 
+            FROM {0}.dns_tmp
+        """.format(self._db_name,binary_year,binary_month,binary_day,binary_hour)
+        self._logger.info( "Loading data to {0}: {1}"
+                           .format(self._db_name, insert_into_table)
+                           )
+        self._cursor.execute(insert_into_table)
 
         # remove from hdfs staging
-        rm_hdfs_staging_cmd = "hadoop fs -rm -R -skipTrash {0}".format(hdfs_staging_path)
-        self._logger.info("Removing staging path: {0}".format(rm_hdfs_staging_cmd))
-        Util.execute_cmd(rm_hdfs_staging_cmd,self._logger)
+        self._logger.info("Removing staging path: {0}".format(hdfs_staging_path))
+        hdfs.delete_folder(hdfs_staging_path)
 
         # remove from local staging.
         rm_local_staging = "rm {0}{1}".format(self._local_staging,file_name)
         self._logger.info("Removing files from local staging: {0}".format(rm_local_staging))
-        Util.execute_cmd(rm_local_staging,self._logger)
+        Util.execute_cmd(rm_local_staging, self._logger)
 
         self._logger.info("File {0} was successfully processed.".format(file_name))

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/pipelines/flow/collector.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/flow/collector.py b/spot-ingest/pipelines/flow/collector.py
index b9a97f2..5e5cd49 100755
--- a/spot-ingest/pipelines/flow/collector.py
+++ b/spot-ingest/pipelines/flow/collector.py
@@ -23,22 +23,24 @@ import os
 import json
 from multiprocessing import Process
 from common.utils import Util
+from common import hdfs_client as hdfs
+from common.hdfs_client import HdfsException
 from common.file_collector import FileWatcher
 from multiprocessing import Pool
-from common.kafka_client import KafkaTopic
+
 
 class Collector(object):
 
-    def __init__(self,hdfs_app_path,kafka_topic,conf_type):
+    def __init__(self, hdfs_app_path, kafkaproducer, conf_type):
         
-        self._initialize_members(hdfs_app_path,kafka_topic,conf_type)
+        self._initialize_members(hdfs_app_path, kafkaproducer, conf_type)
+
+    def _initialize_members(self, hdfs_app_path, kafkaproducer, conf_type):
 
-    def _initialize_members(self,hdfs_app_path,kafka_topic,conf_type):
-  
         # getting parameters.
         self._logger = logging.getLogger('SPOT.INGEST.FLOW')
         self._hdfs_app_path = hdfs_app_path
-        self._kafka_topic = kafka_topic
+        self._producer = kafkaproducer
 
         # get script path
         self._script_path = os.path.dirname(os.path.abspath(__file__))
@@ -62,6 +64,8 @@ class Collector(object):
         self._processes = conf["collector_processes"]
         self._ingestion_interval = conf["ingestion_interval"]
         self._pool = Pool(processes=self._processes)
+        # TODO: review re-use of hdfs.client
+        self._hdfs_client = hdfs.get_client()
 
     def start(self):
 
@@ -74,54 +78,83 @@ class Collector(object):
                 time.sleep(self._ingestion_interval)
         except KeyboardInterrupt:
             self._logger.info("Stopping FLOW collector...")  
-            Util.remove_kafka_topic(self._kafka_topic.Zookeeper,self._kafka_topic.Topic,self._logger)          
+            Util.remove_kafka_topic(self._producer.Zookeeper, self._producer.Topic, self._logger)
             self._watcher.stop()
             self._pool.terminate()
             self._pool.close()            
             self._pool.join()
             SystemExit("Ingest finished...")
-    
 
     def _ingest_files_pool(self):            
        
         if self._watcher.HasFiles:
             
-            for x in range(0,self._processes):
-                file = self._watcher.GetNextFile()
-                resutl = self._pool.apply_async(ingest_file,args=(file,self._kafka_topic.Partition,self._hdfs_root_path ,self._kafka_topic.Topic,self._kafka_topic.BootstrapServers,))
-                #resutl.get() # to debug add try and catch.
-                if  not self._watcher.HasFiles: break    
+            for x in range(0, self._processes):
+                self._logger.info('processes: {0}'.format(self._processes))
+                new_file = self._watcher.GetNextFile()
+                if self._processes <= 1:
+                    _ingest_file(
+                                 new_file,
+                                 self._hdfs_root_path,
+                                 self._producer,
+                                 self._producer.Topic
+                                 )
+                else:
+                    result = self._pool.apply_async(_ingest_file, args=(
+                        new_file,
+                        self._hdfs_root_path,
+                        self._producer,
+                        self._producer.Topic
+                    ))
+                    # result.get()  # to debug add try and catch.
+                if not self._watcher.HasFiles:
+                    break
         return True
-    
 
 
-def ingest_file(file,partition,hdfs_root_path,topic,kafka_servers):
+def _ingest_file(new_file, hdfs_root_path, producer, topic):
 
-        logger = logging.getLogger('SPOT.INGEST.FLOW.{0}'.format(os.getpid()))
-
-        try:
+    logger = logging.getLogger('SPOT.INGEST.FLOW.{0}'.format(os.getpid()))
 
-            # get file name and date.
-            file_name_parts = file.split('/')
-            file_name = file_name_parts[len(file_name_parts)-1]
-            file_date = file_name.split('.')[1]
+    try:
 
-            file_date_path = file_date[0:8]
-            file_date_hour = file_date[8:10]
+        # get file name and date.
+        file_name_parts = new_file.split('/')
+        file_name = file_name_parts[len(file_name_parts)-1]
+        file_date = file_name.split('.')[1]
+        file_date_path = file_date[0:8]
+        file_date_hour = file_date[8:10]
 
-            # hdfs path with timestamp.
-            hdfs_path = "{0}/binary/{1}/{2}".format(hdfs_root_path,file_date_path,file_date_hour)
-            Util.creat_hdfs_folder(hdfs_path,logger)
+        # hdfs path with timestamp.
+        hdfs_path = "{0}/binary/{1}/{2}".format(hdfs_root_path, file_date_path, file_date_hour)
+        hdfs_file = "{0}/{1}".format(hdfs_path, file_name)
 
-            # load to hdfs.
-            hdfs_file = "{0}/{1}".format(hdfs_path,file_name)
-            Util.load_to_hdfs(file,hdfs_file,logger)
-
-            # create event for workers to process the file.
-            logger.info("Sending file to worker number: {0}".format(partition))
-            KafkaTopic.SendMessage(hdfs_file,kafka_servers,topic,partition)    
-            logger.info("File {0} has been successfully sent to Kafka Topic to: {1}".format(file,topic))
-
-        except Exception as err:
-            logger.error("There was a problem, please check the following error message:{0}".format(err.message))
-            logger.error("Exception: {0}".format(err))
+        try:
+            if len(hdfs.list_dir(hdfs_path)) == 0:
+                logger.info('creating directory: ' + hdfs_path)
+                hdfs.mkdir(hdfs_path)
+            logger.info('uploading file to hdfs: ' + hdfs_file)
+            result = hdfs.upload_file(hdfs_path, new_file)
+            if not result:
+                logger.error('File failed to upload: ' + hdfs_file)
+                raise HdfsException
+            else:
+                rm_file = "rm {0}".format(new_file)
+                logger.info("Removing files from local staging: {0}".format(rm_file))
+                Util.execute_cmd(rm_file, logger)
+
+        except HdfsException as err:
+            logger.error('Exception: ' + err.exception)
+            logger.info('Check Hdfs Connection settings and server health')
+
+    except Exception as err:
+        logger.error("There was a problem, Exception: {0}".format(err))
+
+        # create event for workers to process the file.
+        # logger.info("Sending file to worker number: {0}".format(partition))
+    try:
+        producer.SendMessage(hdfs_file, topic)
+        logger.info("File {0} has been successfully sent to Kafka Topic to: {1}".format(hdfs_file, topic))
+    except Exception as err:
+        logger.info("File {0} failed to be sent to Kafka Topic to: {1}".format(hdfs_file, topic))
+        logger.error("Error: {0}".format(err))

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/pipelines/flow/worker.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/flow/worker.py b/spot-ingest/pipelines/flow/worker.py
index 1630022..bb957a5 100755
--- a/spot-ingest/pipelines/flow/worker.py
+++ b/spot-ingest/pipelines/flow/worker.py
@@ -22,17 +22,20 @@ import subprocess
 import datetime
 import logging
 import os
-import json 
+import json
+import sys
 from multiprocessing import Process
 from common.utils import Util
+from common import hive_engine
+from common import hdfs_client as hdfs
+from confluent_kafka import KafkaError, KafkaException
 
 
 class Worker(object):
 
-    def __init__(self,db_name,hdfs_app_path,kafka_consumer,conf_type,processes=None):
-        self._initialize_members(db_name,hdfs_app_path,kafka_consumer,conf_type)
+    def __init__(self, db_name, hdfs_app_path, kafka_consumer, conf_type, processes=None):
+        self._initialize_members(db_name, hdfs_app_path, kafka_consumer, conf_type)
 
-    def _initialize_members(self,db_name,hdfs_app_path,kafka_consumer,conf_type):
+    def _initialize_members(self, db_name, hdfs_app_path, kafka_consumer, conf_type):
 
         # get logger instance.
         self._logger = Util.get_logger('SPOT.INGEST.WRK.FLOW')
@@ -45,76 +48,186 @@ class Worker(object):
         conf_file = "{0}/ingest_conf.json".format(os.path.dirname(os.path.dirname(self._script_path)))
         conf = json.loads(open(conf_file).read())
         self._conf = conf["pipelines"][conf_type]
+        self._id = "spot-{0}-worker".format(conf_type)
 
         self._process_opt = self._conf['process_opt']
         self._local_staging = self._conf['local_staging']
         self.kafka_consumer = kafka_consumer
 
+        # self._cursor = hive_engine.create_connection()
+        self._cursor = hive_engine
+
     def start(self):
 
         self._logger.info("Listening topic:{0}".format(self.kafka_consumer.Topic))
-        for message in self.kafka_consumer.start():
-            self._new_file(message.value)
-
-    def _new_file(self,file):
-
-        self._logger.info("-------------------------------------- New File received --------------------------------------")
-        self._logger.info("File: {0} ".format(file))        
-        p = Process(target=self._process_new_file, args=(file,))
+        consumer = self.kafka_consumer.start()
+        try:
+            while True:
+                message = consumer.poll(timeout=1.0)
+                if message is None:
+                    continue
+                if not message.error():
+                    self._new_file(message.value().decode('utf-8'))
+                elif message.error():
+                    if message.error().code() == KafkaError._PARTITION_EOF:
+                        continue
+                    elif message.error:
+                        raise KafkaException(message.error())
+
+        except KeyboardInterrupt:
+            sys.stderr.write('%% Aborted by user\n')
+
+        consumer.close()
+
+    def _new_file(self, nf):
+
+        self._logger.info(
+            "-------------------------------------- New File received --------------------------------------"
+        )
+        self._logger.info("File: {0} ".format(nf))
+
+        p = Process(target=self._process_new_file, args=(nf, ))
         p.start()
         p.join()
         
-    def _process_new_file(self,file):
-
-        # get file from hdfs
-        get_file_cmd = "hadoop fs -get {0} {1}.".format(file,self._local_staging)
-        self._logger.info("Getting file from hdfs: {0}".format(get_file_cmd))
-        Util.execute_cmd(get_file_cmd,self._logger)
+    def _process_new_file(self, nf):
 
         # get file name and date
-        file_name_parts = file.split('/')
+        file_name_parts = nf.split('/')
         file_name = file_name_parts[len(file_name_parts)-1]
-
+        nf_path = nf[:-len(file_name)]  # strip the file name to keep the directory path
         flow_date = file_name.split('.')[1]
         flow_year = flow_date[0:4]
         flow_month = flow_date[4:6]
         flow_day = flow_date[6:8]
         flow_hour = flow_date[8:10]
 
+        # get file from hdfs
+        if hdfs.file_exists(nf_path, file_name):
+            self._logger.info("Getting file from hdfs: {0}".format(nf))
+            hdfs.download_file(nf, self._local_staging)
+        else:
+            self._logger.info("file: {0} not found".format(nf))
+            # TODO: error handling
+
         # build process cmd.
-        process_cmd = "nfdump -o csv -r {0}{1} {2} > {0}{1}.csv".format(self._local_staging,file_name,self._process_opt)
+        sf = "{0}{1}.csv".format(self._local_staging,file_name)
+        process_cmd = "nfdump -o csv -r {0}{1} {2} > {3}".format(self._local_staging, file_name, self._process_opt, sf)
         self._logger.info("Processing file: {0}".format(process_cmd))
-        Util.execute_cmd(process_cmd,self._logger)        
+        Util.execute_cmd(process_cmd,self._logger)
 
         # create hdfs staging.
         hdfs_path = "{0}/flow".format(self._hdfs_app_path)
         staging_timestamp = datetime.datetime.now().strftime('%M%S%f')[:-4]
-        hdfs_staging_path =  "{0}/stage/{1}".format(hdfs_path,staging_timestamp)
-        create_staging_cmd = "hadoop fs -mkdir -p {0}".format(hdfs_staging_path)
-        self._logger.info("Creating staging: {0}".format(create_staging_cmd))
-        Util.execute_cmd(create_staging_cmd,self._logger)
+        hdfs_staging_path = "{0}/stage/{1}".format(hdfs_path,staging_timestamp)
+        self._logger.info("Creating staging: {0}".format(hdfs_staging_path))
+        hdfs.mkdir(hdfs_staging_path)
 
         # move to stage.
-        mv_to_staging ="hadoop fs -moveFromLocal {0}{1}.csv {2}/.".format(self._local_staging,file_name,hdfs_staging_path)
-        self._logger.info("Moving data to staging: {0}".format(mv_to_staging))
-        subprocess.call(mv_to_staging,shell=True)
-
-        #load to avro
-        load_to_avro_cmd = "hive -hiveconf dbname={0} -hiveconf y={1} -hiveconf m={2} -hiveconf d={3} -hiveconf h={4} -hiveconf data_location='{5}' -f pipelines/flow/load_flow_avro_parquet.hql".format(self._db_name,flow_year,flow_month,flow_day,flow_hour,hdfs_staging_path)
-
-        self._logger.info( "Loading data to hive: {0}".format(load_to_avro_cmd))
-        Util.execute_cmd(load_to_avro_cmd,self._logger)
+        local_file = "{0}{1}.csv".format(self._local_staging, file_name)
+        self._logger.info("Moving data to staging: {0}".format(hdfs_staging_path))
+        hdfs.upload_file(hdfs_staging_path, local_file)
+
+        # load with impyla
+        drop_table = "DROP TABLE IF EXISTS {0}.flow_tmp".format(self._db_name)
+        self._logger.info( "Dropping temp table: {0}".format(drop_table))
+        self._cursor.execute_query(drop_table)
+
+        create_external = ("\n"
+                           "CREATE EXTERNAL TABLE {0}.flow_tmp (\n"
+                           "  treceived STRING,\n"
+                           "  tryear INT,\n"
+                           "  trmonth INT,\n"
+                           "  trday INT,\n"
+                           "  trhour INT,\n"
+                           "  trminute INT,\n"
+                           "  trsec INT,\n"
+                           "  tdur FLOAT,\n"
+                           "  sip  STRING,\n"
+                           "  dip STRING,\n"
+                           "  sport INT,\n"
+                           "  dport INT,\n"
+                           "  proto STRING,\n"
+                           "  flag STRING,\n"
+                           "  fwd INT,\n"
+                           "  stos INT,\n"
+                           "  ipkt BIGINT,\n"
+                           "  ibyt BIGINT,\n"
+                           "  opkt BIGINT,\n"
+                           "  obyt BIGINT,\n"
+                           "  input INT,\n"
+                           "  output INT,\n"
+                           "  sas INT,\n"
+                           "  das INT,\n"
+                           "  dtos INT,\n"
+                           "  dir INT,\n"
+                           "  rip STRING\n"
+                           "  )\n"
+                           "  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
+                           "  STORED AS TEXTFILE\n"
+                           "  LOCATION '{1}'\n"
+                           "  TBLPROPERTIES ('avro.schema.literal'='{{\n"
+                           "  \"type\":   \"record\"\n"
+                           "  , \"name\":   \"RawFlowRecord\"\n"
+                           "  , \"namespace\" : \"com.cloudera.accelerators.flows.avro\"\n"
+                           "  , \"fields\": [\n"
+                           "      {{\"name\": \"treceived\",             \"type\":[\"string\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"tryear\",              \"type\":[\"float\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"trmonth\",             \"type\":[\"float\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"trday\",               \"type\":[\"float\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"trhour\",              \"type\":[\"float\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"trminute\",            \"type\":[\"float\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"trsec\",               \"type\":[\"float\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"tdur\",                \"type\":[\"float\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"sip\",                \"type\":[\"string\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"sport\",                 \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"dip\",                \"type\":[\"string\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"dport\",                 \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"proto\",              \"type\":[\"string\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"flag\",               \"type\":[\"string\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"fwd\",                   \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"stos\",                  \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"ipkt\",               \"type\":[\"bigint\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"ibytt\",              \"type\":[\"bigint\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"opkt\",               \"type\":[\"bigint\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"obyt\",               \"type\":[\"bigint\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"input\",                 \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"output\",                \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"sas\",                   \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"das\",                   \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"dtos\",                  \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"dir\",                   \"type\":[\"int\",   \"null\"]}}\n"
+                           "      ,  {{\"name\": \"rip\",                \"type\":[\"string\",   \"null\"]}}\n"
+                           "      ]\n"
+                           "}}')\n"
+                           ).format(self._db_name, hdfs_staging_path)
+        self._logger.info( "Creating external table: {0}".format(create_external))
+        self._cursor.execute_query(create_external)
+
+        insert_into_table = """
+        INSERT INTO TABLE {0}.flow
+        PARTITION (y={1}, m={2}, d={3}, h={4})
+        SELECT   treceived,  unix_timestamp(treceived) AS unix_tstamp, tryear,  trmonth, trday,  trhour,  trminute,  trsec,
+          tdur,  sip, dip, sport, dport,  proto,  flag,  fwd,  stos,  ipkt,  ibyt,  opkt,  obyt,  input,  output,
+          sas,  das,  dtos,  dir,  rip
+        FROM {0}.flow_tmp
+        """.format(self._db_name, flow_year, flow_month, flow_day, flow_hour)
+        self._logger.info( "Loading data to {0}: {1}"
+                           .format(self._db_name, insert_into_table)
+                           )
+        self._cursor.execute_query(insert_into_table)
 
         # remove from hdfs staging
-        rm_hdfs_staging_cmd = "hadoop fs -rm -R -skipTrash {0}".format(hdfs_staging_path)
-        self._logger.info("Removing staging path: {0}".format(rm_hdfs_staging_cmd))
-        Util.execute_cmd(rm_hdfs_staging_cmd,self._logger)
+        self._logger.info("Removing staging path: {0}".format(hdfs_staging_path))
+        hdfs.delete_folder(hdfs_staging_path)
 
         # remove from local staging.
         rm_local_staging = "rm {0}{1}".format(self._local_staging,file_name)
         self._logger.info("Removing files from local staging: {0}".format(rm_local_staging))
         Util.execute_cmd(rm_local_staging,self._logger)
 
-        self._logger.info("File {0} was successfully processed.".format(file_name))
-
+        rm_local_staging = "rm {0}".format(sf)
+        self._logger.info("Removing files from local staging: {0}".format(rm_local_staging))
+        Util.execute_cmd(rm_local_staging,self._logger)
 
+        self._logger.info("File {0} was successfully processed.".format(file_name))

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/pipelines/proxy/collector.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/collector.py b/spot-ingest/pipelines/proxy/collector.py
index 69d708c..008b568 100644
--- a/spot-ingest/pipelines/proxy/collector.py
+++ b/spot-ingest/pipelines/proxy/collector.py
@@ -23,7 +23,7 @@ import os
 import sys
 import copy
 from common.utils import Util, NewFileEvent
-from common.kafka_client import KafkaTopic
+from common.kafka_client import KafkaProducer
 from multiprocessing import Pool
 from common.file_collector import FileWatcher
 import time
@@ -106,10 +106,10 @@ def ingest_file(file,message_size,topic,kafka_servers):
             for line in f:
                 message += line
                 if len(message) > message_size:
-                    KafkaTopic.SendMessage(message,kafka_servers,topic,0)
+                    KafkaProducer.SendMessage(message, kafka_servers, topic, 0)
                     message = ""
             #send the last package.        
-            KafkaTopic.SendMessage(message,kafka_servers,topic,0)            
+            KafkaProducer.SendMessage(message, kafka_servers, topic, 0)
         rm_file = "rm {0}".format(file)
         Util.execute_cmd(rm_file,logger)
         logger.info("File {0} has been successfully sent to Kafka Topic: {1}".format(file,topic))

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/41e51b8f/spot-ingest/worker.py
----------------------------------------------------------------------
diff --git a/spot-ingest/worker.py b/spot-ingest/worker.py
index 5c29148..ce758c5 100755
--- a/spot-ingest/worker.py
+++ b/spot-ingest/worker.py
@@ -24,11 +24,13 @@ import sys
 from common.utils import Util
 from common.kerberos import Kerberos
 from common.kafka_client import KafkaConsumer
+import common.configurator as Config
 
 SCRIPT_PATH = os.path.dirname(os.path.abspath(__file__))
 CONF_FILE = "{0}/ingest_conf.json".format(SCRIPT_PATH)
 WORKER_CONF = json.loads(open(CONF_FILE).read())
 
+
 def main():
 
     # input parameters
@@ -63,8 +65,8 @@ def start_worker(type, topic, id, processes=None):
         logger.error("The provided data source {0} is not valid".format(type))
         sys.exit(1)
 
-    # validate if kerberos authentication is requiered.
-    if os.getenv('KRB_AUTH'):
+    # validate if kerberos authentication is required.
+    if Config.kerberos_enabled():
         kb = Kerberos()
         kb.authenticate()
 


[06/42] incubator-spot git commit: Spot-196: Changes: - TopDomains - small refactoring, deleted unused variable. - Decoupled SpotLDAWrapper into different components for better modularity. - Modified Flow, DNS and Proxy SuspiciousConnectsModel objects to

Posted by na...@apache.org.
Spot-196: Changes:
- TopDomains - small refactoring, deleted unused variable.
- Decoupled SpotLDAWrapper into separate components (SpotLDAHelper, SpotLDAModel, SpotLDAResult) for better modularity; see the usage sketch below.
- Modified the Flow, DNS and Proxy SuspiciousConnectsModel objects to reflect the changes in SpotLDAWrapper.
- Updated the unit tests and implemented new ones to cover the changes in SpotLDAWrapper.
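
For readers skimming the diff, the refactored pieces compose roughly as in the sketch below. It mirrors the new call sequence visible in DNSSuspiciousConnectsModel.trainModel; the topic count, alpha/beta, optimizer and iteration values are illustrative placeholders rather than values taken from this commit.

import org.apache.log4j.Logger
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spot.lda.{SpotLDAHelper, SpotLDAInput, SpotLDAModel, SpotLDAResult, SpotLDAWrapper}
import org.apache.spot.utilities.FloatPointPrecisionUtility64

object SpotLDAUsageSketch {

  def scoreCorpus(sparkSession: SparkSession,
                  logger: Logger,
                  docWordCounts: RDD[SpotLDAInput]): SpotLDAResult = {

    // 1. Package the corpus together with the word and document dictionaries.
    val helper: SpotLDAHelper =
      SpotLDAHelper(docWordCounts, FloatPointPrecisionUtility64, sparkSession)

    // 2. Train; returns a SpotDistributedLDAModel ("em") or a SpotLocalLDAModel ("online").
    val model: SpotLDAModel = SpotLDAWrapper.run(
      topicCount = 20,           // illustrative value
      logger = logger,
      ldaSeed = None,
      ldaAlpha = 1.02,           // illustrative value
      ldaBeta = 1.001,           // illustrative value
      ldaOptimizerOption = "em",
      maxIterations = 20,        // illustrative value
      helper = helper)

    // 3. Score; the result exposes documentToTopicMix (DataFrame) and
    //    wordToTopicMix (Map[String, Array[Double]]) already formatted for Spot scoring.
    model.predict(helper)
  }
}

As the new Scaladoc notes, the local (online) model can score a helper built from new documents, while the distributed (em) model must be scored with the same helper that was used for training.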


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/45c03ab6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/45c03ab6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/45c03ab6

Branch: refs/heads/SPOT-181_ODM
Commit: 45c03ab6ab97ffbd2a1228fb5dbd071511ce3c7b
Parents: 2ebe572
Author: Ricardo Barona <ri...@intel.com>
Authored: Thu Aug 3 14:02:13 2017 -0500
Committer: Ricardo Barona <ri...@intel.com>
Committed: Fri Oct 6 15:25:58 2017 -0500

----------------------------------------------------------------------
 .../dns/model/DNSSuspiciousConnectsModel.scala  |  43 ++--
 .../org/apache/spot/lda/SpotLDAHelper.scala     | 172 ++++++++++++++
 .../org/apache/spot/lda/SpotLDAModel.scala      | 140 +++++++++++
 .../org/apache/spot/lda/SpotLDAResult.scala     |  43 ++++
 .../org/apache/spot/lda/SpotLDAWrapper.scala    | 226 +++---------------
 .../model/FlowSuspiciousConnectsModel.scala     |  27 +--
 .../proxy/ProxySuspiciousConnectsModel.scala    |  25 +-
 .../org/apache/spot/utilities/TopDomains.scala  |   1 -
 .../org/apache/spot/lda/SpotLDAHelperTest.scala | 133 +++++++++++
 .../apache/spot/lda/SpotLDAWrapperTest.scala    | 238 ++++++-------------
 10 files changed, 636 insertions(+), 412 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/dns/model/DNSSuspiciousConnectsModel.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/dns/model/DNSSuspiciousConnectsModel.scala b/spot-ml/src/main/scala/org/apache/spot/dns/model/DNSSuspiciousConnectsModel.scala
index dfcb543..7245acf 100644
--- a/spot-ml/src/main/scala/org/apache/spot/dns/model/DNSSuspiciousConnectsModel.scala
+++ b/spot-ml/src/main/scala/org/apache/spot/dns/model/DNSSuspiciousConnectsModel.scala
@@ -19,22 +19,18 @@ package org.apache.spot.dns.model
 
 import org.apache.log4j.Logger
 import org.apache.spark.broadcast.Broadcast
-import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types._
-import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
 import org.apache.spot.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
 import org.apache.spot.dns.DNSSchema._
 import org.apache.spot.dns.DNSWordCreation
-import org.apache.spot.lda.SpotLDAWrapper
-import org.apache.spot.lda.SpotLDAWrapper._
 import org.apache.spot.lda.SpotLDAWrapperSchema._
+import org.apache.spot.lda._
 import org.apache.spot.utilities.DomainProcessor.DomainInfo
 import org.apache.spot.utilities._
 import org.apache.spot.utilities.data.validation.InvalidDataHandler
 
-import scala.util.{Failure, Success, Try}
-
 
 /**
   * A probabilistic model of the DNS queries issued by each client IP.
@@ -50,17 +46,17 @@ import scala.util.{Failure, Success, Try}
   *
   * Create these models using the  factory in the companion object.
   *
-  * @param inTopicCount          Number of topics to use in the topic model.
-  * @param inIpToTopicMix        Per-IP topic mix.
-  * @param inWordToPerTopicProb  Per-word,  an array of probability of word given topic per topic.
+  * @param inTopicCount         Number of topics to use in the topic model.
+  * @param inIpToTopicMix       Per-IP topic mix.
+  * @param inWordToPerTopicProb Per-word,  an array of probability of word given topic per topic.
   */
 class DNSSuspiciousConnectsModel(inTopicCount: Int,
                                  inIpToTopicMix: DataFrame,
                                  inWordToPerTopicProb: Map[String, Array[Double]]) {
 
-  val topicCount = inTopicCount
-  val ipToTopicMix = inIpToTopicMix
-  val wordToPerTopicProb = inWordToPerTopicProb
+  val topicCount: Int = inTopicCount
+  val ipToTopicMix: DataFrame = inIpToTopicMix
+  val wordToPerTopicProb: Map[String, Array[Double]] = inWordToPerTopicProb
 
   /**
     * Use a suspicious connects model to assign estimated probabilities to a dataframe of
@@ -128,7 +124,7 @@ object DNSSuspiciousConnectsModel {
     QueryTypeField,
     QueryResponseCodeField))
 
-  val modelColumns = ModelSchema.fieldNames.toList.map(col)
+  val modelColumns: List[Column] = ModelSchema.fieldNames.toList.map(col)
 
   val DomainStatsSchema = StructType(List(TopDomainField, SubdomainLengthField, SubdomainEntropyField, NumPeriodsField))
 
@@ -136,7 +132,7 @@ object DNSSuspiciousConnectsModel {
     * Create a new DNS Suspicious Connects model by training it on a data frame and a feedback file.
     *
     * @param sparkSession Spark Session
-    * @param logger
+    * @param logger       Application logger
     * @param config       Analysis configuration object containing CLI parameters.
     *                     Contains the path to the feedback file in config.scoresFile
     * @param inputRecords Data used to train the model.
@@ -155,7 +151,6 @@ object DNSSuspiciousConnectsModel {
       config.feedbackFile,
       config.duplicationFactor))
 
-    val countryCodesBC = sparkSession.sparkContext.broadcast(CountryCodes.CountryCodes)
     val topDomainsBC = sparkSession.sparkContext.broadcast(TopDomains.TopDomains)
     val userDomain = config.userDomain
 
@@ -175,19 +170,20 @@ object DNSSuspiciousConnectsModel {
         .reduceByKey(_ + _)
         .map({ case ((ipDst, word), count) => SpotLDAInput(ipDst, word, count) })
 
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(ipDstWordCounts, config.precisionUtility, sparkSession)
 
-    val SpotLDAOutput(ipToTopicMix, wordToPerTopicProb) = SpotLDAWrapper.runLDA(sparkSession,
-      ipDstWordCounts,
-      config.topicCount,
+    val model: SpotLDAModel = SpotLDAWrapper.run(config.topicCount,
       logger,
       config.ldaPRGSeed,
       config.ldaAlpha,
       config.ldaBeta,
       config.ldaOptimizer,
       config.ldaMaxiterations,
-      config.precisionUtility)
+      spotLDAHelper)
+
+    val results: SpotLDAResult = model.predict(spotLDAHelper)
 
-    new DNSSuspiciousConnectsModel(config.topicCount, ipToTopicMix, wordToPerTopicProb)
+    new DNSSuspiciousConnectsModel(config.topicCount, results.documentToTopicMix, results.wordToTopicMix)
 
   }
 
@@ -205,15 +201,16 @@ object DNSSuspiciousConnectsModel {
                        userDomain: String,
                        url: String): TempFields = {
 
-    val DomainInfo(_, topDomainClass, subdomain, subdomainLength, subdomainEntropy, numPeriods) =
+    val DomainInfo(_, topDomainClass, _, subDomainLength, subDomainEntropy, numPeriods) =
       DomainProcessor.extractDomainInfo(url, topDomainsBC, userDomain)
 
 
     TempFields(topDomainClass = topDomainClass,
-      subdomainLength = subdomainLength,
-      subdomainEntropy = subdomainEntropy,
+      subdomainLength = subDomainLength,
+      subdomainEntropy = subDomainEntropy,
       numPeriods = numPeriods)
   }
 
   case class TempFields(topDomainClass: Int, subdomainLength: Integer, subdomainEntropy: Double, numPeriods: Integer)
+
 }
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala
new file mode 100644
index 0000000..8e771cb
--- /dev/null
+++ b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAHelper.scala
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spot.lda
+
+import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.functions.udf
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spot.lda.SpotLDAWrapperSchema._
+import org.apache.spot.utilities.{FloatPointPrecisionUtility, FloatPointPrecisionUtility64}
+
+import scala.collection.immutable.Map
+
+/**
+  * Apache Spot routines to format Spark LDA input and output for scoring.
+  */
+class SpotLDAHelper(private final val sparkSession: SparkSession,
+                    final val docWordCount: RDD[SpotLDAInput],
+                    final val documentDictionary: DataFrame,
+                    final val wordDictionary: Map[String, Int],
+                    final val precisionUtility: FloatPointPrecisionUtility = FloatPointPrecisionUtility64) extends Serializable {
+
+  /**
+    * Format document word count as RDD[(Long, Vector)] - input data for LDA algorithm
+    *
+    * @return RDD[(Long, Vector)]
+    */
+  val formattedCorpus: RDD[(Long, Vector)] = {
+    import sparkSession.implicits._
+
+    val getWordId = {
+      udf((word: String) => wordDictionary(word))
+    }
+
+    val docWordCountDF = docWordCount
+      .map({ case SpotLDAInput(doc, word, count) => (doc, word, count) })
+      .toDF(DocumentName, WordName, WordNameWordCount)
+
+    // Convert SpotSparkLDAInput into desired format for Spark LDA: (doc, word, count) -> word count per doc, where RDD
+    // is indexed by DocID
+    val wordCountsPerDocDF = docWordCountDF
+      .join(documentDictionary, docWordCountDF(DocumentName) === documentDictionary(DocumentName))
+      .drop(documentDictionary(DocumentName))
+      .withColumn(WordNumber, getWordId(docWordCountDF(WordName)))
+      .drop(WordName)
+
+    val wordCountsPerDoc: RDD[(Long, Iterable[(Int, Double)])]
+    = wordCountsPerDocDF
+      .select(DocumentNumber, WordNumber, WordNameWordCount)
+      .rdd
+      .map({ case Row(documentId: Long, wordId: Int, wordCount: Int) => (documentId.toLong, (wordId, wordCount.toDouble)) })
+      .groupByKey
+
+    // Number of distinct words across the corpus (the vocabulary size), used as the sparse vector size
+    val numUniqueWords = wordDictionary.size
+    val ldaInput: RDD[(Long, Vector)] = wordCountsPerDoc
+      .mapValues(vs => Vectors.sparse(numUniqueWords, vs.toSeq))
+
+    ldaInput
+  }
+
+  /**
+    * Format LDA output topicDistribution for spot-ml scoring
+    *
+    * @param documentDistributions LDA model topicDistributions
+    * @return DataFrame
+    */
+  def formatDocumentDistribution(documentDistributions: RDD[(Long, Vector)]): DataFrame = {
+    import sparkSession.implicits._
+
+    val topicDistributionToArray = udf((topicDistribution: Vector) => topicDistribution.toArray)
+    val documentToTopicDistributionDF = documentDistributions.toDF(DocumentNumber, TopicProbabilityMix)
+
+    val documentToTopicDistributionArray = documentToTopicDistributionDF
+      .join(documentDictionary, documentToTopicDistributionDF(DocumentNumber) === documentDictionary(DocumentNumber))
+      .drop(documentDictionary(DocumentNumber))
+      .drop(documentToTopicDistributionDF(DocumentNumber))
+      .select(DocumentName, TopicProbabilityMix)
+      .withColumn(TopicProbabilityMixArray, topicDistributionToArray(documentToTopicDistributionDF(TopicProbabilityMix)))
+      .selectExpr(s"$DocumentName  AS $DocumentName", s"$TopicProbabilityMixArray AS $TopicProbabilityMix")
+
+    precisionUtility.castColumn(documentToTopicDistributionArray, TopicProbabilityMix)
+  }
+
+  /**
+    * Format LDA output topicMatrix for spot-ml scoring
+    *
+    * @param topicsMatrix LDA model topicMatrix
+    * @return Map[String, Array[Double]]
+    **/
+  def formatTopicDistributions(topicsMatrix: Matrix): Map[String, Array[Double]] = {
+    // Incoming word-topic matrix is in column-major order and the columns are unnormalized
+    val m = topicsMatrix.numRows
+    val n = topicsMatrix.numCols
+    val reverseWordDictionary = wordDictionary.map(_.swap)
+
+    val columnSums: Array[Double] = Range(0, n).map(j => Range(0, m).map(i => topicsMatrix(i, j)).sum).toArray
+
+    val wordProbabilities: Seq[Array[Double]] = topicsMatrix.transpose.toArray.grouped(n).toSeq
+      .map(unNormalizedProbabilities => unNormalizedProbabilities.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))
+
+    wordProbabilities.zipWithIndex
+      .map({ case (topicProbabilities, wordInd) => (reverseWordDictionary(wordInd), topicProbabilities) }).toMap
+  }
+
+}
+
+object SpotLDAHelper {
+
+  /**
+    * Factory method for a new SpotLDAHelper instance.
+    *
+    * @param docWordCount     Document word count (corpus).
+    * @param precisionUtility FloatPointPrecisionUtility implementation (64 or 32 bit).
+    * @param sparkSession     The Spark session.
+    * @return A new SpotLDAHelper.
+    */
+  def apply(docWordCount: RDD[SpotLDAInput],
+            precisionUtility: FloatPointPrecisionUtility,
+            sparkSession: SparkSession): SpotLDAHelper = {
+
+    import sparkSession.implicits._
+
+    val docWordCountCache = docWordCount.cache()
+
+    // Forcing an action to cache results.
+    docWordCountCache.count()
+
+    // Create word Map Word,Index for further usage
+    val wordDictionary: Map[String, Int] = {
+      val words = docWordCountCache
+        .map({ case SpotLDAInput(_, word, _) => word })
+        .distinct
+        .collect
+      words.zipWithIndex.toMap
+    }
+
+    val documentDictionary: DataFrame = docWordCountCache
+      .map({ case SpotLDAInput(doc, _, _) => doc })
+      .distinct
+      .zipWithIndex
+      .toDF(DocumentName, DocumentNumber)
+      .cache
+
+    new SpotLDAHelper(sparkSession, docWordCount, documentDictionary, wordDictionary, precisionUtility)
+  }
+
+}
+
+/**
+  * Spot LDA input case class
+  *
+  * @param doc   Document name.
+  * @param word  Word.
+  * @param count Times the word appears for the document.
+  */
+case class SpotLDAInput(doc: String, word: String, count: Int) extends Serializable

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala
new file mode 100644
index 0000000..669bb69
--- /dev/null
+++ b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spot.lda
+
+import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDAModel, LocalLDAModel}
+import org.apache.spark.sql.SparkSession
+
+/**
+  * Spot LDAModel.
+  */
+sealed trait SpotLDAModel {
+
+  /**
+    * Save the model to HDFS
+    *
+    * @param sparkSession The Spark session.
+    * @param location     HDFS path where the model will be saved.
+    */
+  def save(sparkSession: SparkSession, location: String): Unit
+
+  /**
+    * Predict topicDistributions and get topicsMatrix along with results formatted for Apache Spot scoring
+    *
+    * @param helper SpotLDAHelper providing the corpus to score.
+    * @return SpotLDAResult
+    */
+  def predict(helper: SpotLDAHelper): SpotLDAResult
+}
+
+/**
+  * Spark LocalLDAModel wrapper.
+  *
+  * @param ldaModel Spark LDA Model
+  */
+class SpotLocalLDAModel(final val ldaModel: LDAModel) extends SpotLDAModel {
+
+  /**
+    * Save LocalLDAModel on HDFS location
+    *
+    * @param sparkSession the Spark session
+    * @param location     the HDFS location
+    */
+  override def save(sparkSession: SparkSession, location: String): Unit = {
+    val sparkContext = sparkSession.sparkContext
+
+    ldaModel.save(sparkContext, location)
+  }
+
+  /**
+    * Predict topicDistributions and get topicsMatrix along with results formatted for Apache Spot scoring.
+    * SpotLocalLDAModel.predict will use the corpus from spotLDAHelper, which can be a new set of documents or the same
+    * documents used for training.
+    *
+    * @param spotLDAHelper Spot LDA Helper object, can be the same used for training or a new instance with new
+    *                      documents.
+    * @return SpotLDAResult
+    */
+  override def predict(spotLDAHelper: SpotLDAHelper): SpotLDAResult = {
+
+    val localLDAModel: LocalLDAModel = ldaModel.asInstanceOf[LocalLDAModel]
+
+    val topicDistributions = localLDAModel.topicDistributions(spotLDAHelper.formattedCorpus)
+    val topicMix = localLDAModel.topicsMatrix
+
+    SpotLDAResult(spotLDAHelper, topicDistributions, topicMix)
+  }
+}
+
+/** Spark DistributedLDAModel wrapper.
+  * Ideally, this model should be used only for batch processing.
+  *
+  * @param ldaModel Spark LDA Model
+  */
+class SpotDistributedLDAModel(final val ldaModel: LDAModel) extends
+  SpotLDAModel {
+
+  /**
+    * Save DistributedLDAModel on HDFS location
+    *
+    * @param sparkSession the Spark session
+    * @param location     the HDFS location
+    */
+  override def save(sparkSession: SparkSession, location: String): Unit = {
+    val sparkContext = sparkSession.sparkContext
+
+    ldaModel.save(sparkContext, location)
+  }
+
+  /**
+    * Predict topicDistributions and get topicsMatrix along with results formatted for Apache Spot scoring.
+    * SpotDistributedLDAModel.predict will use the same documents that were used for training; it cannot predict on new
+    * documents. When passing spotLDAHelper, make sure it is the same object that was used for training.
+    *
+    * @param spotLDAHelper Spot LDA Helper object used for training
+    * @return SpotLDAResult
+    */
+  override def predict(spotLDAHelper: SpotLDAHelper): SpotLDAResult = {
+
+    val distributedLDAModel: DistributedLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
+
+    val topicDistributions = distributedLDAModel.topicDistributions
+    val topicsMatrix = distributedLDAModel.topicsMatrix
+
+    SpotLDAResult(spotLDAHelper, topicDistributions, topicsMatrix)
+  }
+}
+
+object SpotLDAModel {
+
+  /**
+    * Factory method: depending on whether ldaModel is a DistributedLDAModel or a LocalLDAModel, wraps it in the
+    * corresponding SpotLDAModel implementation.
+    *
+    * @param ldaModel      Spark LDA model to wrap.
+    * @param spotLDAHelper Spot LDA helper (not used by this factory).
+    * @return SpotLDAModel wrapping ldaModel.
+    */
+  def apply(ldaModel: LDAModel, spotLDAHelper: SpotLDAHelper = null): SpotLDAModel = {
+
+    ldaModel match {
+      case model: DistributedLDAModel => new SpotDistributedLDAModel(model)
+      case model: LocalLDAModel => new SpotLocalLDAModel(model)
+    }
+  }
+}

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAResult.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAResult.scala b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAResult.scala
new file mode 100644
index 0000000..a91cee2
--- /dev/null
+++ b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAResult.scala
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spot.lda
+
+import org.apache.spark.mllib.linalg.{Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.DataFrame
+
+/**
+  * LDA results formatted for Apache Spot scoring.
+  *
+  */
+class SpotLDAResult(private final val helper: SpotLDAHelper,
+                    final val topicDistributions: RDD[(Long, Vector)],
+                    final val documentToTopicMix: DataFrame,
+                    final val topicsMix: Matrix,
+                    final val wordToTopicMix: Map[String, Array[Double]])
+
+object SpotLDAResult {
+
+  def apply(helper: SpotLDAHelper, topicDistributions: RDD[(Long, Vector)], topicsMix: Matrix): SpotLDAResult = {
+
+    val documentToTopicMix: DataFrame = helper.formatDocumentDistribution(topicDistributions)
+    val wordToTopicMix: Map[String, Array[Double]] = helper.formatTopicDistributions(topicsMix)
+
+    new SpotLDAResult(helper, topicDistributions, documentToTopicMix, topicsMix, wordToTopicMix)
+  }
+}

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala
index 122e8ed..7a8b67e 100644
--- a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala
+++ b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala
@@ -18,19 +18,15 @@
 package org.apache.spot.lda
 
 import org.apache.log4j.Logger
-import org.apache.spark.mllib.clustering._
-import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.clustering.{LDAModel, _}
+import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.functions._
-import org.apache.spark.sql.{DataFrame, Row, SparkSession}
-import org.apache.spot.lda.SpotLDAWrapperSchema._
-import org.apache.spot.utilities.FloatPointPrecisionUtility
-
-import scala.collection.immutable.Map
+import org.apache.spark.sql.SparkSession
 
 /**
   * Spark LDA implementation
-  * Contains routines for LDA using Scala Spark implementation from mllib
+  * Contains routines for LDA using Scala Spark implementation from org.apache.spark.mllib.clustering
   * 1. Creates list of unique documents, words and model based on those two
   * 2. Processes the model using Spark LDA
   * 3. Reads Spark LDA results: Topic distributions per document (docTopicDist) and word distributions per topic (wordTopicMat)
@@ -42,8 +38,6 @@ object SpotLDAWrapper {
   /**
     * Runs Spark LDA and returns a new model.
     *
-    * @param sparkSession       the SparkSession
-    * @param docWordCount       RDD with document list and the word count for each document (corpus)
     * @param topicCount         number of topics to find
     * @param logger             application logger
     * @param ldaSeed            LDA seed
@@ -51,51 +45,20 @@ object SpotLDAWrapper {
     * @param ldaBeta            topic concentration
     * @param ldaOptimizerOption LDA optimizer, em or online
     * @param maxIterations      maximum number of iterations for the optimizer
-    * @param precisionUtility   FloatPointPrecisionUtility implementation based on user configuration (64 or 32 bit)
     * @return
     */
-  def runLDA(sparkSession: SparkSession,
-             docWordCount: RDD[SpotLDAInput],
-             topicCount: Int,
-             logger: Logger,
-             ldaSeed: Option[Long],
-             ldaAlpha: Double,
-             ldaBeta: Double,
-             ldaOptimizerOption: String,
-             maxIterations: Int,
-             precisionUtility: FloatPointPrecisionUtility): SpotLDAOutput = {
-
-    import sparkSession.implicits._
+  def run(topicCount: Int,
+          logger: Logger,
+          ldaSeed: Option[Long],
+          ldaAlpha: Double,
+          ldaBeta: Double,
+          ldaOptimizerOption: String,
+          maxIterations: Int,
+          helper: SpotLDAHelper): SpotLDAModel = {
 
-    val docWordCountCache = docWordCount.cache()
-
-    // Forcing an action to cache results.
-    docWordCountCache.count()
-
-    // Create word Map Word,Index for further usage
-    val wordDictionary: Map[String, Int] = {
-      val words = docWordCountCache
-        .map({ case SpotLDAInput(doc, word, count) => word })
-        .distinct
-        .collect
-      words.zipWithIndex.toMap
-    }
-
-    val documentDictionary: DataFrame = docWordCountCache
-      .map({ case SpotLDAInput(doc, word, count) => doc })
-      .distinct
-      .zipWithIndex
-      .toDF(DocumentName, DocumentNumber)
-      .cache
 
     // Structure corpus so that the index is the docID, values are the vectors of word occurrences in that doc
-    val ldaCorpus: RDD[(Long, Vector)] =
-      formatSparkLDAInput(docWordCountCache,
-        documentDictionary,
-        wordDictionary,
-        sparkSession)
-
-    docWordCountCache.unpersist()
+    val ldaCorpus: RDD[(Long, Vector)] = helper.formattedCorpus
 
     // Instantiate optimizer based on input
     val ldaOptimizer = ldaOptimizerOption match {
@@ -121,162 +84,35 @@ object SpotLDAWrapper {
         .setBeta(ldaBeta)
         .setOptimizer(ldaOptimizer)
 
-    // If caller does not provide seed to lda, ie. ldaSeed is empty, lda is seeded automatically set to hash value of class name
-
+    // If caller does not provide a seed to LDA, i.e. ldaSeed is empty,
+    // LDA is automatically seeded with the hash value of the class name
     if (ldaSeed.nonEmpty) {
       lda.setSeed(ldaSeed.get)
     }
 
-    val (wordTopicMat, docTopicDist) = ldaOptimizer match {
-      case _: EMLDAOptimizer => {
-        val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
-
-        // Get word topic mix, from Spark documentation:
-        // Inferred topics, where each topic is represented by a distribution over terms.
-        // This is a matrix of size vocabSize x k, where each column is a topic.
-        // No guarantees are given about the ordering of the topics.
-        val wordTopicMat: Matrix = ldaModel.topicsMatrix
-
-        // Topic distribution: for each document, return distribution (vector) over topics for that docs where entry
-        // i is the fraction of the document which belongs to topic i
-        val docTopicDist: RDD[(Long, Vector)] = ldaModel.topicDistributions
-
-        (wordTopicMat, docTopicDist)
-
-      }
-
-      case _: OnlineLDAOptimizer => {
-        val ldaModel = lda.run(ldaCorpus).asInstanceOf[LocalLDAModel]
-
-        // Get word topic mix, from Spark documentation:
-        // Inferred topics, where each topic is represented by a distribution over terms.
-        // This is a matrix of size vocabSize x k, where each column is a topic.
-        // No guarantees are given about the ordering of the topics.
-        val wordTopicMat: Matrix = ldaModel.topicsMatrix
-
-        // Topic distribution: for each document, return distribution (vector) over topics for that docs where entry
-        // i is the fraction of the document which belongs to topic i
-        val docTopicDist: RDD[(Long, Vector)] = ldaModel.topicDistributions(ldaCorpus)
-
-        (wordTopicMat, docTopicDist)
-
-      }
-
-    }
-
-    // Create doc results from vector: convert docID back to string, convert vector of probabilities to array
-    val docToTopicMixDF =
-      formatSparkLDADocTopicOutput(docTopicDist, documentDictionary, sparkSession, precisionUtility)
+    val model: LDAModel = lda.run(ldaCorpus)
 
-    documentDictionary.unpersist()
-
-    // Create word results from matrix: convert matrix to sequence, wordIDs back to strings, sequence of
-    // probabilities to array
-    val revWordMap: Map[Int, String] = wordDictionary.map(_.swap)
-
-    val wordResults = formatSparkLDAWordOutput(wordTopicMat, revWordMap)
-
-    // Create output object
-    SpotLDAOutput(docToTopicMixDF, wordResults)
+    SpotLDAModel(model)
   }
 
   /**
-    * Formats input data for LDA algorithm
+    * Load an existing model from HDFS location.
     *
-    * @param docWordCount       RDD with document list and the word count for each document (corpus)
-    * @param documentDictionary DataFrame with a distinct list of documents and its id
-    * @param wordDictionary     immutable Map with distinct list of word and its id
-    * @param sparkSession       the SparkSession
-    * @return
+    * @param sparkSession       the Spark session.
+    * @param location           the HDFS location for the model.
+    * @param ldaOptimizerOption LDA optimizer, em or online.
+    * @return SpotLDAModel
     */
-  def formatSparkLDAInput(docWordCount: RDD[SpotLDAInput],
-                          documentDictionary: DataFrame,
-                          wordDictionary: Map[String, Int],
-                          sparkSession: SparkSession): RDD[(Long, Vector)] = {
-
-    import sparkSession.implicits._
+  def load(sparkSession: SparkSession, location: String, ldaOptimizerOption: String): SpotLDAModel = {
+    val sparkContext: SparkContext = sparkSession.sparkContext
 
-    val getWordId = {
-      udf((word: String) => (wordDictionary(word)))
+    val model = ldaOptimizerOption match {
+      case "em" => DistributedLDAModel.load(sparkContext, location)
+      case "online" => LocalLDAModel.load(sparkContext, location)
+      case _ => throw new IllegalArgumentException(
+        s"Invalid LDA optimizer $ldaOptimizerOption")
     }
 
-    val docWordCountDF = docWordCount
-      .map({ case SpotLDAInput(doc, word, count) => (doc, word, count) })
-      .toDF(DocumentName, WordName, WordNameWordCount)
-
-    // Convert SpotSparkLDAInput into desired format for Spark LDA: (doc, word, count) -> word count per doc, where RDD
-    // is indexed by DocID
-    val wordCountsPerDocDF = docWordCountDF
-      .join(documentDictionary, docWordCountDF(DocumentName) === documentDictionary(DocumentName))
-      .drop(documentDictionary(DocumentName))
-      .withColumn(WordNumber, getWordId(docWordCountDF(WordName)))
-      .drop(WordName)
-
-    val wordCountsPerDoc: RDD[(Long, Iterable[(Int, Double)])]
-    = wordCountsPerDocDF
-      .select(DocumentNumber, WordNumber, WordNameWordCount)
-      .rdd
-      .map({ case Row(documentId: Long, wordId: Int, wordCount: Int) => (documentId.toLong, (wordId, wordCount.toDouble)) })
-      .groupByKey
-
-    // Sum of distinct words in each doc (words will be repeated between different docs), used for sparse vec size
-    val numUniqueWords = wordDictionary.size
-    val ldaInput: RDD[(Long, Vector)] = wordCountsPerDoc
-      .mapValues({ case vs => Vectors.sparse(numUniqueWords, vs.toSeq) })
-
-    ldaInput
-  }
-
-  /**
-    * Format LDA output topicMatrix for spot-ml scoring
-    *
-    * @param wordTopMat LDA model topicMatrix
-    * @param wordMap    immutable Map with distinct list of word and its id
-    * @return
-    */
-  def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): scala.Predef.Map[String, Array[Double]] = {
-
-    // incoming word top matrix is in column-major order and the columns are unnormalized
-    val m = wordTopMat.numRows
-    val n = wordTopMat.numCols
-    val columnSums: Array[Double] = Range(0, n).map(j => (Range(0, m).map(i => wordTopMat(i, j)).sum)).toArray
-
-    val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
-      .map(unnormProbs => unnormProbs.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))
-
-    wordProbs.zipWithIndex.map({ case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }).toMap
+    SpotLDAModel(model)
   }
-
-  /**
-    * Format LDA output topicDistribution for spot-ml scoring
-    *
-    * @param docTopDist         LDA model topicDistribution
-    * @param documentDictionary DataFrame with a distinct list of documents and its id
-    * @param sparkSession       the SparkSession
-    * @param precisionUtility   FloatPointPrecisionUtility implementation based on user configuration (64 or 32 bit)
-    * @return
-    */
-  def formatSparkLDADocTopicOutput(docTopDist: RDD[(Long, Vector)], documentDictionary: DataFrame, sparkSession: SparkSession,
-                                   precisionUtility: FloatPointPrecisionUtility):
-  DataFrame = {
-    import sparkSession.implicits._
-
-    val topicDistributionToArray = udf((topicDistribution: Vector) => topicDistribution.toArray)
-    val documentToTopicDistributionDF = docTopDist.toDF(DocumentNumber, TopicProbabilityMix)
-
-    val documentToTopicDistributionArray = documentToTopicDistributionDF
-      .join(documentDictionary, documentToTopicDistributionDF(DocumentNumber) === documentDictionary(DocumentNumber))
-      .drop(documentDictionary(DocumentNumber))
-      .drop(documentToTopicDistributionDF(DocumentNumber))
-      .select(DocumentName, TopicProbabilityMix)
-      .withColumn(TopicProbabilityMixArray, topicDistributionToArray(documentToTopicDistributionDF(TopicProbabilityMix)))
-      .selectExpr(s"$DocumentName  AS $DocumentName", s"$TopicProbabilityMixArray AS $TopicProbabilityMix")
-
-    precisionUtility.castColumn(documentToTopicDistributionArray, TopicProbabilityMix)
-  }
-
-  case class SpotLDAInput(doc: String, word: String, count: Int) extends Serializable
-
-  case class SpotLDAOutput(docToTopicMix: DataFrame, wordResults: Map[String, Array[Double]])
-
 }
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala b/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala
index 6be11e1..4e09616 100644
--- a/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala
+++ b/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala
@@ -20,11 +20,10 @@ package org.apache.spot.netflow.model
 import org.apache.log4j.Logger
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.StructType
-import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
 import org.apache.spot.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
-import org.apache.spot.lda.SpotLDAWrapper
-import org.apache.spot.lda.SpotLDAWrapper.{SpotLDAInput, SpotLDAOutput}
 import org.apache.spot.lda.SpotLDAWrapperSchema._
+import org.apache.spot.lda.{SpotLDAHelper, SpotLDAInput, SpotLDAResult, SpotLDAWrapper}
 import org.apache.spot.netflow.FlowSchema._
 import org.apache.spot.netflow.FlowWordCreator
 import org.apache.spot.utilities.FloatPointPrecisionUtility
@@ -36,7 +35,7 @@ import org.apache.spot.utilities.data.validation.InvalidDataHandler
   * The model uses a topic-modelling approach that:
   * 1. Simplifies netflow records into words, one word at the source IP and another (possibly different) at the
   * destination IP.
-  * 2. The netflow words about each IP are treated as collections of thes words.
+  * 2. The netflow words about each IP are treated as collections of these words.
   * 3. A topic modelling approach is used to infer a collection of "topics" that represent common profiles
   * of network traffic. These "topics" are probability distributions on words.
   * 4. Each IP has a mix of topics corresponding to its behavior.
@@ -112,7 +111,7 @@ class FlowSuspiciousConnectsModel(topicCount: Int,
 }
 
 /**
-  * Contains dataframe schema information as well as the train-from-dataframe routine
+  * Contains DataFrame schema information as well as the train-from-dataframe routine
   * (which is a kind of factory routine) for [[FlowSuspiciousConnectsModel]] instances.
   *
   */
@@ -127,7 +126,7 @@ object FlowSuspiciousConnectsModel {
     IbytField,
     IpktField))
 
-  val ModelColumns = ModelSchema.fieldNames.toList.map(col)
+  val ModelColumns: List[Column] = ModelSchema.fieldNames.toList.map(col)
 
 
   def trainModel(sparkSession: SparkSession,
@@ -146,13 +145,12 @@ object FlowSuspiciousConnectsModel {
       config.duplicationFactor))
 
 
+    import sparkSession.implicits._
     // simplify netflow log entries into "words"
 
     val dataWithWords = totalRecords.withColumn(SourceWord, FlowWordCreator.srcWordUDF(ModelColumns: _*))
       .withColumn(DestinationWord, FlowWordCreator.dstWordUDF(ModelColumns: _*))
 
-    import sparkSession.implicits._
-
     // Aggregate per-word counts at each IP
     val srcWordCounts = dataWithWords
       .filter(dataWithWords(SourceWord).notEqual(InvalidDataHandler.WordError))
@@ -173,20 +171,19 @@ object FlowSuspiciousConnectsModel {
         .reduceByKey(_ + _)
         .map({ case ((ip, word), count) => SpotLDAInput(ip, word, count) })
 
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(ipWordCounts, config.precisionUtility, sparkSession)
 
-    val SpotLDAOutput(ipToTopicMix, wordToPerTopicProb) = SpotLDAWrapper.runLDA(sparkSession,
-      ipWordCounts,
-      config.topicCount,
+    val model = SpotLDAWrapper.run(config.topicCount,
       logger,
       config.ldaPRGSeed,
       config.ldaAlpha,
       config.ldaBeta,
       config.ldaOptimizer,
       config.ldaMaxiterations,
-      config.precisionUtility)
+      spotLDAHelper)
+
+    val results: SpotLDAResult = model.predict(spotLDAHelper)
 
-    new FlowSuspiciousConnectsModel(config.topicCount,
-      ipToTopicMix,
-      wordToPerTopicProb)
+    new FlowSuspiciousConnectsModel(config.topicCount, results.documentToTopicMix, results.wordToTopicMix)
   }
 }
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsModel.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsModel.scala b/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsModel.scala
index 3ef60af..7332fe4 100644
--- a/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsModel.scala
+++ b/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsModel.scala
@@ -25,9 +25,8 @@ import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.{DataFrame, Row, SparkSession}
 import org.apache.spot.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
 import org.apache.spot.SuspiciousConnectsScoreFunction
-import org.apache.spot.lda.SpotLDAWrapper
-import org.apache.spot.lda.SpotLDAWrapper.{SpotLDAInput, SpotLDAOutput}
 import org.apache.spot.lda.SpotLDAWrapperSchema._
+import org.apache.spot.lda.{SpotLDAHelper, SpotLDAInput, SpotLDAResult, SpotLDAWrapper}
 import org.apache.spot.proxy.ProxySchema._
 import org.apache.spot.utilities._
 import org.apache.spot.utilities.data.validation.InvalidDataHandler
@@ -92,7 +91,7 @@ class ProxySuspiciousConnectsModel(topicCount: Int,
   */
 object ProxySuspiciousConnectsModel {
 
-  // These buckets are optimized to datasets used for training. Last bucket is of infinite size to ensure fit.
+  // These buckets are optimized to data sets used for training. Last bucket is of infinite size to ensure fit.
   // The maximum value of entropy is given by log k where k is the number of distinct categories.
   // Given that the alphabet and number of characters is finite the maximum value for entropy is upper bounded.
   // Bucket number and size can be changed to provide less/more granularity
@@ -119,8 +118,8 @@ object ProxySuspiciousConnectsModel {
     * for clustering in the topic model.
     *
     * @param sparkSession Spark Session
-    * @param logger       Logge object.
-    * @param config       SuspiciousConnetsArgumnetParser.Config object containg CLI arguments.
+    * @param logger       Logger object.
+    * @param config       SuspiciousConnectsArgumentParser.Config object containing CLI arguments.
     * @param inputRecords Dataframe for training data, with columns Host, Time, ReqMethod, FullURI, ResponseContentType,
     *                     UserAgent, RespCode (as defined in ProxySchema object).
     * @return ProxySuspiciousConnectsModel
@@ -130,7 +129,7 @@ object ProxySuspiciousConnectsModel {
                  config: SuspiciousConnectsConfig,
                  inputRecords: DataFrame): ProxySuspiciousConnectsModel = {
 
-    logger.info("training new proxy suspcious connects model")
+    logger.info("training new proxy suspicious connects model")
 
 
     val selectedRecords =
@@ -145,24 +144,24 @@ object ProxySuspiciousConnectsModel {
         .reduceByKey(_ + _).collect()
         .toMap
 
-    val agentToCountBC = sparkSession.sparkContext.broadcast(agentToCount)
-
     val docWordCount: RDD[SpotLDAInput] =
       getIPWordCounts(sparkSession, logger, selectedRecords, config.feedbackFile, config.duplicationFactor,
         agentToCount)
 
-    val SpotLDAOutput(ipToTopicMixDF, wordResults) = SpotLDAWrapper.runLDA(sparkSession,
-      docWordCount,
-      config.topicCount,
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(docWordCount, config.precisionUtility, sparkSession)
+
+    val model = SpotLDAWrapper.run(config.topicCount,
       logger,
       config.ldaPRGSeed,
       config.ldaAlpha,
       config.ldaBeta,
       config.ldaOptimizer,
       config.ldaMaxiterations,
-      config.precisionUtility)
+      spotLDAHelper)
+
+    val results: SpotLDAResult = model.predict(spotLDAHelper)
 
-    new ProxySuspiciousConnectsModel(config.topicCount, ipToTopicMixDF, wordResults)
+    new ProxySuspiciousConnectsModel(config.topicCount, results.documentToTopicMix, results.wordToTopicMix)
 
   }
 

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/main/scala/org/apache/spot/utilities/TopDomains.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/utilities/TopDomains.scala b/spot-ml/src/main/scala/org/apache/spot/utilities/TopDomains.scala
index 0183027..083cfe7 100644
--- a/spot-ml/src/main/scala/org/apache/spot/utilities/TopDomains.scala
+++ b/spot-ml/src/main/scala/org/apache/spot/utilities/TopDomains.scala
@@ -28,7 +28,6 @@ object TopDomains {
 
   val TopDomains: Set[String] = Source.fromFile(alexaTop1MPath).getLines.map(line => {
     val parts = line.split(",")
-    val l = parts.length
     parts(1).split('.')(0)
   }).toSet
 }

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAHelperTest.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAHelperTest.scala b/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAHelperTest.scala
new file mode 100644
index 0000000..93828b2
--- /dev/null
+++ b/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAHelperTest.scala
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spot.lda
+
+import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spot.lda.SpotLDAWrapperSchema.TopicProbabilityMix
+import org.apache.spot.testutils.TestingSparkContextFlatSpec
+import org.apache.spot.utilities.{FloatPointPrecisionUtility32, FloatPointPrecisionUtility64}
+import org.scalatest.Matchers
+
+/**
+  * Created by rabarona on 7/17/17.
+  */
+class SpotLDAHelperTest extends TestingSparkContextFlatSpec with Matchers {
+
+  "formatSparkLDAInput" should "return input in RDD[(Long, Vector)] (collected as Array for testing) format. The index " +
+    "is the docID, values are the vectors of word occurrences in that doc" in {
+
+
+    val documentWordData = sparkSession.sparkContext.parallelize(Seq(SpotLDAInput("192.168.1.1", "333333_7.0_0.0_1.0", 8),
+      SpotLDAInput("10.10.98.123", "1111111_6.0_3.0_5.0", 4),
+      SpotLDAInput("66.23.45.11", "-1_43_7.0_2.0_6.0", 2),
+      SpotLDAInput("192.168.1.1", "-1_80_6.0_1.0_1.0", 5)))
+
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(documentWordData, FloatPointPrecisionUtility64, sparkSession)
+
+    val sparkLDAInput: RDD[(Long, Vector)] = spotLDAHelper.formattedCorpus
+    val sparkLDAInArr: Array[(Long, Vector)] = sparkLDAInput.collect()
+
+    sparkLDAInArr shouldBe Array((0, Vectors.sparse(4, Array(0, 3), Array(5.0, 8.0))), (2, Vectors.sparse(4, Array
+    (1), Array(2.0))), (1, Vectors.sparse(4, Array(2), Array(4.0))))
+  }
+
+  "formatSparkLDADocTopicOutput" should "return RDD[(String,Array(Double))] after converting doc results from vector " +
+    "using PrecisionUtilityDouble: convert docID back to string, convert vector of probabilities to array" in {
+
+    val documentWordData = sparkSession.sparkContext.parallelize(Seq(SpotLDAInput("192.168.1.1", "333333_7.0_0.0_1.0", 8),
+      SpotLDAInput("10.10.98.123", "1111111_6.0_3.0_5.0", 4),
+      SpotLDAInput("66.23.45.11", "-1_43_7.0_2.0_6.0", 2),
+      SpotLDAInput("192.168.1.1", "-1_80_6.0_1.0_1.0", 5)))
+
+    val docTopicDist: RDD[(Long, Vector)] = sparkSession.sparkContext.parallelize(
+      Array((0.toLong, Vectors.dense(0.15, 0.3, 0.5, 0.05)),
+        (1.toLong, Vectors.dense(0.25, 0.15, 0.4, 0.2)),
+        (2.toLong, Vectors.dense(0.4, 0.1, 0.3, 0.2))))
+
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(documentWordData, FloatPointPrecisionUtility64, sparkSession)
+
+    val sparkDocRes: DataFrame = spotLDAHelper.formatDocumentDistribution(docTopicDist)
+
+    import testImplicits._
+    val documents = sparkDocRes.map({ case Row(documentName: String, docProbabilities: Seq[Double]) => (documentName,
+      docProbabilities)
+    }).collect
+
+    val documentProbabilities = sparkDocRes.select(TopicProbabilityMix).first.toSeq(0).asInstanceOf[Seq[Double]]
+
+    documents should contain("192.168.1.1", Seq(0.15, 0.3, 0.5, 0.05))
+    documents should contain("10.10.98.123", Seq(0.25, 0.15, 0.4, 0.2))
+    documents should contain("66.23.45.11", Seq(0.4, 0.1, 0.3, 0.2))
+
+    documentProbabilities(0) shouldBe a[java.lang.Double]
+
+  }
+
+  it should "return RDD[(String,Array(Float))] after converting doc results from vector " +
+    "using PrecisionUtilityFloat: convert docID back to string, convert vector of probabilities to array" in {
+
+    val documentWordData = sparkSession.sparkContext.parallelize(Seq(SpotLDAInput("192.168.1.1", "333333_7.0_0.0_1.0", 8),
+      SpotLDAInput("10.10.98.123", "1111111_6.0_3.0_5.0", 4),
+      SpotLDAInput("66.23.45.11", "-1_43_7.0_2.0_6.0", 2),
+      SpotLDAInput("192.168.1.1", "-1_80_6.0_1.0_1.0", 5)))
+
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(documentWordData, FloatPointPrecisionUtility32, sparkSession)
+
+    val docTopicDist: RDD[(Long, Vector)] = sparkSession.sparkContext.parallelize(
+      Array((0.toLong, Vectors.dense(0.15, 0.3, 0.5, 0.05)),
+        (1.toLong, Vectors.dense(0.25, 0.15, 0.4, 0.2)),
+        (2.toLong, Vectors.dense(0.4, 0.1, 0.3, 0.2))))
+
+    val sparkDocRes: DataFrame = spotLDAHelper.formatDocumentDistribution(docTopicDist)
+
+    import testImplicits._
+    val documents = sparkDocRes.map({ case Row(documentName: String, docProbabilities: Seq[Float]) => (documentName,
+      docProbabilities)
+    }).collect
+
+    val documentProbabilities = sparkDocRes.select(TopicProbabilityMix).first.toSeq(0).asInstanceOf[Seq[Float]]
+
+    documents should contain("192.168.1.1", Seq(0.15f, 0.3f, 0.5f, 0.05f))
+    documents should contain("10.10.98.123", Seq(0.25f, 0.15f, 0.4f, 0.2f))
+    documents should contain("66.23.45.11", Seq(0.4f, 0.1f, 0.3f, 0.2f))
+
+    documentProbabilities(0) shouldBe a[java.lang.Float]
+  }
+
+  "formatSparkLDAWordOutput" should "return Map[Int,String] after converting word matrix to sequence, wordIDs back " +
+    "to strings, and sequence of probabilities to array" in {
+
+    val testMat = Matrices.dense(4, 4, Array(0.5, 0.2, 0.05, 0.25, 0.25, 0.1, 0.15, 0.5, 0.1, 0.4, 0.25, 0.25, 0.7, 0.2, 0.02, 0.08))
+
+    val documentWordData = sparkSession.sparkContext.parallelize(Seq(SpotLDAInput("192.168.1.1", "23.0_7.0_7.0_4.0", 8),
+      SpotLDAInput("10.10.98.123", "80.0_7.0_7.0_4.0", 4),
+      SpotLDAInput("66.23.45.11", "333333.0_7.0_7.0_4.0", 2),
+      SpotLDAInput("192.168.1.2", "-1_23.0_7.0_7.0_4.0", 5)))
+
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(documentWordData, FloatPointPrecisionUtility64, sparkSession)
+
+    val sparkWordRes = spotLDAHelper.formatTopicDistributions(testMat)
+
+    sparkWordRes should contain key ("23.0_7.0_7.0_4.0")
+    sparkWordRes should contain key ("80.0_7.0_7.0_4.0")
+    sparkWordRes should contain key ("333333.0_7.0_7.0_4.0")
+    sparkWordRes should contain key ("-1_23.0_7.0_7.0_4.0")
+  }
+}

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/45c03ab6/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala b/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
index 5c40068..7007ba1 100644
--- a/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
+++ b/spot-ml/src/test/scala/org/apache/spot/lda/SpotLDAWrapperTest.scala
@@ -18,25 +18,18 @@
 package org.apache.spot.lda
 
 import org.apache.log4j.{Level, LogManager}
-import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
-import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.types.StructType
-import org.apache.spark.sql.{DataFrame, Row}
-import org.apache.spot.lda.SpotLDAWrapper._
 import org.apache.spot.lda.SpotLDAWrapperSchema._
 import org.apache.spot.testutils.TestingSparkContextFlatSpec
 import org.apache.spot.utilities.{FloatPointPrecisionUtility32, FloatPointPrecisionUtility64}
 import org.scalatest.Matchers
 
-import scala.collection.immutable.Map
-
 class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
 
   "SparkLDA" should "handle an extremely unbalanced two word doc with EM optimizer" in {
     val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
     logger.setLevel(Level.WARN)
 
-    val ldaAlpha =  1.02
+    val ldaAlpha = 1.02
     val ldaBeta = 1.001
     val ldaMaxIterations = 20
 
@@ -46,16 +39,20 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val dogWorld = SpotLDAInput("pets", "dog", 999)
 
     val data = sparkSession.sparkContext.parallelize(Seq(catFancy, dogWorld))
-    val out = SpotLDAWrapper.runLDA(sparkSession, data, 2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
-      optimizer ,ldaMaxIterations, FloatPointPrecisionUtility64)
 
-    val topicMixDF = out.docToTopicMix
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(data, FloatPointPrecisionUtility64, sparkSession)
+    val model: SpotLDAModel = SpotLDAWrapper.run(2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
+      optimizer, ldaMaxIterations, spotLDAHelper)
+
+    val results = model.predict(spotLDAHelper)
+
+    val topicMixDF = results.documentToTopicMix
 
     val topicMix =
-      topicMixDF.filter(topicMixDF(DocumentName) === "pets").select(TopicProbabilityMix).first().toSeq(0)
+      topicMixDF.filter(topicMixDF(DocumentName) === "pets").select(TopicProbabilityMix).first().toSeq.head
         .asInstanceOf[Seq[Double]].toArray
-    val catTopics = out.wordResults("cat")
-    val dogTopics = out.wordResults("dog")
+    val catTopics = results.wordToTopicMix("cat")
+    val dogTopics = results.wordToTopicMix("dog")
 
     Math.abs(topicMix(0) * catTopics(0) + topicMix(1) * catTopics(1)) should be < 0.01
     Math.abs(0.999 - (topicMix(0) * dogTopics(0) + topicMix(1) * dogTopics(1))) should be < 0.01
@@ -65,9 +62,9 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
     logger.setLevel(Level.WARN)
 
-    val ldaAlpha =  1.2
-    val ldaBeta = 1.001
-    val ldaMaxIterations = 20
+    val ldaAlpha = 1.002
+    val ldaBeta = 1.0001
+    val ldaMaxIterations = 100
 
     val optimizer = "em"
 
@@ -75,20 +72,24 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val dogWorld = SpotLDAInput("dog world", "dog", 1)
 
     val data = sparkSession.sparkContext.parallelize(Seq(catFancy, dogWorld))
-    val out = SpotLDAWrapper.runLDA(sparkSession, data, 2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
-      optimizer ,ldaMaxIterations, FloatPointPrecisionUtility64)
 
-    val topicMixDF = out.docToTopicMix
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(data, FloatPointPrecisionUtility64, sparkSession)
+    val model: SpotLDAModel = SpotLDAWrapper.run(2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
+      optimizer, ldaMaxIterations, spotLDAHelper)
+
+    val results = model.predict(spotLDAHelper)
+
+    val topicMixDF = results.documentToTopicMix
     val dogTopicMix: Array[Double] =
       topicMixDF.filter(topicMixDF(DocumentName) === "dog world").select(TopicProbabilityMix).first()
-        .toSeq(0).asInstanceOf[Seq[Double]].toArray
+        .toSeq.head.asInstanceOf[Seq[Double]].toArray
 
     val catTopicMix: Array[Double] =
       topicMixDF.filter(topicMixDF(DocumentName) === "cat fancy").select(TopicProbabilityMix).first()
-        .toSeq(0).asInstanceOf[Seq[Double]].toArray
+        .toSeq.head.asInstanceOf[Seq[Double]].toArray
 
-    val catTopics = out.wordResults("cat")
-    val dogTopics = out.wordResults("dog")
+    val catTopics = results.wordToTopicMix("cat")
+    val dogTopics = results.wordToTopicMix("dog")
 
     Math.abs(1 - (catTopicMix(0) * catTopics(0) + catTopicMix(1) * catTopics(1))) should be < 0.01
     Math.abs(1 - (dogTopicMix(0) * dogTopics(0) + dogTopicMix(1) * dogTopics(1))) should be < 0.01
@@ -98,7 +99,7 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
     logger.setLevel(Level.WARN)
 
-    val ldaAlpha =  0.0009
+    val ldaAlpha = 0.0009
     val ldaBeta = 0.00001
     val ldaMaxIterations = 400
 
@@ -108,16 +109,20 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val dogWorld = SpotLDAInput("pets", "dog", 999)
 
     val data = sparkSession.sparkContext.parallelize(Seq(catFancy, dogWorld))
-    val out = SpotLDAWrapper.runLDA(sparkSession, data, 2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
-      optimizer, ldaMaxIterations, FloatPointPrecisionUtility64)
 
-    val topicMixDF = out.docToTopicMix
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(data, FloatPointPrecisionUtility64, sparkSession)
+    val model: SpotLDAModel = SpotLDAWrapper.run(2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
+      optimizer, ldaMaxIterations, spotLDAHelper)
+
+    val results = model.predict(spotLDAHelper)
+
+    val topicMixDF = results.documentToTopicMix
 
     val topicMix =
-      topicMixDF.filter(topicMixDF(DocumentName) === "pets").select(TopicProbabilityMix).first().toSeq(0)
+      topicMixDF.filter(topicMixDF(DocumentName) === "pets").select(TopicProbabilityMix).first().toSeq.head
         .asInstanceOf[Seq[Double]].toArray
-    val catTopics = out.wordResults("cat")
-    val dogTopics = out.wordResults("dog")
+    val catTopics = results.wordToTopicMix("cat")
+    val dogTopics = results.wordToTopicMix("dog")
 
     Math.abs(topicMix(0) * catTopics(0) + topicMix(1) * catTopics(1)) should be < 0.01
     Math.abs(0.999 - (topicMix(0) * dogTopics(0) + topicMix(1) * dogTopics(1))) should be < 0.01
@@ -127,7 +132,7 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
     logger.setLevel(Level.WARN)
 
-    val ldaAlpha =  0.0009
+    val ldaAlpha = 0.0009
     val ldaBeta = 0.00001
     val ldaMaxIterations = 400
     val optimizer = "online"
@@ -136,20 +141,25 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val dogWorld = SpotLDAInput("dog world", "dog", 1)
 
     val data = sparkSession.sparkContext.parallelize(Seq(catFancy, dogWorld))
-    val out = SpotLDAWrapper.runLDA(sparkSession, data, 2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
-      optimizer, ldaMaxIterations, FloatPointPrecisionUtility64)
 
-    val topicMixDF = out.docToTopicMix
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(data, FloatPointPrecisionUtility64, sparkSession)
+    val model: SpotLDAModel = SpotLDAWrapper.run(2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
+      optimizer, ldaMaxIterations, spotLDAHelper)
+
+    val results = model.predict(spotLDAHelper)
+
+    val topicMixDF = results.documentToTopicMix
+
     val dogTopicMix: Array[Double] =
       topicMixDF.filter(topicMixDF(DocumentName) === "dog world").select(TopicProbabilityMix).first()
-        .toSeq(0).asInstanceOf[Seq[Double]].toArray
+        .toSeq.head.asInstanceOf[Seq[Double]].toArray
 
     val catTopicMix: Array[Double] =
       topicMixDF.filter(topicMixDF(DocumentName) === "cat fancy").select(TopicProbabilityMix).first()
-        .toSeq(0).asInstanceOf[Seq[Double]].toArray
+        .toSeq.head.asInstanceOf[Seq[Double]].toArray
 
-    val catTopics = out.wordResults("cat")
-    val dogTopics = out.wordResults("dog")
+    val catTopics = results.wordToTopicMix("cat")
+    val dogTopics = results.wordToTopicMix("dog")
 
     Math.abs(1 - (catTopicMix(0) * catTopics(0) + catTopicMix(1) * catTopics(1))) should be < 0.01
     Math.abs(1 - (dogTopicMix(0) * dogTopics(0) + dogTopicMix(1) * dogTopics(1))) should be < 0.01
@@ -159,7 +169,7 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
     logger.setLevel(Level.WARN)
 
-    val ldaAlpha =  1.02
+    val ldaAlpha = 1.02
     val ldaBeta = 1.001
     val ldaMaxIterations = 20
 
@@ -169,16 +179,20 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val dogWorld = SpotLDAInput("pets", "dog", 999)
 
     val data = sparkSession.sparkContext.parallelize(Seq(catFancy, dogWorld))
-    val out = SpotLDAWrapper.runLDA(sparkSession, data, 2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
-      optimizer, ldaMaxIterations, FloatPointPrecisionUtility32)
 
-    val topicMixDF = out.docToTopicMix
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(data, FloatPointPrecisionUtility32, sparkSession)
+    val model: SpotLDAModel = SpotLDAWrapper.run(2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
+      optimizer, ldaMaxIterations, spotLDAHelper)
+
+    val results = model.predict(spotLDAHelper)
+
+    val topicMixDF = results.documentToTopicMix
 
     val topicMix =
-      topicMixDF.filter(topicMixDF(DocumentName) === "pets").select(TopicProbabilityMix).first().toSeq(0)
+      topicMixDF.filter(topicMixDF(DocumentName) === "pets").select(TopicProbabilityMix).first().toSeq.head
         .asInstanceOf[Seq[Float]].toArray
-    val catTopics = out.wordResults("cat")
-    val dogTopics = out.wordResults("dog")
+    val catTopics = results.wordToTopicMix("cat")
+    val dogTopics = results.wordToTopicMix("dog")
 
     Math.abs(topicMix(0).toDouble * catTopics(0) + topicMix(1).toDouble * catTopics(1)) should be < 0.01
     Math.abs(0.999 - (topicMix(0).toDouble * dogTopics(0) + topicMix(1).toDouble * dogTopics(1))) should be < 0.01
@@ -188,7 +202,7 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
     logger.setLevel(Level.WARN)
 
-    val ldaAlpha =  1.02
+    val ldaAlpha = 1.02
     val ldaBeta = 1.001
     val ldaMaxIterations = 20
 
@@ -198,134 +212,28 @@ class SpotLDAWrapperTest extends TestingSparkContextFlatSpec with Matchers {
     val dogWorld = SpotLDAInput("dog world", "dog", 1)
 
     val data = sparkSession.sparkContext.parallelize(Seq(catFancy, dogWorld))
-    val out = SpotLDAWrapper.runLDA(sparkSession, data, 2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
-      optimizer, ldaMaxIterations, FloatPointPrecisionUtility32)
 
-    val topicMixDF = out.docToTopicMix
+    val spotLDAHelper: SpotLDAHelper = SpotLDAHelper(data, FloatPointPrecisionUtility32, sparkSession)
+    val model: SpotLDAModel = SpotLDAWrapper.run(2, logger, Some(0xdeadbeef), ldaAlpha, ldaBeta,
+      optimizer, ldaMaxIterations, spotLDAHelper)
+
+    val results = model.predict(spotLDAHelper)
+
+    val topicMixDF = results.documentToTopicMix
+
     val dogTopicMix: Array[Float] =
-      topicMixDF.filter(topicMixDF(DocumentName) === "dog world").select(TopicProbabilityMix).first().toSeq(0)
+      topicMixDF.filter(topicMixDF(DocumentName) === "dog world").select(TopicProbabilityMix).first().toSeq.head
         .asInstanceOf[Seq[Float]].toArray
 
     val catTopicMix: Array[Float] =
-      topicMixDF.filter(topicMixDF(DocumentName) === "cat fancy").select(TopicProbabilityMix).first().toSeq(0)
+      topicMixDF.filter(topicMixDF(DocumentName) === "cat fancy").select(TopicProbabilityMix).first().toSeq.head
         .asInstanceOf[Seq[Float]].toArray
 
-    val catTopics = out.wordResults("cat")
-    val dogTopics = out.wordResults("dog")
+    val catTopics = results.wordToTopicMix("cat")
+    val dogTopics = results.wordToTopicMix("dog")
 
     Math.abs(1 - (catTopicMix(0) * catTopics(0) + catTopicMix(1) * catTopics(1))) should be < 0.01
     Math.abs(1 - (dogTopicMix(0) * dogTopics(0) + dogTopicMix(1) * dogTopics(1))) should be < 0.01
   }
 
-  "formatSparkLDAInput" should "return input in RDD[(Long, Vector)] (collected as Array for testing) format. The index " +
-    "is the docID, values are the vectors of word occurrences in that doc" in {
-
-
-    val documentWordData = sparkSession.sparkContext.parallelize(Seq(SpotLDAInput("192.168.1.1", "333333_7.0_0.0_1.0", 8),
-      SpotLDAInput("10.10.98.123", "1111111_6.0_3.0_5.0", 4),
-      SpotLDAInput("66.23.45.11", "-1_43_7.0_2.0_6.0", 2),
-      SpotLDAInput("192.168.1.1", "-1_80_6.0_1.0_1.0", 5)))
-
-    val wordDictionary = Map("333333_7.0_0.0_1.0" -> 0, "1111111_6.0_3.0_5.0" -> 1, "-1_43_7.0_2.0_6.0" -> 2,
-      "-1_80_6.0_1.0_1.0" -> 3)
-
-    val documentDictionary: DataFrame = sparkSession.createDataFrame(documentWordData
-      .map({ case SpotLDAInput(doc, word, count) => doc })
-      .distinct
-      .zipWithIndex.map({ case (d, c) => Row(d, c) }), StructType(List(DocumentNameField, DocumentNumberField)))
-
-
-    val sparkLDAInput: RDD[(Long, Vector)] = SpotLDAWrapper.formatSparkLDAInput(documentWordData,
-      documentDictionary, wordDictionary, sparkSession)
-    val sparkLDAInArr: Array[(Long, Vector)] = sparkLDAInput.collect()
-
-    sparkLDAInArr shouldBe Array((0, Vectors.sparse(4, Array(0, 3), Array(8.0, 5.0))), (2, Vectors.sparse(4, Array
-    (2), Array(2.0))), (1, Vectors.sparse(4, Array(1), Array(4.0))))
-  }
-
-  "formatSparkLDADocTopicOutput" should "return RDD[(String,Array(Double))] after converting doc results from vector " +
-    "using PrecisionUtilityDouble: convert docID back to string, convert vector of probabilities to array" in {
-
-    val documentWordData = sparkSession.sparkContext.parallelize(Seq(SpotLDAInput("192.168.1.1", "333333_7.0_0.0_1.0", 8),
-      SpotLDAInput("10.10.98.123", "1111111_6.0_3.0_5.0", 4),
-      SpotLDAInput("66.23.45.11", "-1_43_7.0_2.0_6.0", 2),
-      SpotLDAInput("192.168.1.1", "-1_80_6.0_1.0_1.0", 5)))
-
-    val documentDictionary: DataFrame = sparkSession.createDataFrame(documentWordData
-      .map({ case SpotLDAInput(doc, word, count) => doc })
-      .distinct
-      .zipWithIndex.map({ case (d, c) => Row(d, c) }), StructType(List(DocumentNameField, DocumentNumberField)))
-
-    val docTopicDist: RDD[(Long, Vector)] = sparkSession.sparkContext.parallelize(
-      Array((0.toLong, Vectors.dense(0.15, 0.3, 0.5, 0.05)),
-        (1.toLong, Vectors.dense(0.25, 0.15, 0.4, 0.2)),
-        (2.toLong, Vectors.dense(0.4, 0.1, 0.3, 0.2))))
-
-    val sparkDocRes: DataFrame = formatSparkLDADocTopicOutput(docTopicDist, documentDictionary, sparkSession,
-      FloatPointPrecisionUtility64)
-
-    import testImplicits._
-    val documents = sparkDocRes.map({ case Row(documentName: String, docProbabilities: Seq[Double]) => (documentName,
-      docProbabilities)
-    }).collect
-
-    val documentProbabilities = sparkDocRes.select(TopicProbabilityMix).first.toSeq(0).asInstanceOf[Seq[Double]]
-
-    documents should contain("192.168.1.1", Seq(0.15, 0.3, 0.5, 0.05))
-    documents should contain("10.10.98.123", Seq(0.25, 0.15, 0.4, 0.2))
-    documents should contain("66.23.45.11", Seq(0.4, 0.1, 0.3, 0.2))
-
-    documentProbabilities(0) shouldBe a[java.lang.Double]
-
-  }
-
-  it should "return RDD[(String,Array(Float))] after converting doc results from vector " +
-    "using PrecisionUtilityFloat: convert docID back to string, convert vector of probabilities to array" in {
-
-    val documentWordData = sparkSession.sparkContext.parallelize(Seq(SpotLDAInput("192.168.1.1", "333333_7.0_0.0_1.0", 8),
-      SpotLDAInput("10.10.98.123", "1111111_6.0_3.0_5.0", 4),
-      SpotLDAInput("66.23.45.11", "-1_43_7.0_2.0_6.0", 2),
-      SpotLDAInput("192.168.1.1", "-1_80_6.0_1.0_1.0", 5)))
-
-    val documentDictionary: DataFrame = sparkSession.createDataFrame(documentWordData
-      .map({ case SpotLDAInput(doc, word, count) => doc })
-      .distinct
-      .zipWithIndex.map({ case (d, c) => Row(d, c) }), StructType(List(DocumentNameField, DocumentNumberField)))
-
-    val docTopicDist: RDD[(Long, Vector)] = sparkSession.sparkContext.parallelize(
-      Array((0.toLong, Vectors.dense(0.15, 0.3, 0.5, 0.05)),
-        (1.toLong, Vectors.dense(0.25, 0.15, 0.4, 0.2)),
-        (2.toLong, Vectors.dense(0.4, 0.1, 0.3, 0.2))))
-
-    val sparkDocRes: DataFrame = formatSparkLDADocTopicOutput(docTopicDist, documentDictionary, sparkSession,
-      FloatPointPrecisionUtility32)
-
-    import testImplicits._
-    val documents = sparkDocRes.map({ case Row(documentName: String, docProbabilities: Seq[Float]) => (documentName,
-      docProbabilities)
-    }).collect
-
-    val documentProbabilities = sparkDocRes.select(TopicProbabilityMix).first.toSeq(0).asInstanceOf[Seq[Float]]
-
-    documents should contain("192.168.1.1", Seq(0.15f, 0.3f, 0.5f, 0.05f))
-    documents should contain("10.10.98.123", Seq(0.25f, 0.15f, 0.4f, 0.2f))
-    documents should contain("66.23.45.11", Seq(0.4f, 0.1f, 0.3f, 0.2f))
-
-    documentProbabilities(0) shouldBe a[java.lang.Float]
-  }
-
-  "formatSparkLDAWordOutput" should "return Map[Int,String] after converting word matrix to sequence, wordIDs back " +
-    "to strings, and sequence of probabilities to array" in {
-    val testMat = Matrices.dense(4, 4, Array(0.5, 0.2, 0.05, 0.25, 0.25, 0.1, 0.15, 0.5, 0.1, 0.4, 0.25, 0.25, 0.7, 0.2, 0.02, 0.08))
-
-    val wordDictionary = Map("-1_23.0_7.0_7.0_4.0" -> 3, "23.0_7.0_7.0_4.0" -> 0, "333333.0_7.0_7.0_4.0" -> 2, "80.0_7.0_7.0_4.0" -> 1)
-    val revWordMap: Map[Int, String] = wordDictionary.map(_.swap)
-
-    val sparkWordRes = formatSparkLDAWordOutput(testMat, revWordMap)
-
-    sparkWordRes should contain key ("23.0_7.0_7.0_4.0")
-    sparkWordRes should contain key ("80.0_7.0_7.0_4.0")
-    sparkWordRes should contain key ("333333.0_7.0_7.0_4.0")
-    sparkWordRes should contain key ("-1_23.0_7.0_7.0_4.0")
-  }
 }
\ No newline at end of file


[23/42] incubator-spot git commit: [SPOT-213] [INGEST] adding common functions for hdfs, hive with support for kerberos

Posted by na...@apache.org.
[SPOT-213] [INGEST] adding common functions for hdfs, hive with support for kerberos


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/1582c4c1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/1582c4c1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/1582c4c1

Branch: refs/heads/SPOT-181_ODM
Commit: 1582c4c1ad10358d672f0ebef5d2c88ac225c65b
Parents: 13e35fc
Author: natedogs911 <na...@gmail.com>
Authored: Fri Jan 19 09:40:50 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Fri Jan 19 09:40:50 2018 -0800

----------------------------------------------------------------------
 spot-ingest/common/configurator.py | 119 ++++++++++++++++
 spot-ingest/common/hdfs_client.py  | 233 ++++++++++++++++++++++++++++++++
 spot-ingest/common/hive_engine.py  |  73 ++++++++++
 3 files changed, 425 insertions(+)
----------------------------------------------------------------------
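
For orientation, a minimal usage sketch of how the three modules added below are intended to fit together from an ingest process; this is not part of the patch, the common.* import paths follow the spot-ingest/common layout, and the upload path and query are illustrative assumptions only:

    import common.configurator as config
    import common.hdfs_client as hdfs
    import common.hive_engine as hive

    # configuration values are read from /etc/spot.conf
    name_node, web_port, hdfs_user = config.hdfs()

    # Kerberos/SSL-aware WebHDFS client, reused across calls
    client = hdfs.get_client()
    hdfs.upload_file('/user/{0}/flow/csv'.format(hdfs_user), '/tmp/flows.csv', client=client)

    # run a query through HiveServer2 with the same spot.conf settings
    tables = hive.execute_query('SHOW TABLES', fetch=True)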


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/1582c4c1/spot-ingest/common/configurator.py
----------------------------------------------------------------------
diff --git a/spot-ingest/common/configurator.py b/spot-ingest/common/configurator.py
new file mode 100644
index 0000000..6fe0ede
--- /dev/null
+++ b/spot-ingest/common/configurator.py
@@ -0,0 +1,119 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import ConfigParser
+from io import open
+
+
+def configuration():
+
+    config = ConfigParser.ConfigParser()
+
+    try:
+        conf = open("/etc/spot.conf", "r")
+    except (OSError, IOError) as e:
+        print("Error opening: spot.conf error: {0}".format(e.errno))
+        raise e
+
+    config.readfp(SecHead(conf))
+    return config
+
+
+def db():
+    conf = configuration()
+    return conf.get('conf', 'DBNAME').replace("'", "").replace('"', '')
+
+
+def impala():
+    conf = configuration()
+    return conf.get('conf', 'IMPALA_DEM'), conf.get('conf', 'IMPALA_PORT')
+
+
+def hive():
+    conf = configuration()
+    return conf.get('conf', 'HS2_HOST'), conf.get('conf', 'HS2_PORT')
+
+
+def hdfs():
+    conf = configuration()
+    name_node = conf.get('conf',"NAME_NODE")
+    web_port = conf.get('conf',"WEB_PORT")
+    hdfs_user = conf.get('conf',"HUSER")
+    hdfs_user = hdfs_user.split("/")[-1].replace("'", "").replace('"', '')
+    return name_node,web_port,hdfs_user
+
+
+def spot():
+    conf = configuration()
+    return conf.get('conf',"HUSER").replace("'", "").replace('"', '')
+
+
+def kerberos_enabled():
+    conf = configuration()
+    enabled = conf.get('conf', 'KERBEROS').replace("'", "").replace('"', '')
+    if enabled.lower() == 'true':
+        return True
+    else:
+        return False
+
+
+def kerberos():
+    conf = configuration()
+    if kerberos_enabled():
+        principal = conf.get('conf', 'PRINCIPAL')
+        keytab = conf.get('conf', 'KEYTAB')
+        sasl_mech = conf.get('conf', 'SASL_MECH')
+        security_proto = conf.get('conf', 'SECURITY_PROTO')
+        return principal, keytab, sasl_mech, security_proto
+    else:
+        raise KeyError
+
+
+def ssl_enabled():
+    conf = configuration()
+    enabled = conf.get('conf', 'SSL')
+    if enabled.lower() == 'true':
+        return True
+    else:
+        return False
+
+
+def ssl():
+    conf = configuration()
+    if ssl_enabled():
+        ssl_verify = conf.get('conf', 'SSL_VERIFY')
+        ca_location = conf.get('conf', 'CA_LOCATION')
+        cert = conf.get('conf', 'CERT')
+        key = conf.get('conf', 'KEY')
+        return ssl_verify, ca_location, cert, key
+    else:
+        raise KeyError
+
+
+class SecHead(object):
+    def __init__(self, fp):
+        self.fp = fp
+        self.sechead = '[conf]\n'
+
+    def readline(self):
+        if self.sechead:
+            try:
+                return self.sechead
+            finally:
+                self.sechead = None
+        else:
+            return self.fp.readline()
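
As a quick illustration of the accessors above (a sketch only; it assumes /etc/spot.conf exists and defines the DBNAME, IMPALA, KERBEROS and SSL keys the module reads):

    import common.configurator as config

    db_name = config.db()                      # DBNAME with quotes stripped
    impala_daemon, impala_port = config.impala()

    if config.kerberos_enabled():
        principal, keytab, sasl_mech, security_proto = config.kerberos()

    if config.ssl_enabled():
        ssl_verify, ca_location, cert, key = config.ssl()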

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/1582c4c1/spot-ingest/common/hdfs_client.py
----------------------------------------------------------------------
diff --git a/spot-ingest/common/hdfs_client.py b/spot-ingest/common/hdfs_client.py
new file mode 100644
index 0000000..5605e9c
--- /dev/null
+++ b/spot-ingest/common/hdfs_client.py
@@ -0,0 +1,233 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from hdfs import InsecureClient
+from hdfs.util import HdfsError
+from hdfs import Client
+from hdfs.ext.kerberos import KerberosClient
+from requests import Session
+from json import dump
+from threading import Lock
+import logging
+import configurator as Config
+from sys import stderr
+
+
+class Progress(object):
+
+    """Basic progress tracker callback."""
+
+    def __init__(self, hdfs_path, nbytes):
+        self._data = {}
+        self._lock = Lock()
+        self._hpath = hdfs_path
+        self._nbytes = nbytes
+
+    def __call__(self):
+        with self._lock:
+            if self._nbytes >= 0:
+                self._data[self._hpath] = self._nbytes
+            else:
+                stderr.write('%s\n' % (sum(self._data.values()), ))
+
+
+class SecureKerberosClient(KerberosClient):
+
+    """A new client subclass for handling HTTPS connections with Kerberos.
+
+    :param url: URL to namenode.
+    :param cert: Local certificate. See `requests` documentation for details
+      on how to use this.
+    :param verify: Whether to check the host's certificate. WARNING: disabling verification is for non-production use only.
+    :param \*\*kwargs: Keyword arguments passed to the default `Client`
+      constructor.
+
+    """
+
+    def __init__(self, url, mutual_auth, cert=None, verify='true', **kwargs):
+
+        self._logger = logging.getLogger("SPOT.INGEST.HDFS_client")
+        session = Session()
+
+        if verify == 'true':
+            self._logger.info('SSL verification enabled')
+            session.verify = True
+            if cert is not None:
+                self._logger.info('SSL Cert: ' + cert)
+                if ',' in cert:
+                    session.cert = [path.strip() for path in cert.split(',')]
+                else:
+                    session.cert = cert
+        elif verify == 'false':
+            session.verify = False
+
+        super(SecureKerberosClient, self).__init__(url, mutual_auth, session=session, **kwargs)
+
+
+class HdfsException(HdfsError):
+    def __init__(self, message):
+        super(HdfsException, self).__init__(message)
+        self.message = message
+
+
+def get_client(user=None):
+    # type: (object) -> Client
+
+    logger = logging.getLogger('SPOT.INGEST.HDFS.get_client')
+    hdfs_nm, hdfs_port, hdfs_user = Config.hdfs()
+    conf = {'url': '{0}:{1}'.format(hdfs_nm, hdfs_port),
+            'mutual_auth': 'OPTIONAL'
+            }
+
+    if Config.ssl_enabled():
+        ssl_verify, ca_location, cert, key = Config.ssl()
+        conf.update({'verify': ssl_verify.lower()})
+        if cert:
+            conf.update({'cert': cert})
+
+    if Config.kerberos_enabled():
+        # TODO: handle other conditions
+        krb_conf = {'mutual_auth': 'OPTIONAL'}
+        conf.update(krb_conf)
+
+    # TODO: possible user parameter
+    logger.info('Client conf:')
+    for k,v in conf.iteritems():
+        logger.info(k + ': ' + v)
+
+    client = SecureKerberosClient(**conf)
+
+    return client
+
+
+def get_file(hdfs_file, client=None):
+    if not client:
+        client = get_client()
+
+    with client.read(hdfs_file) as reader:
+        results = reader.read()
+        return results
+
+
+def upload_file(hdfs_fp, local_fp, overwrite=False, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        result = client.upload(hdfs_fp, local_fp, overwrite=overwrite, progress=Progress)
+        return result
+    except HdfsError as err:
+        return err
+
+
+def download_file(hdfs_path, local_path, overwrite=False, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        client.download(hdfs_path, local_path, overwrite=overwrite)
+        return True
+    except HdfsError:
+        return False
+
+
+def mkdir(hdfs_path, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        client.makedirs(hdfs_path)
+        return True
+    except HdfsError:
+        return False
+
+
+def put_file_csv(hdfs_file_content,hdfs_path,hdfs_file_name,append_file=False,overwrite_file=False, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        hdfs_full_name = "{0}/{1}".format(hdfs_path,hdfs_file_name)
+        with client.write(hdfs_full_name,append=append_file,overwrite=overwrite_file) as writer:
+            for item in hdfs_file_content:
+                data = ','.join(str(d) for d in item)
+                writer.write("{0}\n".format(data))
+        return True
+
+    except HdfsError:
+        return False
+
+
+def put_file_json(hdfs_file_content,hdfs_path,hdfs_file_name,append_file=False,overwrite_file=False, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        hdfs_full_name = "{0}/{1}".format(hdfs_path,hdfs_file_name)
+        with client.write(hdfs_full_name,append=append_file,overwrite=overwrite_file,encoding='utf-8') as writer:
+            dump(hdfs_file_content, writer)
+        return True
+    except HdfsError:
+        return False
+
+
+def delete_folder(hdfs_file, user=None, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        client.delete(hdfs_file,recursive=True)
+    except HdfsError:
+        return False
+
+
+def check_dir(hdfs_path, client=None):
+    """
+    Returns True if directory exists
+    Returns False if directory does not exist
+    :param hdfs_path: path to check
+    :param client: HDFS client object for a persistent connection
+    """
+    if not client:
+        client = get_client()
+
+    result = client.list(hdfs_path)
+    if None not in result:
+        return True
+    else:
+        return False
+
+
+def list_dir(hdfs_path, client=None):
+    if not client:
+        client = get_client()
+
+    try:
+        return client.list(hdfs_path)
+    except HdfsError:
+        return {}
+
+
+def file_exists(hdfs_path, file_name, client=None):
+    if not client:
+        client = get_client()
+
+    files = list_dir(hdfs_path, client)
+    if str(file_name) in files:
+        return True
+    else:
+        return False
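
For context, a minimal usage sketch of the helpers added above, assuming the module is importable as common.hdfs_client; the import path, directory names and sample rows are illustrative placeholders, not project defaults:

    from common import hdfs_client as hdfs   # import path assumed for this sketch

    # Build one WebHDFS client from spot.conf and reuse it for every call below.
    client = hdfs.get_client()

    # Create a staging directory and write a small CSV payload into it.
    hdfs.mkdir('/user/spot/stage', client=client)
    rows = [(1, 'example.com', 200), (2, 'example.org', 404)]
    hdfs.put_file_csv(rows, '/user/spot/stage', 'sample.csv',
                      overwrite_file=True, client=client)

    # Confirm the file landed, then read it back.
    if hdfs.file_exists('/user/spot/stage', 'sample.csv', client=client):
        print(hdfs.get_file('/user/spot/stage/sample.csv', client=client))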

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/1582c4c1/spot-ingest/common/hive_engine.py
----------------------------------------------------------------------
diff --git a/spot-ingest/common/hive_engine.py b/spot-ingest/common/hive_engine.py
new file mode 100644
index 0000000..eb3d79e
--- /dev/null
+++ b/spot-ingest/common/hive_engine.py
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from impala.dbapi import connect
+import common.configurator as config
+
+
+def create_connection():
+
+    host, port = config.hive()
+    conf = {}
+
+    # TODO: if using hive, kerberos service name must be changed, impyla sets 'impala' as default
+    conf.update({'kerberos_service_name': 'hive'})
+
+    if config.kerberos_enabled():
+        principal, keytab, sasl_mech, security_proto = config.kerberos()
+        conf.update({'auth_mechanism': 'GSSAPI',
+                     })
+    else:
+        conf.update({'auth_mechanism': 'PLAIN',
+                     })
+
+    if config.ssl_enabled():
+        ssl_verify, ca_location, cert, key = config.ssl()
+        if ssl_verify.lower() == 'false':
+            conf.update({'use_ssl': ssl_verify})
+        else:
+            conf.update({'ca_cert': cert,
+                         'use_ssl': ssl_verify
+                         })
+
+    db = config.db()
+    conn = connect(host=host, port=int(port), database=db, **conf)
+    return conn.cursor()
+
+
+def execute_query(query,fetch=False):
+
+    impala_cursor = create_connection()
+    impala_cursor.execute(query)
+
+    return impala_cursor if not fetch else impala_cursor.fetchall()
+
+
+def execute_query_as_list(query):
+
+    query_results = execute_query(query)
+    row_result = {}
+    results = []
+
+    for row in query_results:
+        x=0
+        for header in query_results.description:
+            row_result[header[0]] = row[x]
+            x +=1
+        results.append(row_result)
+        row_result = {}
+
+    return results


[29/42] incubator-spot git commit: close #107

Posted by na...@apache.org.
close #107


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/9215d816
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/9215d816
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/9215d816

Branch: refs/heads/SPOT-181_ODM
Commit: 9215d81689105a6d4b47198654a3acbb735c9716
Parents: 81c371c e15d38c
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 23 17:10:17 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 23 17:10:17 2018 -0800

----------------------------------------------------------------------
 LICENSE                                         |  85 +++-
 LICENSE-topojson.txt                            |  26 -
 dev/release/.rat-excludes                       |  11 +
 dev/release/README.md                           | 474 +++++++++++++++++++
 spot-ingest/common/kafka_topic.sh               |   2 +-
 spot-ingest/master_collector.py                 |  60 +--
 spot-ingest/start_ingest_standalone.sh          |   2 +-
 spot-ingest/worker.py                           |  56 ++-
 spot-ml/DATA_SAMPLE.md                          |  57 +++
 spot-ml/ml_ops.sh                               |   2 +-
 spot-ml/ml_test.sh                              |   4 +-
 spot-ml/project/build.properties                |   1 +
 .../dns/model/DNSSuspiciousConnectsModel.scala  |  43 +-
 .../org/apache/spot/lda/SpotLDAHelper.scala     | 173 +++++++
 .../org/apache/spot/lda/SpotLDAModel.scala      | 139 ++++++
 .../org/apache/spot/lda/SpotLDAResult.scala     |  43 ++
 .../org/apache/spot/lda/SpotLDAWrapper.scala    | 226 ++-------
 .../model/FlowSuspiciousConnectsModel.scala     |  27 +-
 .../proxy/ProxySuspiciousConnectsModel.scala    |  25 +-
 .../org/apache/spot/utilities/TopDomains.scala  |   1 -
 .../org/apache/spot/lda/SpotLDAHelperTest.scala | 133 ++++++
 .../apache/spot/lda/SpotLDAWrapperTest.scala    | 236 +++------
 spot-oa/api/resources/flow.py                   |   6 +-
 spot-oa/oa/dns/dns_oa.py                        |   4 +-
 spot-oa/oa/flow/flow_oa.py                      |  43 +-
 spot-oa/requirements.txt                        |   2 +-
 spot-oa/runIpython.sh                           |   2 +-
 spot-setup/hdfs_setup.sh                        |   2 +-
 28 files changed, 1363 insertions(+), 522 deletions(-)
----------------------------------------------------------------------



[32/42] incubator-spot git commit: style fixes argparse

Posted by na...@apache.org.
style fixes argparse


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/51040a2c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/51040a2c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/51040a2c

Branch: refs/heads/SPOT-181_ODM
Commit: 51040a2c112f81ec99873b882c65bf19ba45e1fb
Parents: df86326
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 11:32:25 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 11:33:03 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/51040a2c/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 5e36a4e..898ff2e 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -68,12 +68,18 @@ def main():
     """
     # input Parameters
     parser = argparse.ArgumentParser(description="Bluecoat Parser")
-    parser.add_argument('-zk','--zookeeper',dest='zk',required=True,help='Zookeeper IP and port (i.e. 10.0.0.1:2181)',metavar='')
-    parser.add_argument('-t','--topic',dest='topic',required=True,help='Topic to listen for Spark Streaming',metavar='')
-    parser.add_argument('-db','--database',dest='db',required=True,help='Hive database whete the data will be ingested',metavar='')
-    parser.add_argument('-dt','--db-table',dest='db_table',required=True,help='Hive table whete the data will be ingested',metavar='')
-    parser.add_argument('-w','--num_of_workers',dest='num_of_workers',required=True,help='Num of workers for Parallelism in Data Processing',metavar='')
-    parser.add_argument('-bs','--batch-size',dest='batch_size',required=True,help='Batch Size (Milliseconds)',metavar='')
+    parser.add_argument('-zk', '--zookeeper', dest='zk', required=True,
+                        help='Zookeeper IP and port (i.e. 10.0.0.1:2181)', metavar='')
+    parser.add_argument('-t', '--topic', dest='topic', required=True,
+                        help='Topic to listen for Spark Streaming', metavar='')
+    parser.add_argument('-db', '--database', dest='db', required=True,
+                        help='Hive database where the data will be ingested', metavar='')
+    parser.add_argument('-dt', '--db-table', dest='db_table', required=True,
+                        help='Hive table where the data will be ingested', metavar='')
+    parser.add_argument('-w', '--num_of_workers', dest='num_of_workers', required=True,
+                        help='Num of workers for Parallelism in Data Processing', metavar='')
+    parser.add_argument('-bs', '--batch-size', dest='batch_size', required=True,
+                        help='Batch Size (Milliseconds)', metavar='')
     args = parser.parse_args()
 
     # start collector based on data source type.
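
For reference, a stand-alone sketch of how the reformatted options above parse a sample command line; every flag value below is a placeholder chosen for illustration:

    import argparse

    # Rebuild just the argument definitions from bluecoat.py (help texts omitted).
    parser = argparse.ArgumentParser(description="Bluecoat Parser")
    parser.add_argument('-zk', '--zookeeper', dest='zk', required=True, metavar='')
    parser.add_argument('-t', '--topic', dest='topic', required=True, metavar='')
    parser.add_argument('-db', '--database', dest='db', required=True, metavar='')
    parser.add_argument('-dt', '--db-table', dest='db_table', required=True, metavar='')
    parser.add_argument('-w', '--num_of_workers', dest='num_of_workers', required=True, metavar='')
    parser.add_argument('-bs', '--batch-size', dest='batch_size', required=True, metavar='')

    # Equivalent to:
    #   bluecoat.py -zk 10.0.0.1:2181 -t SPOT-INGEST-proxy -db spotdb -dt proxy -w 4 -bs 30000
    args = parser.parse_args(['-zk', '10.0.0.1:2181', '-t', 'SPOT-INGEST-proxy',
                              '-db', 'spotdb', '-dt', 'proxy', '-w', '4', '-bs', '30000'])
    print(args)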


[12/42] incubator-spot git commit: Merge 'pr/129' to close apache/incubator-spot#129

Posted by na...@apache.org.
Merge 'pr/129' to close apache/incubator-spot#129


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/b5299b54
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/b5299b54
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/b5299b54

Branch: refs/heads/SPOT-181_ODM
Commit: b5299b54a7a294f17d083521a92bc4b4d580a983
Parents: a07b3eb ce70f88
Author: natedogs911 <na...@gmail.com>
Authored: Tue Jan 9 19:16:38 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Tue Jan 9 19:16:38 2018 -0800

----------------------------------------------------------------------
 spot-ingest/master_collector.py | 60 ++++++++++++++++++++----------------
 spot-ingest/worker.py           | 56 ++++++++++++++++++---------------
 2 files changed, 64 insertions(+), 52 deletions(-)
----------------------------------------------------------------------



[31/42] incubator-spot git commit: fixed indentation

Posted by na...@apache.org.
fixed indentation


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/df86326b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/df86326b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/df86326b

Branch: refs/heads/SPOT-181_ODM
Commit: df86326bf991a4f49c57e8aeeb4af0d8b059b21b
Parents: a99404b
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 11:30:12 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 11:30:12 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 64 ++++++++++++++--------------
 1 file changed, 32 insertions(+), 32 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/df86326b/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 1fe02a2..5e36a4e 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -28,38 +28,38 @@ from pyspark.sql.types import *
 rex_date = re.compile("\d{4}-\d{2}-\d{2}")
 
 proxy_schema = StructType([
-                                    StructField("p_date", StringType(), True),
-                                    StructField("p_time", StringType(), True),
-                                    StructField("clientip", StringType(), True),
-                                    StructField("host", StringType(), True),
-                                    StructField("reqmethod", StringType(), True),
-                                    StructField("useragent", StringType(), True),
-                                    StructField("resconttype", StringType(), True),
-                                    StructField("duration", IntegerType(), True),
-                                    StructField("username", StringType(), True),
-                                    StructField("authgroup", StringType(), True),
-                                    StructField("exceptionid", StringType(), True),
-                                    StructField("filterresult", StringType(), True),
-                                    StructField("webcat", StringType(), True),
-                                    StructField("referer", StringType(), True),
-                                    StructField("respcode", StringType(), True),
-                                    StructField("action", StringType(), True),
-                                    StructField("urischeme", StringType(), True),
-                                    StructField("uriport", StringType(), True),
-                                    StructField("uripath", StringType(), True),
-                                    StructField("uriquery", StringType(), True),
-                                    StructField("uriextension", StringType(), True),
-                                    StructField("serverip", StringType(), True),
-                                    StructField("scbytes", IntegerType(), True),
-                                    StructField("csbytes", IntegerType(), True),
-                                    StructField("virusid", StringType(), True),
-                                    StructField("bcappname", StringType(), True),
-                                    StructField("bcappoper", StringType(), True),
-                                    StructField("fulluri", StringType(), True),
-                                    StructField("y", StringType(), True),
-                                    StructField("m", StringType(), True),
-                                    StructField("d", StringType(), True),
-                                    StructField("h", StringType(), True)])
+    StructField("p_date", StringType(), True),
+    StructField("p_time", StringType(), True),
+    StructField("clientip", StringType(), True),
+    StructField("host", StringType(), True),
+    StructField("reqmethod", StringType(), True),
+    StructField("useragent", StringType(), True),
+    StructField("resconttype", StringType(), True),
+    StructField("duration", IntegerType(), True),
+    StructField("username", StringType(), True),
+    StructField("authgroup", StringType(), True),
+    StructField("exceptionid", StringType(), True),
+    StructField("filterresult", StringType(), True),
+    StructField("webcat", StringType(), True),
+    StructField("referer", StringType(), True),
+    StructField("respcode", StringType(), True),
+    StructField("action", StringType(), True),
+    StructField("urischeme", StringType(), True),
+    StructField("uriport", StringType(), True),
+    StructField("uripath", StringType(), True),
+    StructField("uriquery", StringType(), True),
+    StructField("uriextension", StringType(), True),
+    StructField("serverip", StringType(), True),
+    StructField("scbytes", IntegerType(), True),
+    StructField("csbytes", IntegerType(), True),
+    StructField("virusid", StringType(), True),
+    StructField("bcappname", StringType(), True),
+    StructField("bcappoper", StringType(), True),
+    StructField("fulluri", StringType(), True),
+    StructField("y", StringType(), True),
+    StructField("m", StringType(), True),
+    StructField("d", StringType(), True),
+    StructField("h", StringType(), True)])
 
 def main():
     """


[41/42] incubator-spot git commit: Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spot into pr/134

Posted by na...@apache.org.
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spot into pr/134


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/14dbd511
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/14dbd511
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/14dbd511

Branch: refs/heads/SPOT-181_ODM
Commit: 14dbd511d78259dd5104873ec1f3d0caee66e3fe
Parents: f594956 f722127
Author: natedogs911 <na...@gmail.com>
Authored: Mon Mar 19 08:11:53 2018 -0700
Committer: natedogs911 <na...@gmail.com>
Committed: Mon Mar 19 08:11:53 2018 -0700

----------------------------------------------------------------------
 DISCLAIMER                              |  11 ++
 spot-ingest/pipelines/proxy/bluecoat.py | 177 ++++++++++++++++++---------
 spot-oa/api/resources/flow.py           |   6 +-
 spot-oa/oa/flow/flow_oa.py              |  43 ++++---
 4 files changed, 155 insertions(+), 82 deletions(-)
----------------------------------------------------------------------



[34/42] incubator-spot git commit: fixed spot_decoder()

Posted by na...@apache.org.
fixed spot_decoder()


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/8ff0e473
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/8ff0e473
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/8ff0e473

Branch: refs/heads/SPOT-181_ODM
Commit: 8ff0e4730dd05e10bf8908ed7691dd85a4cf35ff
Parents: b5cf634
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 11:40:54 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 11:40:54 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 7 +++++++
 1 file changed, 7 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/8ff0e473/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 54c3b28..5667204 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -88,11 +88,17 @@ def main():
 
 
 def spot_decoder(s):
+    """
+    Dummy decoder function.
 
+    :param s: input to decode
+    :returns: s
+    """
     if s is None:
         return None
     return s
 
+
 def split_log_entry(line):
     """
     Split the given line into its fields.
@@ -106,6 +112,7 @@ def split_log_entry(line):
     lex.commenters = ''
     return list(lex)
 
+
 def proxy_parser(proxy_fields):
     """
     Parse and normalize data.
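
The new docstring above makes explicit that spot_decoder() is a pass-through, presumably so the Kafka value-decoder hook has something callable to point at. A trivial, stand-alone check of that behaviour:

    def spot_decoder(s):
        # Identity decoder: hand the raw Kafka message through unchanged.
        if s is None:
            return None
        return s

    assert spot_decoder(None) is None
    assert spot_decoder(b'raw message') == b'raw message'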


[30/42] incubator-spot git commit: added some docstrings

Posted by na...@apache.org.
added some docstrings


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/a99404b0
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/a99404b0
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/a99404b0

Branch: refs/heads/SPOT-181_ODM
Commit: a99404b05045087bfd02d99f4764df1738959566
Parents: 9215d81
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 11:28:26 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 11:28:26 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/a99404b0/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 31d89ca..1fe02a2 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -62,7 +62,10 @@ proxy_schema = StructType([
                                     StructField("h", StringType(), True)])
 
 def main():
-    
+    """
+    Handle commandline arguments and
+    start the collector.
+    """
     # input Parameters
     parser = argparse.ArgumentParser(description="Bluecoat Parser")
     parser.add_argument('-zk','--zookeeper',dest='zk',required=True,help='Zookeeper IP and port (i.e. 10.0.0.1:2181)',metavar='')
@@ -83,7 +86,12 @@ def spot_decoder(s):
     return s
 
 def split_log_entry(line):
+    """
+    Split the given line into its fields.
 
+    :param line: line to split
+    :returns: list
+    """
     lex = shlex.shlex(line)
     lex.quotes = '"'
     lex.whitespace_split = True
@@ -91,7 +99,12 @@ def split_log_entry(line):
     return list(lex)
 
 def proxy_parser(proxy_fields):
-    
+    """
+    Parse and normalize data.
+
+    :param proxy_fields: list with fields from log
+    :returns: list of str
+    """
     proxy_parsed_data = []
 
     if len(proxy_fields) > 1:
@@ -114,7 +127,9 @@ def proxy_parser(proxy_fields):
 
 
 def save_data(rdd,sqc,db,db_table,topic):
-
+    """
+    Create and save a data frame with the given data.
+    """
     if not rdd.isEmpty():
 
         df = sqc.createDataFrame(rdd,proxy_schema)        
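
Since split_log_entry(), documented above, is what keeps quoted Bluecoat fields (such as the user agent) intact, a stand-alone copy of it with a made-up log line shows the behaviour the new docstring describes:

    import shlex

    def split_log_entry(line):
        # Same lexer settings as in bluecoat.py: split on whitespace only,
        # honour double quotes, and disable comment handling.
        lex = shlex.shlex(line)
        lex.quotes = '"'
        lex.whitespace_split = True
        lex.commenters = ''
        return list(lex)

    line = '2018-01-25 11:40:54 10.0.0.5 example.com GET "Mozilla/5.0 (X11)" 200'
    print(split_log_entry(line))
    # -> ['2018-01-25', '11:40:54', '10.0.0.5', 'example.com', 'GET',
    #     '"Mozilla/5.0 (X11)"', '200']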


[33/42] incubator-spot git commit: PEP8 fixes main()

Posted by na...@apache.org.
PEP8 fixes main()


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/b5cf6344
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/b5cf6344
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/b5cf6344

Branch: refs/heads/SPOT-181_ODM
Commit: b5cf6344a075889da041341cd4f5d1545ea5c379
Parents: 51040a2
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 11:36:10 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 11:36:10 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/b5cf6344/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 898ff2e..54c3b28 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -61,6 +61,7 @@ proxy_schema = StructType([
     StructField("d", StringType(), True),
     StructField("h", StringType(), True)])
 
+
 def main():
     """
     Handle commandline arguments and
@@ -83,7 +84,8 @@ def main():
     args = parser.parse_args()
 
     # start collector based on data source type.
-    bluecoat_parse(args.zk,args.topic,args.db,args.db_table,args.num_of_workers,args.batch_size)
+    bluecoat_parse(args.zk, args.topic, args.db, args.db_table, args.num_of_workers, args.batch_size)
+
 
 def spot_decoder(s):
 


[04/42] incubator-spot git commit: Spot-196: Changes: Updated SpotLDAModel, removed unused parameter from apply method.

Posted by na...@apache.org.
Spot-196: Changes:
Updated SpotLDAModel, removed unused parameter from apply method.


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/dbf6f518
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/dbf6f518
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/dbf6f518

Branch: refs/heads/SPOT-181_ODM
Commit: dbf6f518e930fdd8b85e104affe84b93d076c017
Parents: dbdcbaf
Author: Ricardo Barona <ri...@intel.com>
Authored: Mon Aug 7 11:40:33 2017 -0500
Committer: Ricardo Barona <ri...@intel.com>
Committed: Fri Oct 6 15:25:58 2017 -0500

----------------------------------------------------------------------
 spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/dbf6f518/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala
----------------------------------------------------------------------
diff --git a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala
index 669bb69..181dc62 100644
--- a/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala
+++ b/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAModel.scala
@@ -126,11 +126,10 @@ object SpotLDAModel {
     * Factory method, based on instance of ldaModel will generate an object based on DistributedLDAModel
     * implementation or LocalLDAModel.
     *
-    * @param ldaModel
-    * @param spotLDAHelper
+    * @param ldaModel Spark LDAModel
     * @return
     */
-  def apply(ldaModel: LDAModel, spotLDAHelper: SpotLDAHelper = null): SpotLDAModel = {
+  def apply(ldaModel: LDAModel): SpotLDAModel = {
 
     ldaModel match {
       case model: DistributedLDAModel => new SpotDistributedLDAModel(model)


[10/42] incubator-spot git commit: Specify graphql-core version 1.1 support

Posted by na...@apache.org.
Specify graphql-core version 1.1 support

There appear to be backwards incompatibilities with graphql-core 2+, and as a result we need to pin the installed version to 1.1 for everything to install and work properly.


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/10256f4c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/10256f4c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/10256f4c

Branch: refs/heads/SPOT-181_ODM
Commit: 10256f4c1ea4eed93caf86aa831459e97f4ae9d8
Parents: 341eb02
Author: Tadd Wood <ta...@arcadiadata.com>
Authored: Fri Dec 29 11:48:29 2017 -0600
Committer: Tadd Wood <ta...@arcadiadata.com>
Committed: Fri Dec 29 11:48:29 2017 -0600

----------------------------------------------------------------------
 spot-oa/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/10256f4c/spot-oa/requirements.txt
----------------------------------------------------------------------
diff --git a/spot-oa/requirements.txt b/spot-oa/requirements.txt
index 9f3afb8..1faa1b6 100644
--- a/spot-oa/requirements.txt
+++ b/spot-oa/requirements.txt
@@ -16,7 +16,7 @@ ipython == 3.2.1
 # GraphQL API dependencies
 flask
 flask-graphql
-graphql-core
+graphql-core == 1.1.0
 urllib3
 
 # API Resources


[15/42] incubator-spot git commit: [SPOT-213] [OA] add kerberos requirements

Posted by na...@apache.org.
[SPOT-213] [OA] add kerberos requirements


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/7376c5e4
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/7376c5e4
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/7376c5e4

Branch: refs/heads/SPOT-181_ODM
Commit: 7376c5e4ef186365dd19581ffefc2cfe015b7529
Parents: 6deaae3
Author: natedogs911 <na...@gmail.com>
Authored: Thu Jan 18 10:48:07 2018 -0800
Committer: natedogs911 <na...@gmail.com>
Committed: Thu Jan 18 10:48:07 2018 -0800

----------------------------------------------------------------------
 spot-oa/kerberos-requirements.txt | 3 +++
 1 file changed, 3 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/7376c5e4/spot-oa/kerberos-requirements.txt
----------------------------------------------------------------------
diff --git a/spot-oa/kerberos-requirements.txt b/spot-oa/kerberos-requirements.txt
new file mode 100644
index 0000000..ee4cae4
--- /dev/null
+++ b/spot-oa/kerberos-requirements.txt
@@ -0,0 +1,3 @@
+thrift_sasl==0.2.1
+sasl
+requests-kerberos
\ No newline at end of file


[09/42] incubator-spot git commit: PEP8 fixes

Posted by na...@apache.org.
PEP8 fixes


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/ce70f882
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/ce70f882
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/ce70f882

Branch: refs/heads/SPOT-181_ODM
Commit: ce70f882b09189ec62d1ad60f4ff2411acb05c2a
Parents: dbf6f51
Author: tpltnt <tp...@dropcut.net>
Authored: Fri Dec 29 16:54:14 2017 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Fri Dec 29 17:32:39 2017 +0100

----------------------------------------------------------------------
 spot-ingest/master_collector.py | 60 ++++++++++++++++++++----------------
 spot-ingest/worker.py           | 56 ++++++++++++++++++---------------
 2 files changed, 64 insertions(+), 52 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/ce70f882/spot-ingest/master_collector.py
----------------------------------------------------------------------
diff --git a/spot-ingest/master_collector.py b/spot-ingest/master_collector.py
index 9cd91ea..6f6ff7c 100755
--- a/spot-ingest/master_collector.py
+++ b/spot-ingest/master_collector.py
@@ -21,70 +21,76 @@ import argparse
 import os
 import json
 import sys
+import datetime
 from common.utils import Util
 from common.kerberos import Kerberos
 from common.kafka_client import KafkaTopic
-import datetime 
+
 
 # get master configuration.
-script_path = os.path.dirname(os.path.abspath(__file__))
-conf_file = "{0}/ingest_conf.json".format(script_path)
-master_conf = json.loads(open (conf_file).read())
+SCRIPT_PATH = os.path.dirname(os.path.abspath(__file__))
+CONF_FILE = "{0}/ingest_conf.json".format(SCRIPT_PATH)
+MASTER_CONF = json.loads(open(CONF_FILE).read())
 
 def main():
 
     # input Parameters
     parser = argparse.ArgumentParser(description="Master Collector Ingest Daemon")
-    parser.add_argument('-t','--type',dest='type',required=True,help='Type of data that will be ingested (Pipeline Configuration)',metavar='')
-    parser.add_argument('-w','--workers',dest='workers_num',required=True,help='Number of workers for the ingest process',metavar='')
-    parser.add_argument('-id','--ingestId',dest='ingest_id',required=False,help='Ingest ID',metavar='')
+    parser.add_argument('-t', '--type', dest='type', required=True,
+                        help='Type of data that will be ingested (Pipeline Configuration)',
+                        metavar='')
+    parser.add_argument('-w', '--workers', dest='workers_num',
+                        required=True, help='Number of workers for the ingest process',
+                        metavar='')
+    parser.add_argument('-id', '--ingestId', dest='ingest_id',
+                        required=False, help='Ingest ID', metavar='')
     args = parser.parse_args()
 
     # start collector based on data source type.
-    start_collector(args.type,args.workers_num,args.ingest_id)
+    start_collector(args.type, args.workers_num, args.ingest_id)
 
-def start_collector(type,workers_num,id=None):
+def start_collector(type, workers_num, id=None):
 
     # generate ingest id
-    ingest_id = str(datetime.datetime.time(datetime.datetime.now())).replace(":","_").replace(".","_")
-    
+    ingest_id = str(datetime.datetime.time(datetime.datetime.now())).replace(":", "_").replace(".", "_")
+
     # create logger.
     logger = Util.get_logger("SPOT.INGEST")
 
     # validate the given configuration exists in ingest_conf.json.
-    if not type in master_conf["pipelines"]:
-        logger.error("'{0}' type is not a valid configuration.".format(type));
+    if not type in MASTER_CONF["pipelines"]:
+        logger.error("'{0}' type is not a valid configuration.".format(type))
         sys.exit(1)
 
     # validate the type is a valid module.
-    if not Util.validate_data_source(master_conf["pipelines"][type]["type"]):
-        logger.error("'{0}' type is not configured. Please check you ingest conf file".format(master_conf["pipelines"][type]["type"]));
+    if not Util.validate_data_source(MASTER_CONF["pipelines"][type]["type"]):
+        logger.error("'{0}' type is not configured. Please check you ingest conf file".format(MASTER_CONF["pipelines"][type]["type"]))
         sys.exit(1)
-    
+
     # validate if kerberos authentication is required.
     if os.getenv('KRB_AUTH'):
         kb = Kerberos()
         kb.authenticate()
-    
+
     # kafka server info.
     logger.info("Initializing kafka instance")
-    k_server = master_conf["kafka"]['kafka_server']
-    k_port = master_conf["kafka"]['kafka_port']
+    k_server = MASTER_CONF["kafka"]['kafka_server']
+    k_port = MASTER_CONF["kafka"]['kafka_port']
 
     # required zookeeper info.
-    zk_server = master_conf["kafka"]['zookeper_server']
-    zk_port = master_conf["kafka"]['zookeper_port']
-         
-    topic = "SPOT-INGEST-{0}_{1}".format(type,ingest_id) if not id else id
-    kafka = KafkaTopic(topic,k_server,k_port,zk_server,zk_port,workers_num)
+    zk_server = MASTER_CONF["kafka"]['zookeper_server']
+    zk_port = MASTER_CONF["kafka"]['zookeper_port']
+
+    topic = "SPOT-INGEST-{0}_{1}".format(type, ingest_id) if not id else id
+    kafka = KafkaTopic(topic, k_server, k_port, zk_server, zk_port, workers_num)
 
     # create a collector instance based on data source type.
     logger.info("Starting {0} ingest instance".format(topic))
-    module = __import__("pipelines.{0}.collector".format(master_conf["pipelines"][type]["type"]),fromlist=['Collector'])
+    module = __import__("pipelines.{0}.collector".format(MASTER_CONF["pipelines"][type]["type"]), fromlist=['Collector'])
 
     # start collector.
-    ingest_collector = module.Collector(master_conf['hdfs_app_path'],kafka,type)
+    ingest_collector = module.Collector(MASTER_CONF['hdfs_app_path'], kafka, type)
     ingest_collector.start()
 
-if __name__=='__main__':
+if __name__ == '__main__':
     main()
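
The collector class above is resolved at runtime from the pipeline type, which is the pattern that lets new data sources be added without touching this script. A minimal sketch of that dynamic import, with a stand-in pipeline name and importlib as the more idiomatic spelling of __import__ (it only resolves when run from the spot-ingest directory):

    import importlib

    pipeline_type = 'proxy'   # would come from MASTER_CONF["pipelines"][type]["type"]
    module = importlib.import_module('pipelines.{0}.collector'.format(pipeline_type))
    Collector = getattr(module, 'Collector')
    # collector = Collector(hdfs_app_path, kafka, pipeline_type)  # as in start_collector()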

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/ce70f882/spot-ingest/worker.py
----------------------------------------------------------------------
diff --git a/spot-ingest/worker.py b/spot-ingest/worker.py
index db51def..5c29148 100755
--- a/spot-ingest/worker.py
+++ b/spot-ingest/worker.py
@@ -20,42 +20,48 @@
 import argparse
 import os
 import json
-import logging
 import sys
 from common.utils import Util
 from common.kerberos import Kerberos
 from common.kafka_client import KafkaConsumer
 
-script_path = os.path.dirname(os.path.abspath(__file__))
-conf_file = "{0}/ingest_conf.json".format(script_path)
-worker_conf = json.loads(open (conf_file).read())
+SCRIPT_PATH = os.path.dirname(os.path.abspath(__file__))
+CONF_FILE = "{0}/ingest_conf.json".format(SCRIPT_PATH)
+WORKER_CONF = json.loads(open(CONF_FILE).read())
 
 def main():
 
     # input parameters
     parser = argparse.ArgumentParser(description="Worker Ingest Framework")
-    parser.add_argument('-t','--type',dest='type',required=True,help='Type of data that will be ingested (Pipeline Configuration)',metavar='')
-    parser.add_argument('-i','--id',dest='id',required=True,help='Worker Id, this is needed to sync Kafka and Ingest framework (Partition Number)',metavar='')
-    parser.add_argument('-top','--topic',dest='topic',required=True,help='Topic to read from.',metavar="")
-    parser.add_argument('-p','--processingParallelism',dest='processes',required=False,help='Processing Parallelism',metavar="")
+    parser.add_argument('-t', '--type', dest='type', required=True,
+                        help='Type of data that will be ingested (Pipeline Configuration)',
+                        metavar='')
+    parser.add_argument('-i', '--id', dest='id', required=True,
+                        help='Worker Id, this is needed to sync Kafka and Ingest framework (Partition Number)',
+                        metavar='')
+    parser.add_argument('-top', '--topic', dest='topic', required=True,
+                        help='Topic to read from.', metavar="")
+    parser.add_argument('-p', '--processingParallelism', dest='processes',
+                        required=False, help='Processing Parallelism', metavar="")
     args = parser.parse_args()
 
     # start worker based on the type.
-    start_worker(args.type,args.topic,args.id,args.processes)
+    start_worker(args.type, args.topic, args.id, args.processes)
 
 
-def start_worker(type,topic,id,processes=None):
+def start_worker(type, topic, id, processes=None):
 
     logger = Util.get_logger("SPOT.INGEST.WORKER")
 
     # validate the given configuration exists in ingest_conf.json.
-    if not type in worker_conf["pipelines"]:
-        logger.error("'{0}' type is not a valid configuration.".format(type));
+    if not type in WORKER_CONF["pipelines"]:
+        logger.error("'{0}' type is not a valid configuration.".format(type))
         sys.exit(1)
 
     # validate the type is a valid module.
-    if not Util.validate_data_source(worker_conf["pipelines"][type]["type"]):
-        logger.error("The provided data source {0} is not valid".format(type));sys.exit(1)
+    if not Util.validate_data_source(WORKER_CONF["pipelines"][type]["type"]):
+        logger.error("The provided data source {0} is not valid".format(type))
+        sys.exit(1)
 
     # validate if kerberos authentication is required.
     if os.getenv('KRB_AUTH'):
@@ -63,27 +69,27 @@ def start_worker(type,topic,id,processes=None):
         kb.authenticate()
 
     # create a worker instance based on the data source type.
-    module = __import__("pipelines.{0}.worker".format(worker_conf["pipelines"][type]["type"]),fromlist=['Worker'])
+    module = __import__("pipelines.{0}.worker".format(WORKER_CONF["pipelines"][type]["type"]),
+                        fromlist=['Worker'])
 
     # kafka server info.
     logger.info("Initializing kafka instance")
-    k_server = worker_conf["kafka"]['kafka_server']
-    k_port = worker_conf["kafka"]['kafka_port']
+    k_server = WORKER_CONF["kafka"]['kafka_server']
+    k_port = WORKER_CONF["kafka"]['kafka_port']
 
     # required zookeeper info.
-    zk_server = worker_conf["kafka"]['zookeper_server']
-    zk_port = worker_conf["kafka"]['zookeper_port']
+    zk_server = WORKER_CONF["kafka"]['zookeper_server']
+    zk_port = WORKER_CONF["kafka"]['zookeper_port']
     topic = topic
 
     # create kafka consumer.
-    kafka_consumer = KafkaConsumer(topic,k_server,k_port,zk_server,zk_port,id)
+    kafka_consumer = KafkaConsumer(topic, k_server, k_port, zk_server, zk_port, id)
 
     # start worker.
-    db_name = worker_conf['dbname']
-    app_path = worker_conf['hdfs_app_path']
-    ingest_worker = module.Worker(db_name,app_path,kafka_consumer,type,processes)
+    db_name = WORKER_CONF['dbname']
+    app_path = WORKER_CONF['hdfs_app_path']
+    ingest_worker = module.Worker(db_name, app_path, kafka_consumer, type, processes)
     ingest_worker.start()
 
-if __name__=='__main__':
+if __name__ == '__main__':
     main()
-
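
Both scripts in this patch read their settings from ingest_conf.json. A sketch of the shape they expect, with keys taken from the code above and placeholder values only:

    import json

    example_conf = {
        "hdfs_app_path": "/user/spot",
        "dbname": "spotdb",
        "kafka": {
            "kafka_server": "kafka01.example.com",
            "kafka_port": "9092",
            "zookeper_server": "zk01.example.com",   # key is spelled this way in the code
            "zookeper_port": "2181"
        },
        "pipelines": {
            "proxy": {"type": "proxy"}
        }
    }
    print(json.dumps(example_conf, indent=2))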


[03/42] incubator-spot git commit: Incorrect assumption inbound/outbound crashes

Posted by na...@apache.org.
Incorrect assumption inbound/outbound crashes 

The code assumes that there are always inbound-only and outbound-only connections. This crashes when saving a scored threat that has either no inbound-only or no outbound-only connections. The fix simply moves the column header row outside the check, so the array element is always initialized before being accessed.

Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/3baa75aa
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/3baa75aa
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/3baa75aa

Branch: refs/heads/SPOT-181_ODM
Commit: 3baa75aaef6abdeaef3358acf502751cd5dbe919
Parents: 2ebe572
Author: castleguarders <ca...@users.noreply.github.com>
Authored: Tue Sep 26 14:29:14 2017 -0700
Committer: GitHub <no...@github.com>
Committed: Tue Sep 26 14:29:14 2017 -0700

----------------------------------------------------------------------
 spot-oa/api/resources/flow.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/3baa75aa/spot-oa/api/resources/flow.py
----------------------------------------------------------------------
diff --git a/spot-oa/api/resources/flow.py b/spot-oa/api/resources/flow.py
index ab5105f..418f87c 100755
--- a/spot-oa/api/resources/flow.py
+++ b/spot-oa/api/resources/flow.py
@@ -492,8 +492,8 @@ def create_incident_progression(anchor, inbound, outbound, twoway, date):
     }
 
     #----- Add Inbound Connections-------#
+    obj["children"].append({'name': 'Inbound Only', 'children': [], 'impact': 0})
     if len(inbound) > 0:
-        obj["children"].append({'name': 'Inbound Only', 'children': [], 'impact': 0})
         in_ctxs = {}
         for ip in inbound:
             if 'nwloc' in inbound[ip] and len(inbound[ip]['nwloc']) > 0:
@@ -509,8 +509,8 @@ def create_incident_progression(anchor, inbound, outbound, twoway, date):
                 })
 
     #------ Add Outbound ----------------#
+    obj["children"].append({'name':'Outbound Only','children':[],'impact':0})
     if len(outbound) > 0:
-        obj["children"].append({'name':'Outbound Only','children':[],'impact':0})
         out_ctxs = {}
         for ip in outbound:
             if 'nwloc' in outbound[ip] and len(outbound[ip]['nwloc']) > 0:
@@ -526,8 +526,8 @@ def create_incident_progression(anchor, inbound, outbound, twoway, date):
                 })
 
     #------ Add TwoWay ----------------#
+    obj["children"].append({'name':'two way','children': [], 'impact': 0})
     if len(twoway) > 0:
-        obj["children"].append({'name':'two way','children': [], 'impact': 0})
         tw_ctxs = {}
         for ip in twoway:
             if 'nwloc' in twoway[ip] and len(twoway[ip]['nwloc']) > 0:
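
A small illustration of why the reorder above matters, with made-up data: the header entry for each category must exist even when that category has no connections, otherwise later positional lookups on obj["children"] fail.

    obj = {'children': []}
    inbound = {}   # a scored threat with no inbound-only connections

    # Header row is appended unconditionally, as in the patched flow.py.
    obj['children'].append({'name': 'Inbound Only', 'children': [], 'impact': 0})
    if len(inbound) > 0:
        pass  # per-IP context entries would be appended here

    # Safe now: index 0 always holds the 'Inbound Only' bucket.
    assert obj['children'][0]['name'] == 'Inbound Only'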


[40/42] incubator-spot git commit: corrected data insertion

Posted by na...@apache.org.
corrected data insertion


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/f722127c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/f722127c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/f722127c

Branch: refs/heads/SPOT-181_ODM
Commit: f722127ca410a88355cc3f6d952739845c7499ce
Parents: a166efc
Author: tpltnt <tp...@dropcut.net>
Authored: Mon Feb 5 19:05:13 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Mon Feb 5 19:05:13 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/f722127c/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 01b9922..d476733 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -168,7 +168,7 @@ def save_data(rdd, sqc, db, db_table, topic):
         sqc.setConf("hive.exec.dynamic.partition", "true")
         sqc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
         hive_table = "{0}.{1}".format(db, db_table)
-        df.write.format("parquet").mode("append").insertInto(hive_table)
+        df.write.format("parquet").mode("append").partitionBy('y', 'm', 'd', 'h').insertInto(hive_table)
 
     else:
         print("------------------------LISTENING KAFKA TOPIC:{0}------------------------".format(topic))


[39/42] incubator-spot git commit: added docstring for complete script

Posted by na...@apache.org.
added docstring for complete script


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/a166efc3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/a166efc3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/a166efc3

Branch: refs/heads/SPOT-181_ODM
Commit: a166efc306ca041f668c9af7dfd65a94f006e454
Parents: 4ebf4be
Author: tpltnt <tp...@dropcut.net>
Authored: Thu Jan 25 12:45:02 2018 +0100
Committer: tpltnt <tp...@dropcut.net>
Committed: Thu Jan 25 12:45:02 2018 +0100

----------------------------------------------------------------------
 spot-ingest/pipelines/proxy/bluecoat.py | 4 ++++
 1 file changed, 4 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/a166efc3/spot-ingest/pipelines/proxy/bluecoat.py
----------------------------------------------------------------------
diff --git a/spot-ingest/pipelines/proxy/bluecoat.py b/spot-ingest/pipelines/proxy/bluecoat.py
index 541abb5..01b9922 100644
--- a/spot-ingest/pipelines/proxy/bluecoat.py
+++ b/spot-ingest/pipelines/proxy/bluecoat.py
@@ -1,3 +1,7 @@
+"""
+This script adds support for ingesting Bluecoat log files
+into Apache Spot.
+"""
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with