Posted to commits@spot.apache.org by ev...@apache.org on 2017/03/29 16:52:11 UTC

[47/50] [abbrv] incubator-spot git commit: CSV removal documentation update

CSV removal documentation update


Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/363c02d8
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/363c02d8
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/363c02d8

Branch: refs/heads/SPOT-35_graphql_api
Commit: 363c02d89b2f36d70b94d5e89b25018a58919f8c
Parents: 8bab8f0
Author: LedaLima <le...@apache.org>
Authored: Mon Mar 13 10:04:34 2017 -0600
Committer: Diego Ortiz Huerta <di...@intel.com>
Committed: Wed Mar 15 11:51:23 2017 -0700

----------------------------------------------------------------------
 spot-oa/oa/dns/README.md                        | 118 ++++++------
 spot-oa/oa/dns/ipynb_templates/EdgeNotebook.md  |  64 +++----
 .../dns/ipynb_templates/ThreatInvestigation.md  |  65 ++-----
 spot-oa/oa/flow/README.md                       | 123 +++++++------
 spot-oa/oa/flow/ipynb_templates/EdgeNotebook.md |  69 +++----
 .../flow/ipynb_templates/ThreatInvestigation.md | 181 +++----------------
 spot-oa/oa/proxy/README.md                      | 126 ++++++-------
 7 files changed, 283 insertions(+), 463 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/363c02d8/spot-oa/oa/dns/README.md
----------------------------------------------------------------------
diff --git a/spot-oa/oa/dns/README.md b/spot-oa/oa/dns/README.md
index aab8673..0d37435 100644
--- a/spot-oa/oa/dns/README.md
+++ b/spot-oa/oa/dns/README.md
@@ -1,6 +1,6 @@
 # DNS
 
-DNS sub-module extracts and transforms DNS (Domain Name Service) data already ranked by spot-ml and will load into csv files for presentation layer.
+DNS sub-module extracts and transforms DNS (Domain Name System) data already ranked by spot-ml and loads it into Impala tables for the presentation layer.
 
 ## DNS Components
 
@@ -15,30 +15,26 @@ DNS spot-oa main script executes the following steps:
 			ipython Notebooks: ipynb/dns/<date>/
 		
 		2. Creates a copy of the notebooks templates into the ipython Notebooks path and renames them removing the "_master" part from the name.
-		
+
 		3. Gets the dns_results.csv from the HDFS location according to the selected date, and copies it back to the corresponding data path.
-		 
+
 		4. Reads a given number of rows from the results file.
 
-		5. Gets the top level domain out of the dns_qry_name, and adds it in the new column 'tld' 
-		 
+		5. Gets the top level domain out of the dns_qry_name, and adds it in the new column 'tld'.
+
 		6. Checks reputation for the query_name of each connection.
-		 
+
 		7. Adds two new columns for the severity of the query_name and the client ip of each connection.
 
 		8. Adds a new column with the hour part of the frame_time.
-		 
-		9. Translates the 'dns_query_class', 'dns_query_type','dns_query_rcode' to human readable text according to the IANA specification. The translated values are stored in the dns_qry_class_name, dns_qry_type_name, dns_qry_rcode_name columns, respectively. 
-		 
+
+		9. Translates the 'dns_query_class', 'dns_query_type','dns_query_rcode' to human readable text according to the IANA specification. The translated values are stored in the dns_qry_class_name, dns_qry_type_name, dns_qry_rcode_name columns, respectively.
+
 		10. Adds Network Context.
-		
-		11. Saves dns_scores.csv file.
-		 
-		12. Creates a backup of dns_scores.csv file named dns_scores_bu.csv.
-		
-		13. Creates dns data details files.
-		
-		14. Creates dendrogram data files.
+
+		11. Saves results to the dns_scores table.
+
+		12. Generates details and dendrogram diagram data. These details include information about additional connections to display the details table in the UI.
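+
+		# Rough sketch of steps 5 and 9 above; the IANA mapping here is an
+		# abbreviated, illustrative subset, not the project's actual code:
+		QRY_TYPE_NAMES = {1: "A", 2: "NS", 5: "CNAME", 28: "AAAA"}
+
+		def get_tld(dns_qry_name):
+		    # keep the last label, e.g. "www.example.com" -> "com"
+		    return dns_qry_name.rstrip(".").split(".")[-1]
+
+		def translate_qry_type(code):
+		    return QRY_TYPE_NAMES.get(int(code), "Unknown")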
 
 
 **Dependencies**
@@ -51,9 +47,8 @@ DNS spot-oa main script executes the following steps:
 - [components/data](/spot-oa/oa/components#data)
 - [components/nc](/spot-oa/oa/components#network-context-nc)
 - [components/reputation](/spot-oa/oa/components/reputation)
-- dns_conf.json
-
-
+- dns_conf.json 
+ 
     
 **Prerequisites**
 
@@ -70,12 +65,12 @@ Before running DNS OA users need to configure components for the first time. It
 
 **Output**
 
-- dns_scores.csv: Main results file for DNS OA. This file will contain suspicious connects information and it's limited to the number of rows the user selected when running [oa/start_oa.py](/spot-oa/oa/INSTALL.md#usage).
- 
-		Schema with zero-indexed columns: 
-		
+- DNS suspicious connections. _dns\_scores_ table.
+
+Main results for DNS OA. The data stored in this table is limited by the number of rows the user selected when running [oa/start_oa.py](/spot-oa/oa/INSTALL.md#usage).
+  
 		0.frame_time: string		
-		1.frame_len: int		
+		1.unix_tstamp: bigint		
 		2.ip_dst: string		
 		3.dns_qry_name: string		
 		4.dns_qry_class: string		
@@ -84,49 +79,60 @@ Before running DNS OA users need to configure components for the first time. It
 		7.score: double	
 		8.tld: string		
 		9.query_rep: string		
-		10.hh: string		
-		11.ip_sev: int		
-		12.dns_sev: int		
-		13.dns_qry_class_name: string		
-		14.dns_qry_type_name: string		
-		15.dns_qry_rcode_name: string		
-		16.network_context: string		
-		17.unix_tstamp: bigint
+		10.hh: string	
+		11.dns_qry_class_name: string		
+		12.dns_qry_type_name: string		
+		13.dns_qry_rcode_name: string		
+		14.network_context: string	 
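+
+As a hedged illustration of how this table could be queried (host, port and database are assumptions; shown with the impyla package, though any Impala client works):
+
+        from impala.dbapi import connect
+
+        conn = connect(host="impala-daemon-host", port=21050)  # assumed connection details
+        cursor = conn.cursor()
+        cursor.execute("SELECT dns_qry_name, score FROM dns_scores ORDER BY score LIMIT 10")
+        for dns_qry_name, score in cursor.fetchall():
+            print dns_qry_name, score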
 
-- dns_scores_bu.csv: The backup file of suspicious connects in case user wants to roll back any changes made during analysis. Schema is same as dns_scores.csv.
 
+- DNS dendrogram data. _dns\_dendro_ table.
 
-- dendro-\<DNS query name>.csv: One file for each source IP. This file includes information about all the queries made to a particular DNS query name. The number of retrieved rows is limited by the value of "\_details\_limit" parameter
-
-		Schema with zero-indexed columns:
-		
-		0.dns_a: string		
-		1.dns_qry_name: string		
-		2.ip_dst: string
+One record set for each source IP, including information about all the queries made to a particular DNS query name. The number of retrieved rows is limited by the value of the "\_details\_limit" parameter.
+ 
+		0.unix_tstamp bigint 
+		1.dns_a string
+		2.dns_qry_name string
+		3.ip_dst string 
 
-- edge-\<DNS query name>_\<HH>_00.csv: One file for each DNS query name for each hour of the day. This file contains details for each
-connection between DNS and source IP.
 
-		Schema with zero-indexed columns:
-		
-		0.frame_time: string		
-		1.frame_len: int		
-		2.ip_dst: string		
-		3.ip_src: string		
-		4.dns_qry_name: string		
-		5.dns_qry_class_name: string		
-		6.dns_qry_type_name: string		
-		7.dns_qry_rcode_name: string		
-		8.dns_a: string
+- DNS details. _dns\_edge_ table.
 
+One record set for each DNS query name for each hour of the day, containing details for each
+connection between the DNS query name and the source IP.
+ 
+		0.unix_tstamp bigint
+		1.frame_len bigint
+		2.ip_dst string
+		3.ip_src string
+		4.dns_qry_name string
+		5.dns_qry_class string
+		6.dns_qry_type int
+		7.dns_qry_rcode int
+		8.dns_a string
+		9.hh int
+		10.dns_qry_class_name string
+		11.dns_qry_type_name string
+		12.dns_qry_rcode_name string
+		13.network_context string
+
+
+- DNS Ingest summary. _dns\_ingest\_summary_ table.
+
+This table is populated with the number of connections ingested per minute during that day.
+
+        Table schema:
+        0. tdate:      string
+        1. total:      bigint 
+ 
 
 ###dns_conf.json
-This file is part of the initial configuration for the DNS pipeline. It will contain mapped all the columns included in the dns_results.csv and dns_scores.csv files.
+This file is part of the initial configuration for the DNS pipeline. It contains a mapping of all the columns included in the _dns\_edge_ and _dns\_dendro_ tables.
 
 This file contains three main arrays:
 
 	-  dns_results_fields: Reference of the column name and indexes in the dns_results.csv file.	 
-	-  dns_score_fields:  Reference of the column name and indexes in the dns_scores.csv file.	
+	-  dns_score_fields: Reference of the column names and indexes in the _dns\_edge_ table.
 	-  add_reputation: According to the dns_results.csv file, this is the column index of the value which will be evaluated using the reputation services.
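+
+A minimal sketch of how these arrays could be consumed (key names from the list above; everything else is illustrative):
+
+        import json
+
+        with open("dns_conf.json") as conf_file:
+            conf = json.load(conf_file)
+
+        score_fields = conf["dns_score_fields"]   # column name/index references
+        rep_index = conf["add_reputation"]        # column evaluated by the reputation services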
 
 

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/363c02d8/spot-oa/oa/dns/ipynb_templates/EdgeNotebook.md
----------------------------------------------------------------------
diff --git a/spot-oa/oa/dns/ipynb_templates/EdgeNotebook.md b/spot-oa/oa/dns/ipynb_templates/EdgeNotebook.md
index 6b1579e..f9e8a4a 100644
--- a/spot-oa/oa/dns/ipynb_templates/EdgeNotebook.md
+++ b/spot-oa/oa/dns/ipynb_templates/EdgeNotebook.md
@@ -7,70 +7,50 @@
 
 The following python modules will be imported for the notebook to work correctly:    
 
-        import urllib2  
-        import json  
-        import os  
-        import csv  
+        import urllib2
+        import json
+        import os 
+        import datetime  
+        import subprocess 
         import ipywidgets #For jupyter/ipython >= 1.4  
         from IPython.html import widgets # For jupyter/ipython < 1.4  
         from IPython.display import display, HTML, clear_output, Javascript   
-        import datetime  
-        import subprocess 
-
+        
 
 ###Pre-requisites
-- Execution of the spot-oa process for DNS
-- Correct setup the spot.conf file. [Read more](/wiki/Edit%20Solution%20Configuration) 
-- Have a public key authentication between the current UI node and the ML node. [Read more](/wiki/Configure%20User%20Accounts#configure-user-accounts)
+- Execute the hdfs_setup.sh script to create OA tables and set up permissions
+- Correct setup of the spot.conf file. [Read more](http://spot.incubator.apache.org/doc/#configuration)
+- Execution of the spot-oa process for DNS
+- Correct installation of the UI. [Read more](/ui/INSTALL.md)
 
 
 ##Data source
-
-The whole process in this notebook depends entirely on the existence of `dns_scores.csv` file, which is generated at the OA process.  
-The data is directly manipulated on the .csv files, so a `dns_scores_bu.csv` is created as a backup to allow the user to restore the original data at any point, 
-and this can be performed executing the last cell on the notebook with the following command:  
-
-        !cp $sconnectbu $sconnect
+The whole process in this notebook depends entirely on the existence of the `dns_scores` table in the database.  
+The data is manipulated through the graphql api that is also included in the repository.
 
 
 **Input files**  
-All these paths should be relative to the main OA path.    
-Schema for these files can be found [here](/spot-oa/oa/dns)
-
-        data/dns/<date>/dns_scores.csv  
-        data/dns/<date>/dns_scores_bu.csv
+The data to be processed should be stored in the following tables:
 
-**Temporary Files**
+        dns_scores
+        dns
 
-        data/dns/<date>/score_tmp.csv
 
-**Output files**
+**Output**
+The following tables will be populated after the scoring process:
+
+        dns_threat_investigation
 
-        data/dns/<date>/dns_scores.csv  (Updated with severity values)
-        data/dns/<date>/dns_scores_fb.csv (File with scored connections that will be used for ML feedback)
 
 ###Functions
 **Widget configuration**
 This is not a function, but more like global code to set up styles and widgets to format the output of the notebook. 
 
-`data_loader():` - This function loads the source file into a csv dictionary reader with all suspicious unscored connections, creating separated lists for 
+`data_loader():` - This function calls the graphql api query *suspicious* to list all suspicious unscored connections, creating separate lists for 
 the 'client_ip' and 'dns_qry_name'.
  Also displays the widgets for the listboxes, textbox, radiobutton list and the 'Score' and 'Save' buttons.  
   
-`fill_list(list_control,source):` - This function loads the given dictionary into a listbox and appends an empty item at the top with the value '--Select--' (Just for design sake)
-
-` assign_score(b):` - This function is executed on the onclick event of the \u2018Score\u2019 button. The system will first try to get the value from the 'Quick search' textbox ignoring the selections from the listboxes; in case the textbox is empty, it will then
- get the selected values from the 'Client IP' and 'Query' listboxes to then search through the dns_scores.csv file to find matching values. 
-A linear search on the file is then performed:  
-The value in the 'Quick Scoring' textbox, will be compared against the `dns_qry_name` column. Partial matches will be considered as a positive match and the `dns_sev` column will be updated to the value selected from the radiobutton list.   
-The column `ip_dst` will be compared against the 'Client IP' selected value; if a match is found, the `ip_sev` column will be updated to the value selected from the radiobutton list.   
-The column `dns_qry_name` will be compared against the 'Query' selected value; if a match is found, the `dns_sev` column will be updated to the value selected from the radiobutton list.     
-Every row will be appended to the `dns_scores_tmp.csv` file. This file will replace the original `dns_scores.csv` at the end of the process.  
-
-Only the scored rows will also be appended to the `dns_scores_fb.csv` file, which will later be used for the ML feedback.
+`fill_list(list_control,source):` - This function loads the given dictionary into a listbox widget.
 
-`save(b):` - This event is triggered by the 'Save' button, and executes javascript functions to refresh the data on all the panels in Suspicious Connects. Since the data source file has been updated, the scored connections will be removed from all
-the panels, since those panels will only display connections where the `dns_sev` value is zero.
-This function also removes the widget panel and reloads it again to update the results, removing the need of a manual refresh, and calls the `ml_feedback():` function.
+`assign_score(b):` - This function is executed on the onclick event of the 'Score' button. The system will first try to get the value from the 'Quick search' textbox, ignoring the selections from the listboxes; in case the textbox is empty, it will then get the selected values from the 'Client IP' and 'Query' listboxes to append them to a temporary list. 
 
-`ml_feedback():` - A shell script is executed, transferring thru secure copy the _proxy_scores_fb.csv_ file into ML Master node, where the destination path is defined at the spot.conf file.
+`save(b):` - This event is triggered by the 'Save' button, and executes javascript functions to refresh the data on all the panels in Suspicious Connects. This function calls the *score* mutation which updates the score for the selected values in the database.
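+
+As a rough illustration only, the *score* mutation could be invoked like the sketch below (the endpoint URL, mutation text and field names are assumptions; the real schema lives in the repository's graphql module):
+
+        import urllib2, json
+
+        query = """mutation($input: [DnsScoreInputType!]!) {
+          dns { score(input: $input) { success } }
+        }"""
+        variables = {"input": [{"date": "2017-03-13", "dnsQuery": "example.com", "score": 1}]}
+        payload = json.dumps({"query": query, "variables": variables})
+        request = urllib2.Request("http://localhost:8889/graphql", payload,
+                                  {"Content-Type": "application/json"})
+        print urllib2.urlopen(request).read()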

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/363c02d8/spot-oa/oa/dns/ipynb_templates/ThreatInvestigation.md
----------------------------------------------------------------------
diff --git a/spot-oa/oa/dns/ipynb_templates/ThreatInvestigation.md b/spot-oa/oa/dns/ipynb_templates/ThreatInvestigation.md
index 281ff07..db8bdbd 100644
--- a/spot-oa/oa/dns/ipynb_templates/ThreatInvestigation.md
+++ b/spot-oa/oa/dns/ipynb_templates/ThreatInvestigation.md
@@ -21,60 +21,31 @@ The following python modules will have to be imported for the notebook to work c
 
 ##Pre-requisites  
 - Execution of the spot-oa process for DNS 
-- Score a set connections in the Edge Investigation Notebook
-- Correct setup of the spot.conf file. [Read more](/wiki/Edit%20Solution%20Configuration) 
-
+- Correct installation of the UI. [Read more](/ui/INSTALL.md)
+- Score a set of connections in the Edge Investigation Notebook
+- Correct setup of the spot.conf file. [Read more](/wiki/Edit%20Solution%20Configuration)
 
 ##Additional Configuration  
 `top_results` - This value defines the number of rows that will be displayed onscreen after the expanded search. 
 
-
 ##Data source 
-The whole process in this notebook depends entirely on the existence of the scored `dns_scores.csv` file, which is generated at the OA process, and scored at the Edge Investigation Notebook.
- 
-**Input files**
-All these paths should be relative to the main OA path.       
-Schema for these files can be found [here](/spot-oa/oa/DNS)   
-
-        data/dns/<date>/dns_scores.csv  
-
-**Output files**
-
-- threats.csv : Pipe separated file containing the comments saved by the user. This file is updated every time the user adds comments for a new threat. 
-        
-        Schema with zero-indexed columns:
-        
-        0.ip_dst : string
-        0.dns_qry_name : string
-        1.title: string
-        2.description: string
-
-- threat-dendro-\<anchor>.csv : Comma separated file generated in base of the results from the expanded 
-search query. This file includes a list of connections involving the DNS or IP selected from the list. 
-These results are limited to the day under analysis. 
+Data should exist in the following tables:
+
+        *dns*
+        *dns_threat_investigation*
 
-        
-        Schema with zero-indexed columns:
-
-        0.total: int  
-        1.dns_qry_name: string
-        2.ip_dst: string
-        3.sev: int
-
-
-**HDFS tables consumed**  
-
-        dns
+**Output**
+The following tables will be populated after the threat investigation process:
+
+        *dns_storyboard*
+        *dns_threat_dendro*
 
 ##FUNCTIONS  
 
 **Widget configuration**
 This is not a function, but more like global code to set up styles and widgets to format the output of the notebook. 
 
-`start_investigation():` - This function cleans the notebook from previous executions, then calls the data_loader() function to obtain the data and afterwards displays the corresponding widgets
+`start_investigation():` - This function cleans the notebook from previous executions.
 
-`data_loader():` - This function loads the _dns_scores.csv_ file into a csv dictionary reader to find all `ip_dst` values where `ip_sev` = 1, and the `dns_qry_name` where `dns_sev` = 1, merging both 
-lists into a dictionary to populate the 'Suspicious DNS' listbox, through the _fill_list()_ function.
+`data_loader():` - This function calls the *threats* query to get the `ip_dst` and `dns_qry_name` values previously scored as high risk, merging both lists into a single dictionary to populate the 'Suspicious DNS' listbox through the _fill_list()_ function.
 
 `display_controls(ip_list):` - This function will only display the main widget box, containing:
 - "Suspicious URI" listbox
@@ -82,19 +53,15 @@ lists into a dictionary to populate the 'Suspicious DNS' listbox, through the _f
 - Container for the "Threat summary" and "Title" text boxes
 - Container for the "Top N results" HTML table
 
-`fill_list(list_control,source):` - This function populates a listbox widget with the given data dictionary and appends an empty item at the top with the value '--Select--' (Just for visualization  sake)
+`fill_list(list_control,source):` - This function populates a listbox widget with the given data dictionary and appends an empty item at the top with the value '--Select--' (Just for visualization sake)
 
-`search_ip(b):` - This function is triggered by the onclick event of the "Search" button. This will get the selected value from the listbox and perform a query to the _dns_ table to retrieve all comunication involving that IP/Domain during the day with any other IPs or Domains. 
-The output of the query will automatically be stored in the _threat-dendro-&lt;threat&gt;.csv_ file.  
-Afterwards it will read through the output file to display the HTML table, and the results displayed will be limited by the value set in the _top_results_ variable, 
-ordered by amount of connections, listing the most active connections first.
+`search_ip(b):` - This function is triggered by the onclick event of the "Search" button. This calls the graphql *threat / details* query to find additional connections involving the selected IP or query name. 
+The results will be displayed in the HTML table, ordered by amount of connections, listing the most active connections first.
Here the "display_threat_box()" function will be invoked. 
 
 `display_threat_box(ip):` - Generates and displays the widgets for "Title" and "Comments" textboxes and the "Save" button on the notebook.
 
 `save_threat_summary(b):` - This function is triggered by the _onclick_ event on the 'Save' button.
  This will take the contents of the form and create/update the _threats.csv_ file.
- 
-`file_is_empty(path):` - Performs a validation to check the file size to determine if it is empty.
- 
+
 `removeWidget(index):` - Javascript function that removes a specific widget from the notebook. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/363c02d8/spot-oa/oa/flow/README.md
----------------------------------------------------------------------
diff --git a/spot-oa/oa/flow/README.md b/spot-oa/oa/flow/README.md
index 4b3dbc1..005da57 100644
--- a/spot-oa/oa/flow/README.md
+++ b/spot-oa/oa/flow/README.md
@@ -1,6 +1,6 @@
 # **Flow OA**
  
-Flow sub-module extracts and transforms Flow data already ranked by spot-ml and will load into csv files for presentation layer.
+Flow sub-module extracts and transforms Flow data already ranked by spot-ml and loads it into Impala tables for the presentation layer.
 
 ## **Flow OA Components**
 
@@ -13,12 +13,16 @@ Flow spot-oa main script executes the following steps:
                 ipython Notebooks: ipynb/flow/<date>/
 
     2. Creates a copy of iPython notebooks out of templates in ipynb_templates folder into output folder.
+
     3. Reads Flow spot-ml results for a given date and loads only the requested limit.
+
     4. Add network context to source and destination IPs.
+
     5. Add geolocation to source and destination IPs.
-    6. Saves transformed data into a new csv file, this file is called flow_scores.csv.
-    7. Creates details, and chord diagram files. These details include information about each suspicious connection and some additional information
-       to draw chord diagrams.
+
+    6. Stores transformed data in the selected database.
+
+    7. Generates details and chord diagram data. These details include information about additional connections, plus the data needed to draw chord diagrams in the UI.
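+
+    # Rough sketch of steps 4 and 5 above; ranges and names are hypothetical,
+    # the real lookup uses the configured network context and iploc files:
+    import bisect
+
+    NETWORK_CONTEXT = [(167772160, "DMZ"), (3232235520, "Office LAN")]  # (start_ip_as_int, name), sorted
+
+    def ip_to_int(ip):
+        o = [int(x) for x in ip.split(".")]
+        return (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3]
+
+    def get_network_context(ip):
+        idx = bisect.bisect_right([s for s, _ in NETWORK_CONTEXT], ip_to_int(ip)) - 1
+        return NETWORK_CONTEXT[idx][1] if idx >= 0 else ""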
 
 **Dependencies**
 
@@ -48,37 +52,37 @@ Before running Flow OA users need to configure components for the first time. It
 
 **Output**
 
-- flow_scores.csv. Main results file for Flow OA. This file will contain suspicious connects information and it's limited to the number of rows the user selected when running [oa/start_oa.py](/spot-oa/oa/INSTALL.md#usage).
+- Flow suspicious connections. _flow\_scores_ table.  
+
+Main results for Flow OA. The data stored in this table is limited by the number of rows the user selected when running [oa/start_oa.py](/spot-oa/oa/INSTALL.md#usage).
        
-        Schema with zero-indexed columns:
-        0.   sev:            int
-        1.   tstart:         string
-        2.   srcIP:          string
-        3.   dstIP:          string
-        4.   sport:          int
-        5.   dport:          int
-        6.   proto:          string
-        7.   ipkt:           bigint
-        8.   ibyt:           bigint
-        9.   opkt:           bigint
-        10.  obyt:           bigint
-        11.  score:          double
-        12.  rank:           int
-        13.  srcIpInternal:  bit
-        14.  destIpInternal: bit
-        15.  srcGeo:         string
-        16.  dstGeo:         string
-        17.  srcDomain:      string
-        18.  dstDomain:      string
-        19.  srcIP_rep:      string
-        20.  dstIP_rep:      string
-
-
-- flow_scores_bu.csv. Backup file for flow_scores.csv in case user needs to roll back the scoring or the changes made during analysis. Schema it's same as flow_scores.csv.
-
-- edge-\<source IP>-\<destination IP>-\<HH>-\<MM>.tsv. Edge files. One for each suspicious connection containing the details for each comunication occurred during the same specific minute between source IP and destination IP.
-
-        Schema with zero-indexed columns:
+        Table schema:
+        0.   tstart:         string
+        1.   srcip:          string
+        2.   dstip:          string
+        3.   sport:          int
+        4.   dport:          int
+        5.   proto:          string
+        6.   ipkt:           int
+        7.   ibyt:           int
+        8.   opkt:           int
+        9.   obyt:           int
+        10.  score:          float
+        11.  rank:           int
+        12.  srcip_internal:  bit
+        13.  destip_internal: bit
+        14.  src_geoloc:      string
+        15.  dst_geoloc:      string
+        16.  src_domain:      string
+        17.  dst_domain:      string
+        18.  src_rep:         string
+        19.  dst_rep:         string
+
+-  Flow details. _flow\_edge_ table.
+
+A query will be executed for each suspicious connection detected, to find the details of each connection that occurred during the same specific minute between the given source IP and destination IP.
+
+        Table schema:
         0.  tstart:     string
         1.  srcip:      string
         2.  dstip:      string
@@ -87,30 +91,47 @@ Before running Flow OA users need to configure components for the first time. It
         5.  proto:      string
         6.  flags:      string
         7.  tos:        int
-        8.  bytes:      bigint
-        9.  pkts:       bigint
-        10. input:      int
-        11. output:     int
-        12. rip:        string
-
-- chord-\<client ip>.tsv. Chord files. One for each distinct client ip. These files contain the sum of input packets and bytes transferred between the client ip and every other IP it connected to.
-
-        Schema with zero-indexed columns:
-        0.  srcip:      string
-        1.  dstip:      string
-        2.  ibytes:     bigint
-        3.  ipkts:      double
-        
+        8.  ibyt:       bigint
+        9.  ipkt:       bigint
+        10.  pkts:      bigint
+        11. input:      int
+        12. output:     int
+        13. rip:        string
+        14. obyt:       bigint
+        15. opkt:       bigint
+        16. hh:         int
+        17. md:         int         
+
+- Flow Chord Diagrams.  _flow\_chords_ table.
+
+A query will be executed for each distinct client IP that has connections to 2 or more other suspicious IPs. This query will retrieve the sum of input packets and bytes transferred between the client IP and every other suspicious IP it connected to.
+
+        Table schema:
+        0. ip_threat:  string
+        1. srcip:      string
+        2. dstip:      string
+        3. ibyt:       bigint
+        4. ipkt:       bigint
+
+
+- Flow Ingest summary. _flow\_ingest\_summary_ table.
+
+This table is populated with the number of connections ingested per minute during that day.
+
+        Table schema:
+        0. tdate:      string
+        1. total:      bigint 
+
+
 ### flow_config.json
 
 Flow spot-oa configuration. Contains columns name and index for input and output files.
 This Json file contains 3 main arrays:
    
     - flow_results_fields: list of column name and index of ML flow_results.csv file. Flow OA uses this mapping to reference columns by name.
-    - column_indexes_filter: the list of indices to take out of flow_results_fields for OA process. 
+    - column_indexes_filter: the list of indices to take out of flow_results_fields for the OA process. 
     - flow_score_fields: list of column name and index for flow_scores.csv. After the OA process completes more columns are added.
-        
-
+    
 
 ### ipynb_templates
 Templates for iPython notebooks.

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/363c02d8/spot-oa/oa/flow/ipynb_templates/EdgeNotebook.md
----------------------------------------------------------------------
diff --git a/spot-oa/oa/flow/ipynb_templates/EdgeNotebook.md b/spot-oa/oa/flow/ipynb_templates/EdgeNotebook.md
index 272e2bc..973a463 100644
--- a/spot-oa/oa/flow/ipynb_templates/EdgeNotebook.md
+++ b/spot-oa/oa/flow/ipynb_templates/EdgeNotebook.md
@@ -8,56 +8,53 @@
 
 The following python modules will be imported for the notebook to work correctly:    
 
+        import datetime
         import struct, socket
         import shutil
         import numpy as np
         import pandas as pd
         import linecache, bisect
-        import csv
+        import csv, json
         import operator
-        import os, time, subprocess
+        import os, time, subprocess 
+        from collections import OrderedDict
         import ipywidgets #For jupyter/ipython >= 1.4  
         from IPython.html import widgets #For jupyter/ipython < 1.4  
-        from IPython.display import display, HTML, clear_output, Javascript   
+        from IPython.display import display, Javascript, clear_output   
 
 
 ###Pre-requisites
+- Execute the hdfs_setup.sh script to create OA tables and set up permissions
+- Correct setup of the spot.conf file. [Read more](http://spot.incubator.apache.org/doc/#configuration)
 - Execution of the spot-oa process for Flow
-- Correct setup the spot.conf file. [Read more](/wiki/Edit%20Solution%20Configuration)
-- Have a public key created between the current UI node and the ML node. [Read more](/wiki/Configure%20User%20Accounts#configure-user-accounts)
+- Correct installation of the UI [Read more](/ui/INSTALL.md)
 
 
-##Additional Configuration
+##Additional Configuration inside the notebook
 `coff` - This value defines the max number of records used to populate the listbox widgets. This value is set by default on 250.
 `nwloc` - File name of the custom network context.  
 
-###Data source
-The whole process in this notebook depends entirely on the existence of `flow_scores.csv` file, which is generated at the OA process at the path.  
-The data is directly manipulated on the .csv files, so a `flow_scores_bu.csv` on the same path is created as a backup to allow the user to restore the original data at any point, 
-and this can be performed executing the last cell on the notebook with the following command:  
-
-        !cp $sconnectbu $sconnect
 
+###Data source
+The whole process in this notebook depends entirely on the existence of the `flow_scores` table in the database.  
+The data is manipulated through the graphql api that is also included in the repository.
 
-**Input files**  
-All these paths should be relative to the main OA path.    
-Schema for these files can be found [here](/spot-oa/oa/flow)
 
-        data/flow/<date>/flow_scores.csv
-        data/flow/<date>/flow_scores_bu.csv
+**Input**  
+The data to be processed should be stored in the following tables:
 
-**Temporary Files**
+        flow_scores
+        flow
 
-        data/flow/<date>/flow_scores.csv.tmp
 
-**Output files**
+**Output**
+The following tables will be populated after the scoring process:
+
+        flow_threat_investigation
 
-        data/flow/<date>/flow_scores.csv (Updated with severity values)
-        data/flow/<date>/flow_scores_fb.csv (File with scored connections that will be used for ML feedback)
 
 ##Functions 
  
-`displaythis():` - This function reads the `flow_scores.csv` file to list all suspicious unscored connections, creating separated lists for:
+`data_loader():` - This function calls the graphql api query *suspicious* to list all suspicious unscored connections, creating separate lists for:
 - Source IP
 - Destination IP
 - Source port
@@ -69,29 +66,7 @@ Each of these lists will populate a listbox widget and then they will be display
 
 `update_sconnects(b):` -   
This function is executed on the onclick event of the 'Assign' button. The notebook will first try to get the value from the 'Quick IP Scoring' textbox, ignoring the selections from the listboxes; in case the textbox is empty, it will then
- get the selected values from each of the listboxes to look them up in the `flow_scores.csv` file. 
-A binary search on the file is then performed:  
-- The value in the 'Quick IP Scoring' textbox, will be compared against the `ip_src` and `ip_dst` columns; if either column is a match, the `sev` column will be updated with the value selected from the radiobutton list. 
-- The column `srcIP` will be compared against the 'Source IP' selected value.  
-- The column `dstIP` will be compared against the 'Dest IP' selected value. 
-- The column `sport` will be compared against the 'Src Port' selected value.
-- The column `dport` will be compared against the 'Dst Port' selected value.  
-
-Every row will be then appended to the `flow_scores.csv.tmp` file, which will replace the original `flow_scores.csv` at the end of the process.
-The scored rows will also be appended to the `flow_scores_fb.csv` file, which will later be used for the ML feedback.   
-
-`set_rules():` - Predefined function where the user can define custom rules to be initally applied to the dataset. By default this function is commented out.
-
-`create_feedback_file(scored_rows):` - Appends the updated rows to the _flow_scores_fb.csv_ everytime a connection is scored. This file is used as feedback for the ML process.
-
-`apply_rules(rops,rvals,risk):` - This function applies the rules defined by `set_rules()` and updates the `flow_scores.csv` file following a similar process to the `update_sconnects()` function. By default this function is commented out.
-
-`attack_heuristics():` - This function is executed at the start, and loads the data from `flow_scores.csv` into a pandas dataframe grouped by `srcIp` column,
-to then print only those IP's that connect to more than 20 other different IP's. By default this function is commented out.
+ get the selected values from each of the listboxes and append them to a temporary list. 
 
 `savesort(b):` - This event is triggered by the 'Save' button, and executes javascript functions to refresh the data on all the panels in Suspicious Connects.  
-This function also reorders the _flow_scores.csv_ file by moving all scored connections to the end of the file and sorting the remaining connections by `lda_score` column.    
-Finally, removes the widget panel and reloads it again to update the results, removing the need of a manual refresh, and calls the `ml_feedback():` function.    
-
-`ml_feedback():` - A shell script is executed, transferring thru secure copy the _flow_scores_fb.csv_ file into ML Master node, where the destination path is defined at the spot.conf file.
-   
\ No newline at end of file
+This function calls the *score* mutation which updates the score for the selected values in the database.
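+
+For illustration, the *suspicious* query used by `data_loader():` might be called as in this sketch (endpoint URL and field names are assumptions, not the exact schema):
+
+        import urllib2, json
+
+        query = """query($date: SpotDateType) {
+          flow { suspicious(date: $date) { srcIp dstIp score } }
+        }"""
+        payload = json.dumps({"query": query, "variables": {"date": "2017-03-13"}})
+        request = urllib2.Request("http://localhost:8889/graphql", payload,
+                                  {"Content-Type": "application/json"})
+        print json.loads(urllib2.urlopen(request).read())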

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/363c02d8/spot-oa/oa/flow/ipynb_templates/ThreatInvestigation.md
----------------------------------------------------------------------
diff --git a/spot-oa/oa/flow/ipynb_templates/ThreatInvestigation.md b/spot-oa/oa/flow/ipynb_templates/ThreatInvestigation.md
index fdbcad0..787d955 100644
--- a/spot-oa/oa/flow/ipynb_templates/ThreatInvestigation.md
+++ b/spot-oa/oa/flow/ipynb_templates/ThreatInvestigation.md
@@ -7,6 +7,7 @@
 
 The following python modules will have to be imported for the notebook to work correctly:  
 
+        import datetime
         import struct, socket
         import numpy as np
         import linecache, bisect
@@ -14,6 +15,7 @@ The following python modules will have to be imported for the notebook to work c
         import operator
         import json
         import os
+        import pandas as pd
         import ipywidgets as widgets # For jupyter/ipython >= 1.4
         from IPython.html import widgets
         from IPython.display import display, Javascript, clear_output
@@ -21,149 +23,46 @@ The following python modules will have to be imported for the notebook to work c
 
 ##Pre-requisites  
 - Execution of the spot-oa process for Flow
+- Correct installation of the UI. [Read more](/ui/INSTALL.md)
 - Score a set connections at the Edge Investigation Notebook 
 - Correct setup the spot.conf file. [Read more](/wiki/Edit%20Solution%20Configuration) 
 - Include a comma separated network context file. **Optional** [Schema](/spot-oa/oa/components/README.md#network-context-nc)
-- Include a geolocation database file. [Schema](/spot-oa/oa/components/README.md#geoloc)   
+- Include a geolocation database file.  **Optional** [Schema](/spot-oa/oa/components/README.md#geoloc)   
 
 
-##Additional Configuration
+##Additional Configuration inside the notebook
 `top_results` - This value defines the number of rows that will be displayed onscreen after the expanded search. 
 This also affects the number of IPs that will appear in the Timeline chart.
 
 
 ##Data source  
-The whole process in this notebook depends entirely on the existence of the scored _flow_scores.csv_ file, which is generated at the OA process, and scored at the Edge Investigation Notebook.
+Data should exist in the following tables:
+
+        *flow*
+        *flow_threat_investigation*
+
 
 **Input files**  
 All these paths should be relative to the main OA path.    
 Schema for these files can be found here:
-
-[flow_scores.csv](/spot-oa/oa/flow)  
+ 
 [iploc.csv](/spot-oa/oa/components/README.md#geoloc)  
 [networkcontext_1.csv](/spot-oa/oa/components/README.md#network-context-nc)  
   
-
-        data/flow/<date>/flow_scores.csv  
+ 
         context/iploc.csv
         context/networkcontext_1.csv
  
 
-**Output files**  
-- threats.csv : Pipe separated file containing the comments saved by the user. This file is updated every time the user adds comments for a new threat. 
-        
-        Schema with zero-indexed columns:
-        
-        0.ip: string
-        1.title: string
-        2.description: string
- 
-- sbdet-\<ip>.tsv : Tab separated file, this file lists all the client IP's that connected to the IP under investigation, including: 
-the duration of the connection, response code and exact date and time of each the connection.  
-        Schema with zero-indexed columns:
-        
-        0.tstart: string
-        1.tend: string
-        2.srcip	: string
-        3.dstip : string
-        4.proto : string
-        5.sport : string
-        6.dport : string
-        7.pkts : string
-        8.bytes : string
-
-
-- globe-\<ip>.json : Json file including the geolocation of all the suspicious connections related to the IP under investigation. 
-                Schema:
-
-                {
-                        "destips": [{
-                                "geometry": {
-                                        "type": "Point",
-                                        "coordinates": <[Lat, Long] values from the geolocation database>
-                                },
-                                "type": "Feature",
-                                "properties": {
-                                        "ip": "<dst IP>",
-                                        "type": <1 for Inbound, 2 for Outbound, 3 for Two way>,
-                                        "location": "<Host name provided by the geolocation db>"
-                                },
-                                ......
-                        }
-                        }],
-                        "type": "FeatureCollection",
-                        "sourceips": [{
-                                "geometry": {
-                                        "type": "Point",
-                                        "coordinates": <[Lat, Long] values from the geolocation database>
-                                },
-                                "type": "Feature",
-                                "properties": {
-                                        "ip": "<src ip>",
-                                        "type": <1 for Inbound, 2 for Outbound, 3 for Two way>,
-                                        "location": "<Host name provided by the geolocation db>"
-                                },
-                                ......
-                        }]
-                }
- 
-
-- stats-\<ip>.json: Json file containing the count of connections of each kind made to/from the suspicious IP.
-                Schema:
-
-                {
-                        "size": <total of connections>,
-                        "name": "<Name of suspicious IP, according to the network context",
-                        "children": [{
-                                "size": <Total number of Inbound connections>,
-                                "name": "Inbound Only", 
-                                "children": 
-                                        [{
-                                                "name": "<Context name>",
-                                                "size": <Number of connections>
-                                        }, ...
-                                        ]
-                                },
-                                {"size": <Total number of Outbound connections>,
-                                 "name": "Outbound Only", 
-                                 "children": 
-                                        [{
-                                                "name": "<Context name>",
-                                                "size": <Number of connections>
-                                        }, ...
-                                        ]
-                                }, 
-                                {"size": <Total number of Two way connections>,
-                                 "name": "two way",
-                                 "children":
-                                        [{
-                                                "name": "<Context name>",
-                                                "size": <Number of connections>
-                                        }, ...
-                                        ]
-                                }]
-                        }
-  
-
- - threat-dendro-\<ip>.json : Json file including the breakdown of the connections performed by the suspicious IP.  
-
-                Schema: 
-
-                {"time": "date in YYYYMMDD format>",
-                 "name": "<suspicious IP>",
-                 "children": [{
-                        "impact": 0,
-                        "name": "<Type of connections>", 
-                        "children": [
-                                <Individual connections named after the network context>
-                                ]
-                        }]
-                }
+**Output**  
+The following tables will be populated after the threat investigation process:
+
+        *flow_storyboard*
+        *flow_timeline*
 
-**HDFS tables consumed**  
-
-                flow
-   
+The following files will be created and stored in HDFS.
+ 
+        globe-\<ip>.json
+        stats-\<ip>.json
+        threat-dendro-\<ip>.json
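+
+As a hedged sketch (all paths assumed) of how one of these files could be generated and pushed to HDFS with the modules imported above:
+
+        import json, subprocess
+
+        payload = {"name": "10.0.0.1", "children": []}        # illustrative content
+        local_path = "/tmp/threat-dendro-10.0.0.1.json"
+        with open(local_path, "w") as out_file:
+            json.dump(payload, out_file)
+        # destination path is an assumption; spot.conf defines the real HDFS layout
+        subprocess.call(["hdfs", "dfs", "-put", "-f", local_path, "/user/spot/flow/storyboard/"])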
 
 ##FUNCTIONS  
 
@@ -171,19 +70,13 @@ the duration of the connection, response code and exact date and time of each th
 
 This is not a function, but more like global code to set up styles and widgets to format the output of the notebook.   
 
-`start_investigation():` - This function cleans the notebook from previous executions, then loops through the _flow_scores.csv_ file to get the 'srcIp' and 'dstIP' values from connections scored as high risk (sev = 1), ignoring IPs
-already saved in the _threats.csv_ file. 
+`start_investigation():` - This function cleans the notebook from previous executions, and calls the *threats* query to get the source and destination IP's previously scored as high risk. 
 
 `display_controls(threat_list):` - This function will display the ipython widgets with the listbox of high risk IP's and the "Search" button.
 
-`search_ip()` - This function is triggered by the onclick event of the "Search" button after selecting an IP from the listbox. This will perform a query to the _flow_ table to find all connections involving the selected IP.
- The results are stored in the _ir-\<ip>.tsv_ file. If the file is not empty, this will immediately execute the following functions:  
- - get_in_out_and_twoway_conns()
- - add_geospatial_info()
- - add_network_context() 
- - display_threat_box()
+`search_ip()` - This function is triggered by the onclick event of the "Search" button after selecting an IP from the listbox. This calls the graphql *threat / details* query to find additional connections involving the selected IP. 
 
-`get_in_out_and_twoway_conns():` - With the data from the _ir-\<ip>.tsv_ file, this function will loop through each connection and store it into one of three dictionaries:
+`get_in_out_and_twoway_conns():` - With the data from the previous method, this function will loop through each connection and store it into one of three dictionaries:
- All unique 'inbound' connected IP's (Where the internal sought IP appears only as destination, or the opposite if the IP is external)  
- All unique 'outbound' connected IP's (Where the internal sought IP appears only as source, or the opposite if the IP is external)
- All unique 'two way' connected IP's (Where the sought IP appears as both source and destination)
@@ -197,18 +90,7 @@ To aid on the analysis, this function displays four html tables each containing
 
 `display_threat_box(ip):` - Displays the widgets for "Title", "Comments" textboxes and the "Save" button on the notebook, so the user can add comments related to the threat and save them to continue with the analysis.  
 
-`add_network_context()` - This function depends on the existence of the _networkcontext\_1.csv_ file, otherwise this step will be skipped.
-This function will loop through all dictionaries updating each IP with its context depending on the ranges defined in the networkcontext.
-
-`add_geospatial_info()` - This function depends on the existence of the _iploc.csv_ file. This will read through the dictionaries created, looking for every IP and updating its geolocation data according to the iploc database. If the iploc file doesn't exist, this function will be skipped.
-
-`save_threat_summary()` - This function is triggered by the onclick event of the "Save" button. Removes the widgets and cleans the notebook from previous executions, removes the selected value from the listbox widget and 
- executes each of the following functions to create the data source files for the storyboard:
-- generate_attack_map_file()
-- generate_stats()
-- generate_dendro()
-- details_inbound()
-- add_threat() 
+`save_threat_summary()` - This function is triggered by the onclick event of the "Save" button. Removes the widgets and cleans the notebook from previous executions, removes the selected value from the listbox widget and executes the *createStoryboard* mutation to save the data for the storyboard.
 
 `display_results(cols, dataframe, top)` - 
 *cols*: List of columns to display from the dataframe
@@ -216,23 +98,8 @@ This function will loop through all dictionaries updating each IP with its conte
 *top*: Number of top rows to display.
 This function will create a formatted html table to display the provided dataframe.
 
-`generate_attack_map_file(ip, inbound, outbound, twoway): `- This function depends on the existence of the _iploc.csv_ file. Using the geospatial info previously added to the dictionaries, this function will create the _globe.json_ file. If the iploc file doesn't exist, this function will be skipped.
-
-`generate_stats(ip, inbound, outbound, twoway, threat_name):` - This function reads through each of the dictionaries to group the connections by type. The results are stored in the _stats-&lt;ip&gt;.json_ file. 
-
-`generate_dendro(ip, inbound, outbound, twoway, date):` - This function groups the results from all three dictionaries into a json file, adding additionals level if the dictionaries include network context for each IP. 
-The results are stored in the _threat-dendro-&lt;ip&gt;.json_ file.
-
-`details_inbound(anchor, inbound, outbond, twoway):` -  This function executes a query to the _flow_ table looking for all additional information between the shought IP (threat) and the IP's in the 'top_n' dictionaries. The results will be stored in the _sbdet-&lt;ip&gt;.tsv_ file.
- 
-`add_threat(ip,threat_title):`- Creates or updates the _threats.csv_ file, appending the IP and Title from the web form. This will serve as the menu for the Story Board.
-
`get_top_bytes(conns_dict, top):` - Orders a dictionary in descending order by number of bytes and returns a dictionary with the top 'n' values. This dictionary will be printed onscreen, listing the most active connections first.   

`get_top_conns(conns_dict, top):` - Orders a dictionary in descending order by number of connections executed and returns a dictionary with the top 'n' values. This dictionary will be printed onscreen, listing the most active connections first.   
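+
+A minimal sketch of the ordering both functions perform (signature from the descriptions above, body illustrative):
+
+        def get_top_conns(conns_dict, top):
+            # sort by connection count, descending, and keep the top 'n' entries
+            ordered = sorted(conns_dict.items(), key=lambda kv: kv[1], reverse=True)
+            return dict(ordered[:top])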
 
-`file_is_empty(path):` - Performs a validation to check the file of a size to determine if it is empty.
- 
 `removeWidget(index):` - Javascript function that removes a specific widget from the notebook.
- 
-`get_ctx_name(full_context): ` **Deprecated**    

http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/363c02d8/spot-oa/oa/proxy/README.md
----------------------------------------------------------------------
diff --git a/spot-oa/oa/proxy/README.md b/spot-oa/oa/proxy/README.md
index ac6f73c..110a209 100644
--- a/spot-oa/oa/proxy/README.md
+++ b/spot-oa/oa/proxy/README.md
@@ -1,6 +1,6 @@
 # PROXY
 
-Proxy sub-module will extract and transform Proxy data already ranked by spot-ml and will load into csv files for presentation layer.
+Proxy sub-module extracts and transforms Proxy data already ranked by spot-ml and loads it into Impala tables for the presentation layer.
 
 ## Proxy Components
 
@@ -13,7 +13,6 @@ Proxy spot-oa main script executes the following steps:
 			data: data/proxy/<date>/
 			ipython Notebooks: ipynb/proxy/<date>/
 		
-		
 		2. Creates a copy of the notebooks templates into the ipython Notebooks path and renames them removing the "_master" part from the name.
 		
 		3. Gets the proxy_results.csv from the HDFS location according to the selected date, and copies it back to the corresponding data path.
@@ -30,11 +29,9 @@ Proxy spot-oa main script executes the following steps:
 		
 		9. Creates a hash for every full_uri + clientip pair to use as filename.  
 		 
-		10. Saves proxy_scores.tsv file.
-		 
-		11. Creates a backup of proxy_scores.tsv file.
+		10. Saves the data in the _proxy\_scores_ table. 
 		
-		12. Creates proxy data details files. 
+		11. Collects information about additional connections to display the details table in the UI.
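+
+		# Rough sketch of the hash from step 9 (md5 is an assumed choice, shown
+		# for illustration only):
+		import hashlib
+
+		def get_hash(full_uri, client_ip):
+		    return hashlib.md5("{0}-{1}".format(full_uri, client_ip)).hexdigest()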
 
 
 **Dependencies**
@@ -60,63 +57,70 @@ Before running Proxy OA, users need to configure components for the first time.
 
 **Output**
 
-- proxy_scores.tsv: Main results file for Proxy OA. This file is tab separated and it's limited to the number of rows the user selected when running [oa/start_oa.py](/spot-oa/oa/INSTALL.md#usage).
-
-		Schema with zero-indexed columns: 
-
-		0.p_date: string 
-		1.p_time: string 
-		2.clientip: string 
-		3.host: string 
-		4.reqmethod: string
-		5.useragent: string
-		6.resconttype: string
-		7.duration: int
-		8.username: string 
-		9.webcat: string 
-		10.referer: string 
-		11.respcode: string 
-		12.uriport: string 
-		13.uripath: string
-		14.uriquery: string 
-		15.serverip: string
-		16.scbytes: int
-		17.csbytes: int
-		18.fulluri: string
-		19.word: string
-		20.score: string 
-		21.uri_rep: string
-		22.uri_sev: string 
-		23.respcode_name: string 
-		24.network_context: string
-		25.hash: string
-
-
-- proxy_scores_bu.tsv: The backup file of suspicious connects in case user want to roll back any changes made during analysis. Schema is same as proxy_scores.tsv.
-     
-
-- edge-clientip-\<hash>HH.tsv: One file for each fulluri + clientip connection for each hour of the day.
-
-		Schema with zero-indexed columns:
-
-		0.p_date: string
-		1.p_time: string
-		2.clientip: string
-		3.host: string
-		4.webcat: string
-		5.respcode: string
-		6.reqmethod: string
-		7.useragent: string
-		8.resconttype: string
-		9.referer: string
-		10.uriport: string
-		11.serverip: string
-		12.scbytes: int
-		13.csbytes: int
-		14.fulluri: string
+- Proxy suspicious connections. _proxy\_scores_ table.
+
+Main results for Proxy OA. The data stored in this table is limited by the number of rows the user selected when running [oa/start_oa.py](/spot-oa/oa/INSTALL.md#usage).
+ 
+		0.tdate string
+		1.time string
+		2.clientip string
+		3.host string
+		4.reqmethod string
+		5.useragent string
+		6.resconttype string
+		7.duration int
+		8.username string
+		9.webcat string
+		10.referer string
+		11.respcode string
+		12.uriport string
+		13.uripath string
+		14.uriquery string
+		15.serverip string
+		16.scbytes int
+		17.csbytes int
+		18.fulluri string
+		19.word string
+		20.ml_score float
+		21.uri_rep string
+		22.respcode_name string
+		23.network_context string 
+
+
+- Proxy details. _proxy\_edge_ table.
+
+A query will be executed for each fulluri + clientip connection for each hour of the day.
+ 
+		0.tdate string
+		1.time string
+		2.clientip string
+		3.host string
+		4.webcat string
+		5.respcode string
+		6.reqmethod string
+		7.useragent string
+		8.resconttype string
+		9.referer string
+		10.uriport string
+		11.serverip string
+		12.scbytes int
+		13.csbytes int
+		14.fulluri string
+		15.hh int
+		16.respcode_name string
+
+
+- Proxy Ingest summary. _proxy\_ingest\_summary_ table.
+
+This table is populated with the number of connections ingested per minute during that day.
+
+        Table schema:
+        0. tdate:      string
+        1. total:      bigint 
+
 
 ###proxy_conf.json
-This file is part of the initial configuration for the proxy pipeline It will contain mapped all the columns included in the proxy_results.csv and proxy_scores.tsv files.
+This file is part of the initial configuration for the proxy pipeline. It contains a mapping of all the columns included in the proxy_results.csv file and the proxy tables.
 
 This file contains three main arrays: