You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spot.apache.org by na...@apache.org on 2017/10/18 17:50:38 UTC
[4/6] incubator-spot git commit: cleanup before merge of pr 121
http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/92af7b69/docs/open-data-model.md
----------------------------------------------------------------------
diff --git a/docs/open-data-model.md b/docs/open-data-model.md
deleted file mode 100644
index 292d586..0000000
--- a/docs/open-data-model.md
+++ /dev/null
@@ -1,892 +0,0 @@
-**Open Data Model (ODM)**
-=========================
-
-**Overview**
-------------
-
-This document describes a strategy for creating an open data model (ODM) for
-Apache Spot (incubating) (formerly known as “Open Network Insight (ONI)”) in
-support of cyber security analytic use cases. It also describes the use cases
-for which Apache Spot (incubating) running on the Cloudera platform is uniquely
-capable of addressing along with the data model.
-
-**Apache Spot (incubating) Open Data Model Strategy**
------------------------------------------------------
-
-The Apache Spot (incubating) Open Data Model (ODM) strategy aims to extend
-Apache Spot (incubating) capabilities to support a broader set of cyber security
-use cases than initially supported. The primary use case initially supported by
-Apache Spot (incubating) includes Network Traffic Analysis for network flows
-(Netflow, sflow, etc.), DNS and Proxy; primarily the identification of threats
-through anomalous event detection using both supervised and unsupervised machine
-learning.
-
-In order to support a broader set of use cases, Spot must be extended to collect
-and analyze other common “event-oriented” data sources analyzed for cyber
-threats, including but not limited to the following log types:
-
-- Proxy
-
-- Web server
-
-- Operating system
-
-- Firewall
-
-- Intrusion Prevention/Detection (IDS/ IPS)
-
-- Data Loss Prevention
-
-- Active Directory / Identity Management
-
-- User/Entity Behavior Analysis
-
-- Endpoint Protection/Asset Management
-
-- Network Metadata/Session and PCAP files
-
-- Network Access Control
-
-- Mail
-
-- VPN
-
-- etc..
-
-One of the biggest challenges organizations face today in combating cyber
-threats is collecting and normalizing data from the myriad of security event
-data sources (hundreds) in order to build the needed analytics. This often
-results in the analytics being dependent upon the specific technologies used by
-an organization to detect threats and prevents the needed flexibility and
-agility to keep up with these ever-increasing (and complex) threats. Technology
-lock-in is sometimes a byproduct of today’s status quo, as it’s extremely costly
-to add new technologies (or replace existing ones) because of the downstream
-analytic dependencies.
-
-To achieve the goal of extending Apache Spot (incubating) to support additional
-use cases, it is necessary to create an open data model for the most relevant
-security event and contextual data sources; Security event logs or alerts,
-Network context, User details and information that comes from the endpoints or
-any other console that are being use to manage the security / administration of
-our endpoints. The presence of an open data model, which can be applied
-“on-read” or “on-write”, in batch or stream, will allow for the separation of
-security analytics from the specific data sources on which they are built. This
-“separation of duties” will enable organizations to build analytics that are not
-dependent upon specific technologies and provide the flexibility to change
-underlying data sources and also provide segmentation of this information,
-without impacting the analytics. This will also afford security vendors the
-opportunity to build additional products on top of the Open Data Model to drive
-new revenue streams and also to design new ways to detect threats and APT.
-
-**Apache Spot (incubating) Enabled Use Cases**
-----------------------------------------------
-
-Spot on the Cloudera platform is uniquely positioned to help address the
-following cyber security use cases, which are not effectively addressed by
-legacy technologies:
-
-**- Detection of known & unknown threats leveraging machine learning and
-advanced analytic modeling**
-
-Current technologies are limited in the analytics they can apply to detect
-threats. These limitations stem from the inability to collect all the data
-sources needed to effectively identify threats (structured, unstructured, etc.)
-and inability to process the massive volumes of data needed to do so (billions
-of events per day). Legacy technologies are typically focus and limited to
-rules-based and signature detection. They are somewhat “effective” at detecting
-known threats but struggle with new threats.
-
-Spot addresses these gaps through its ability to collect any data type of any
-volume. Coupled with the various analytic frameworks that are provided
-(including machine learning), Spot enables a whole new class of analytics that
-can scale to today’s demands. The topic model used by Spot to detect anomalous
-network traffic is one example of where the Spot platform excels.
-
-**- Reduction of mean time to incident detection & resolution (MTTR)**
-
-One of the challenges organizations face today is detecting threats early enough
-to minimize adverse impacts. This stems from the limitations previously
-discussed with regards to limited analytics. It can also be attributed to the
-fact that most of the investigative queries often take hours or days to return
-results. Legacy technologies can’t offer or have a central data store for
-facilitating such investigations due to their inability to store and serve the
-massive amounts of data involved. This cripples incident investigations and
-results in MTTRs of many weeks or months, meanwhile the adverse impacts of the
-breach are magnified, thus making the threat harder to eradicate.
-
-Apache Spot (incubating) addresses these gaps by providing the capability for a
-central data store that houses ALL the data needed to facilitate an
-investigation, returning investigative query results in seconds and minutes (vs.
-hours and days). Spot can effectively reduce incident MTTR and reduce adverse
-impacts of a breach.
-
-**- Threat Hunting**
-
-It’s become necessary for organizations to “hunt” for active threats as
-traditional passive threat detection approaches are not sufficient. “Hunting”
-involves performing ad-hoc searches and queries over vast amounts of data
-representing many weeks and months’ worth of events, as well as applying ad-hoc
-/ tune algorithms to detect the needle in the haystack. Traditional systems do
-not perform well for these types of activities as the query results sometimes
-take hours and days to be retrieved. These traditional systems also lack the
-analytic flexibility to construct the necessary algorithms and logic needed.
-
-Apache Spot (incubating) addresses these gaps in the same ways it addresses
-others; by providing a central data store with the needed analytic frameworks
-that scale to the needed workloads.
-
-**Data Model**
---------------
-
-In order to provide a framework for effectively analyzing data for cyber
-threats, it is necessary to collect and analyze standard security event
-logs/alerts and contextual data regarding the entities referenced in these
-logs/alerts. The most common entities include network, user and endpoint, but
-there are others such as file.
-
-In the diagram below, the raw event tells us that user “jsmith” successfully
-logged in to an Oracle database from the IP address 10:1.1.3. Based on the raw
-event only, we don’t know if this event is a legitimate threat or not. After
-injecting user and endpoint context, the enriched event tells us this event is a
-potential threat that requires further investigation.
-
-![](https://lh3.googleusercontent.com/-Q8TasmY-vRQ/WHVnoXAK44I/AAAAAAAAAtw/XBDy3PC98k800iaWpNIzAYoQ8S9zc5NBQCLcB/s0/ODMimage1.jpg)
-
-Based on the need to collect and analyze both security events, logs or alerts
-and contextual data, support for the following types of security information are
-planned for inclusion in the Spot Open Data Model:
-
-- Security event logs/alerts This data type includes event logs from common
- data sources used to detect threats and includes network flows, operating
- system logs, IPS/IDS logs, firewall logs, proxy logs, web logs, DLP logs,
- etc.
-
-- Network context data This data type includes information about the network,
- which can be gleaned from Whois servers, asset databases and other similar
- data sources.
-
-- User context data This data type includes information from user and identity
- management systems including Active Directory, Centrify, and other identity
- and access management systems.
-
-- Endpoint context data This data includes information about endpoint systems
- (servers, workstations, routers, switches, etc.) and can be sourced from
- asset management systems, vulnerability scanners, and endpoint
- management/detection/response systems such as Webroot, Tanium, Sophos,
- Endgame, CarbonBlack, Intel Security ePO and others.
-
-- File context data **(ROADMAP ITEM)** This data includes contextual
- information about files and can be sourced from systems such as FireEye,
- Application Control , Intel Security McAfee Threat Intelligence Exchange
- (TIE).
-
-- Threat intelligence context data **(ROADMAP ITEM)** This data includes
- contextual information about URLs, domains, websites, files and others.
-
-**Naming Convention**
----------------------
-
-A naming convention is needed for the Open Data Model to represent common
-attributes across vendor products and technologies. The naming convention is
-described below.
-
-**Prefixes**
-------------
-
-| Prefix | Description |
-|----------|-----------------------------------------------------------------------------------------------------------------------------------|
-| src | Corresponds to the “source” fields within a given event (i.e. source address) |
-| dst | Corresponds to the “destination” fields within a given event (i.e. destination address) |
-| dvc | Corresponds to the “device” applicable fields within a given event (i.e. device address) and represent where the event originated |
-| fwd | Forwarded from device |
-| request | Corresponds to requested values (vs. those returned, i.e. “requested URI”) |
-| response | Corresponds to response value (vs. those requested) |
-| file | Corresponds to the “file” fields within a given event (i.e. file type) |
-| user | Corresponds to user attributes (i.e. name, id, etc.) |
-| xlate | Corresponds to translated values within a given event (i.e. src_xlate_ip for “translated source ip address” |
-| in | Ingress |
-| out | Egress |
-| new | New value |
-| orig | Original value |
-| app | Corresponds to values associated with application events |
-
-**Security Event Log/Alert Data Model**
----------------------------------------
-
-The data model for security event logs/alerts is detailed in the below. The
-attributes are categorized as follows:
-
-- Common -attributes that are common across many device types
-
-- Device -attributes that are applicable to the device that generated the
- event
-
-- File -attributes that are applicable to file objects referenced in the event
-
-- Endpoint -attributes that are applicable to the endpoints referenced in the
- event
-
-- User- attributes that are applicable to the user referenced in the event
-
-- Proxy - attributes that are applicable to proxy events
-
-- Protocol
-
-- DNS - attributes that are specific to DNS events
-
-- HTTP - attributes that are specific to HTTP events
-
-- SMTP, SSH, TLS, DHCP, IRC, SNMP and FTP
-
-Note: The model will evolve to include reserved attributes for additional device
-types that are not currently represented. The model can currently be extended to
-support ANY attribute for ANY device type by following the guidance outlined in
-the section titled [“Extensibility of Data Model”.](#extensibility)
-
-Note: Attributes denoted in **Bold**, represent those that are listed in the
-model multiple times for the purpose of demonstrating attribute coverage for a
-particular entity (endpoint, user, network, etc.) or log type (Proxy, DNS,
-etc.).
-
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-|--------------|---------------------------|-------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------|
-| **Common** | eventtime | long | timestamp of event (UTC) | 1472653952 |
-| | duration | int | Time duration (milliseconds) | 2345 |
-| | eventid | string | Unique identifier for event | x:2388 |
-| | org | string | Organization | “HR” or “Finance” or “CustomerA” |
-| | type | string | Type information | “Informational”, “image/gif” |
-| | nproto | string | Network protocol of event | TCP, UDP, ICMP |
-| | aproto | string | Application protocol of event | HTTP, NFS, FTP |
-| | msg | string | Message (details of action taken on object) | Some long string |
-| | mac | string | MAC address | 94:94:26:3:86:16 |
-| | severity | string | Severity of event | High, 10, 1 |
-| | raw | string | Raw text message of entire event | Complete copy of log entry |
-| | risk | Floating point | Risk score | 95.67 |
-| | code | string | Response or error code | 404 |
-| | category | string | Event category | /Application/Start |
-| | qry | string | Query (DNS query, URI query, SQL query, etc.) | Select \* from "table" |
-| | service | string | (i.e. service name, type of service) | sshd |
-| | state | string | State of object | Running, Paused, stopped |
-| | in_bytes | int | Bytes in | 1025 |
-| | out_bytes | int | Bytes out | 9344 |
-| | additional_attrs | String (JSON Map) | Custom event attributes | "building":"729","cube":"401" |
-| | dvc_time | long | UTC timestamp from device where event/alert originates or is received | 1472653952 |
-| | dvc_ip4/dvc_ip6 | long | IP address of device | Integer representaion of 10.1.1.1 |
-| | dvc_host | string | Hostname of device | Integer representaion of 10.1.1.1 |
-| | dvc_type | string | Device type that generated the log | Unix, Windows, Sonicwall |
-| | dvc_vendor | string | Vendor | Microsoft, Fireeye, Intel Security |
-| | dvc_version | string | Version | 5.4 |
-| | fwd_ip4/fwd_ip6 | long | Forwarded from device | Integer representation of 10.1.1.1 |
-| | version | string | Version | “3.2.2” |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **Network** | src_ip4/src_ip6 | bigint | Source ip address of event | Integer representation of 10.1.1.1 |
-| | src_host | string | Source FQDN of event | test.companyA.com |
-| | src_domain | string | Domain name of source address | companyA.com |
-| | src_port | int | Source port of event | 1025 |
-| | src_country_code | string | Source country code | cn |
-| | src_country_name | string | Source country name | China |
-| | src_region | string | Source region | string |
-| | src_city | string | Source city | Shenghai |
-| | src_lat | int | Source latitude | |
-| | src_long | int | Source longitude | |
-| | dst_ip4/dst_ip6 | bigint | Destination ip address of event | Integer representaion of 10.1.1.1 |
-| | dst_host | string | Destination FQDN of event | test.companyA.com |
-| | dst_domain | string | Domain name of destination address | companyA.com |
-| | dst_port | int | Destination port of event | 80 |
-| | dst_country_code | string | Source country code | cn |
-| | dst_country_name | string | Source country name | China |
-| | dst_region | string | Source region | string |
-| | dst_city | string | Source city | Shenghai |
-| | dst_lat | int | Source latitude | |
-| | dst_long | int | Source longitude | |
-| | asn | int | Autonomous system number | 33 |
-| | **in_bytes** | int | Bytes in | 987 |
-| | **out_bytes** | int | Bytes out | 1222 |
-| | direction | string | Direction | In, inbound, outbound, ingress, egress |
-| | flags | string | TCP flags | .AP.SF |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **File** | file_name | string | Filename from event | output.csv |
-| | file_path | string | File path | /root/output.csv |
-| | file_atime | bigint | Timestamp (UTC) of file access | 1472653952 |
-| | file_acls | string | File permissions | rwx-rwx-rwx |
-| | file_type | string | Type of file | “.doc” |
-| | file_size | int | Size of file in bytes | 1244 |
-| | file_desc | string | Description of file | Project Plan for Project xyz |
-| | file_hash | string | Hash of file | |
-| | file_hash_type | string | Type of hash | MD5, SHA1,SHA256 |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **Endpoint** | object | string | File/Process/Registry | File, Registry, Process |
-| | action | string | Action taken on object (open/delete/edit) | Open, Edit |
-| | **msg** | string | Message (details of action taken on object) | Some long string |
-| | app | string | Application | Microsoft Powerpoint |
-| | location | string | Location | Atlanta, GA |
-| | proc | string | Process | SSHD |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **User** | user_name | string | username from event | mhicks |
-| | email | string | Email address | test\@companyA.com |
-| | user_id | string | userid | 234456 |
-| | user_loc | string | location | Herndon, VA |
-| | user_desc | string | Description of user | |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **DNS** | dns_class | string | DNS class | 1 |
-| | dns_length | int | DNS frame length | 188 |
-| | **dns_qry** | string | Requested DNS query | test.test.com |
-| | **dns_code** | string | Response code | 0x00000001 |
-| | dns_response_qry | string | Response to DNS Query | 178.2.1.99 |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **Proxy** | **category** | string | Event category | SG-HTTP-SERVICE |
-| | browser | string | Web browser | Internet Explorer |
-| | **code** | string | Error or response code | 404 |
-| | **in_bytes** | int | Bytes in | 1025 |
-| | **out_bytes** | int | Bytes out | 1288 |
-| | referrer | string | Referrer | www.usatoday.com |
-| | **request_uri** | string | Requested URI | /wcm/assets/images/imagefileicon.gif |
-| | filter_rule | string | Applied filter or rule | Internet, Rule 6 |
-| | filter_result | string | Result of applied filter or rule | Proxied, Blocked |
-| | **qry** | string | URI query | ?func=S_senseHTML&Page=a26815a313504697a126279 |
-| | **action** | string | Action taken on object | TCP_HIT, TCP_MISS, TCP_TUNNELED |
-| | method | string | HTTP method | GET, CONNECT, POST |
-| | **type** | string | Type of request | image/gif |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **HTTP** | request_method | string | HTTP method | GET, CONNECT, POST |
-| | **request_uri** | string | Requested URI | /wcm/assets/images/imagefileicon.gif |
-| | request_body_len | int | Length of request body | 98 |
-| | request_user_name | string | username from event | mhicks |
-| | request_password | string | Password from event | abc123 |
-| | request_proxied | string | | |
-| | request_headers | MAP | HTTP request headers | request_headers[‘HOST’] request_headers[‘USER-AGENT’] request_headers[‘ACCEPT’] |
-| | response_status_code | int | HTTP response status code | 404 |
-| | response_status_msg | string | HTTP response status message | “Not found” |
-| | response_body_len | int | Length of response body | 98 |
-| | response_info_code | int | HTTP response info code | 100 |
-| | response_info_msg | string | HTTP response info message | “Some string” |
-| | response_resp_fuids | string | Response FUIDS | |
-| | response_mime_types | string | Mime types | “cgi,bat,exe” |
-| | response_headers | MAP | Response headers | response_headers[‘SERVER’] response_headers[‘SET-COOKIE’’] response_headers[‘DATE’] |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **SMTP** | trans_depth | int | Depth of email into SMTP exchange | Coming soon |
-| | headers_helo | string | Helo header | Coming soon |
-| | headers_mailfrom | string | Mailfrom header | Coming soon |
-| | headers_rcptto | string | Rcptto header | Coming soon |
-| | headers_date | string | Header date | Coming soon |
-| | headers_from | string | From header | Coming soon |
-| | headers_to | string | To header | Coming soon |
-| | headers_reply_to | string | Reply to header | Coming soon |
-| | headers_msg_id | string | Message ID | Coming soon |
-| | headers_in_reply_to | string | In reply to header | Coming soon |
-| | headers_subject | string | Subject | Coming soon |
-| | headers_x_originating_ip4 | bigint | Originating IP address | Coming soon |
-| | headers_first_received | string | First to receive message | Coming soon |
-| | headers_second_received | string | Second to receive message | Coming soon |
-| | last_reply | string | Last reply in message chain | Coming soon |
-| | path | string | Path of message | Coming soon |
-| | user_agent | string | User agent | Coming soon |
-| | tls | boolean | Indication of TLS use | Coming soon |
-| | is_webmail | boolean | Indication of webmail | Coming soon |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **FTP** | **user_name** | string | Username | Coming soon |
-| | password | string | Password | Coming soon |
-| | command | string | FTP command | Coming soon |
-| | arg | string | Argument | Coming soon |
-| | mime_type | string | Mime type | Coming soon |
-| | file_size | int | File size | Coming soon |
-| | reply_code | int | Reply code | Coming soon |
-| | reply_msg | string | Reply message | Coming soon |
-| | data_channel_passive | boolean | Passive data channel? | Coming soon |
-| | data_channel_rsp_p | string | | Coming soon |
-| | cwd | string | Current working directory | Coming soon |
-| | cmdarg_ts | float | | Coming soon |
-| | cmdarg_cmd | string | Command | Coming soon |
-| | cmdarg_arg | string | Command argument | Coming soon |
-| | cmdarg_seq | int | Sequence | Coming soon |
-| | pending_commands | string | Pending commands | Coming soon |
-| | is_passive | boolean | Passive mode enabled | Coming soon |
-| | fuid | string | Coming soon | Coming soon |
-| | last_auth_requested | string | Coming soon | Coming soon |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **SNMP** | **version** | string | Coming soon | Coming soon |
-| | community | string | Coming soon | Coming soon |
-| | get_requests | int | Coming soon | Coming soon |
-| | get_bulk_requests | int | Coming soon | Coming soon |
-| | get_responses | int | Coming soon | Coming soon |
-| | set_requests | int | Coming soon | Coming soon |
-| | display_string | string | Coming soon | Coming soon |
-| | up_since | float | Coming soon | Coming soon |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **TLS** | **version** | string | Coming soon | Coming soon |
-| | cipher | string | Coming soon | Coming soon |
-| | curve | string | Coming soon | Coming soon |
-| | server_name | string | Coming soon | Coming soon |
-| | resumed | boolean | Coming soon | Coming soon |
-| | next_protocol | string | Coming soon | Coming soon |
-| | established | boolean | Coming soon | Coming soon |
-| | cert_chain_fuids | string | Coming soon | Coming soon |
-| | client_cert_chain_fuids | string | Coming soon | Coming soon |
-| | subject | string | Coming soon | Coming soon |
-| | issuer | string | Coming soon | Coming soon |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **SSH** | **version** | string | Coming soon | Coming soon |
-| | auth_success | boolean | Coming soon | Coming soon |
-| | client | string | Coming soon | Coming soon |
-| | server | string | Coming soon | Coming soon |
-| | cipher_algorithm | string | Coming soon | Coming soon |
-| | mac_algorithm | string | Coming soon | Coming soon |
-| | compression_algorithm | string | Coming soon | Coming soon |
-| | key_exchange_algorithm | string | Coming soon | Coming soon |
-| | host_key_algorithm | string | Coming soon | Coming soon |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **DHCP** | assigned_ip4 | bigint | Coming soon | Coming soon |
-| | mac | string | Coming soon | Coming soon |
-| | lease_time | double | Coming soon | Coming soon |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **IRC** | user | string | Coming soon | Coming soon |
-| | nickname | string | Coming soon | Coming soon |
-| | command | string | Coming soon | Coming soon |
-| | value | string | Coming soon | Coming soon |
-| | additional_data | string | Coming soon | Coming soon |
-| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
-| **Flow** | in_packets | int | Coming soon | Coming soon |
-| | out_packets | int | Coming soon | Coming soon |
-| | **in_bytes** | int | Coming soon | Coming soon |
-| | **out_bytes** | int | Coming soon | Coming soon |
-| | conn_state | string | Coming soon | Coming soon |
-| | history | string | Coming soon | Coming soon |
-| | duration | float | Coming soon | Coming soon |
-| | src_os | string | Coming soon | Coming soon |
-| | dst_os | string | Coming soon | Coming soon |
-
-Note: It is not necessary to populate all of the attributes within the model.
-For attributes not populated in a single security event log/alert, contextual
-data may not be available. For example, the sample event below can be enriched
-with contextual data about the referenced endpoints (10.1.1.1 and
-192.168.10.10), but not a user, because username is not populated.
-
-> **date,time,source_ip,source_port,protocol,destination_ip,destination_port,bytes
-> 12/12/2015,23:14:56,10.1.1.1,1025,tcp,192.168.10.10,443,1183**
-
-**Context Models**
-==================
-
-The recommended approach for populating the context models (user, endpoint,
-network, etc.) involves consuming information from the systems most capable or
-providing the needed context. Populating the user context model is best
-accomplished by leveraging user/identity management systems such as Active
-Directory or Centrify and populating the model with details such as the user’s
-full name, job title, phone number, manager’s name, physical address,
-entitlements, etc. Similarly, an endpoint model can be populated by consuming
-information from endpoint/asset management systems (Tanium, Webroot, etc.),
-which provide information such as the services running on the system, system
-owner, business context, etc.
-
-**User Context Model**
-----------------------
-
-The data model for user context information is as follows:
-
-| **Attribute** | **Data Type** | **Description** | **Sample Values** |
-|------------------|------------------------------------------------------|--------------------------------------------------------------|-------------------------------------|
-| dvc_time | bigint | Timestamp from when the user context information is obtained | 1472653952 |
-| created | bigint | Timestamp from when user was created | 1472653952 |
-| Changed–––– | bigint | Timestamp from when user was updated | 1472653952 |
-| lastlogon | bigint | Timestamp from when user last logged on | 1472653952 |
-| logoncount | int | Number of times account has logged on | 232 |
-| lastreset | bigint | Timestamp from when user last reset passwod | 1472653952 |
-| expiration | bigint | Date/time when user expires | 1472653952 |
-| userid | string | Unique user id | 1234 |
-| username | string | Username in event log/alert | jsmith |
-| name_first | string | First name | John |
-| name_middle | string | Middle name | Henry |
-| name_last | string | Last name | Smith |
-| name_mgr | string | Manager’s name | Ronald Reagan |
-| phone | string | Phone number | 703-555-1212 |
-| email | string | Email address | jsmith\@company.com |
-| code | string | Job code | 3455 |
-| loc | string | Location | US |
-| departm | string | Department | IT |
-| dn | | Distinguished name | "CN=scm-admin-mej-test2-adk,OU=app- |
-| ou | string | Organizational unit | EAST |
-| empid | string | Employee ID | 12345 |
-| title | string | Job Title | Director of IT |
-| groups | string (comma separated list, no spaces after comma) | Groups to which the user belongs | “Domain Admins”, “Domain Users” |
-| dvc_type | string | Device type that generated the user context data | Active Directory |
-| dvc_vendor | string | Vendor | Microsoft |
-| dvc_version | string | Version | 8.1.2 |
-| additional_attrs | string | Additional attributes of user | Key value pairs |
-
-**Endpoint Context Model**
---------------------------
-
-The data model for endpoint context information is as follows:
-
-| **Abbreviation** | **Data Type** | **Description** | **Sample Values** |
-|------------------|--------------------------------------------------|------------------------------------------------------------------|------------------------------------------------------|
-| dvc_time | bigint | Timestamp from when the endpoint context information is obtained | 1472653952 |
-| ip4 | bigint | IP address of endpoint | Integer representaion of 10.1.1.1 |
-| ip6 | bigint | IP address of endpoint | Integer representaion of 10.1.1.1 |
-| os | string | Operating system | Redhat Linux 6.5.1 |
-| os_version | string | Version of OS | 5.4 |
-| os_sp | string | Service pack | SP 2.3.4.55 |
-| tz | string | timezone | EST |
-| hotfixes | string | Applied hotfixes | 993.2 |
-| disks | string | Available disks | \\Device\\HarddiskVolume1, \\Device\\HarddiskVolume2 |
-| removables | string | Removable media devices | USB Key |
-| nics | string | Network interfaces | fe10::28f4:1a47:658b:d6e8, fe82::28f4:1a47:658b:d6e8 |
-| drivers | string | Installed kernel drivers | ntoskrnl.exe, hal.dll |
-| users | string | Local user accounts | administrator, jsmith |
-| host | string | Hostname of endpoint | tes1.companya.com |
-| mac | string | MAC address of endpoint | fe10::28f4:1a47:658b:d6e8 |
-| owner | string | Endpoint owner (name) | John Smith |
-| vulns | string (comma separated, no spaces after commas) | Vulnerability identifiers (CVE identifier) | CVE-123, CVE-456 |
-| loc | string | Location | US |
-| departm | string | Department name | IT |
-| company | string | Company name | CompanyA |
-| regs | string (comma-separated) | Applicable regulations | HIPAA, SOX |
-| svcs | string (comma-separated) | Services running on system | Cisco Systems, Inc. VPN Service, Adobe LM Service |
-| procs | string | Processes | svchost.exe, sppsvc.exe |
-| criticality | string | Criticality of device | Very High |
-| apps | string (comma-separated) | Applications running on system | Microsoft Word, Chrome |
-| desc | string | Endpoint descriptor | Some string |
-| dvc_type | string | Device type that generated the log | Microsoft Windows 7 |
-| dvc_vendor | string | Vendor | Endgame |
-| dvc_version | string | Version | 2.1 |
-| architecture | string | CPU architecture | x86 |
-| uuid | string | Universally unique identifier | a59ba71e-18b0-f762-2f02-0deaf95076c6 |
-| memtotal | int | Total memory (bytes) | 844564433 |
-| additional_attrs | string | Additional attributes | Key value pairs |
-
-**VPN Context Model**
----------------------
-
-The data model for VPN context information is based on the VPN logs as follows:
-
-| **Abbreviation** | **Data Type** | **Description** | **Sample Values** |
-|------------------|-------------------------|----------------------------------------------------------------------------|------------------------------------------------------|
-| dvc_time | bigint | Timestamp from when the endpoint context information is obtained | 1472653952 |
-| ip4 | bigint | IP address of VPN box | Integer representaion of 10.1.1.1 |
-| ip6 | bigint | IP address of VPN box | Integer representaion of 10.1.1.1 |
-| vpn_vendor | string | Vendor VPN | Cisco |
-| vpn_version | string | Version VPN | 3.0 |
-| vpn_sp | string | VPN Service pack | 5 |
-| tz | string | VPN timezone | EST |
-| vpn_hotfixes | string | VPN Applied hotfixes | 1134 |
-| vpn_nics | string | Network interfaces | fe10::28f4:1a47:658b:d6e8, fe82::28f4:1a47:658b:d6e8 |
-| vpn_host | VPN Country Code | string | MX |
-| vpn_country_name | VPN Country Name | string | Mexico |
-| vpn_ip | | string | Integer representation of 10.1.1.2 |
-| vpn_encrypt | VPN encryption protocol | string | IPSEC |
-| vpn_username | string | VPN user account | jsmith |
-| vpn_user_ip | string | VPN User IP address | Integer representation of 10.1.1.2 |
-| vpn_user_cc | string | VPN Country Code | US |
-| vpn_user_cn | string | VPN Country Name | United States |
-| vpn_user_auth | string | VPN user authorization / role | Admin, normal user, etc |
-| vpn_account_vip | string | Criticality of the VPN account | Medium, High |
-| vpn_uuid | string | Universally unique identifier | a59ba71e-18b0-f762-2f02-0deaf95076c6 |
-| uuids | string | Universally unique identifier(s) comes from thee endpoint context if match | a59ba71e-18b0-f762-2f02-0deaf95xmexzA |
-| additional_attrs | string | Additional attributes | Key value pairs |
-
-**Network Context Model**
--------------------------
-
-The data model for network context information is based on “whois” information
-as follows:
-
-| **Attribute** | **Data Type** | **Description** | **Sample Values** |
-|----------------------------------------|---------------|----------------------------------------|-------------------|
-| domain_name | string | Domain name | |
-| registry_domain_id | string | Registry Domain ID | |
-| registrar_whois_server | string | Registrar WHOIS Server | |
-| registrar_url | string | Registrar URL | |
-| update_date | bigint | UTC timestamp | |
-| creation_date | bigint | Creation Date | |
-| registrar_registration_expiration_date | bigint | Registrar Registration Expiration Date | |
-| registrar | string | Registrar | |
-| registrar_iana_id | string | Registrar IANA ID | |
-| registrar_abuse_contact_email | string | Registrar Abuse Contact Email | |
-| registrar_abuse_contact_phone | string | Registrar Abuse Contact Phone | |
-| domain_status | string | Domain Status | |
-| registry_registrant_id | string | Registry Registrant ID | |
-| registrant_name | string | Registrant Name | |
-| registrant_organization | string | Registrant Organization | |
-| registrant_street | string | Registrant Street | |
-| registrant_city | string | Registrant City | |
-| registrant_state_province | string | Registrant State/Province | |
-| registrant_postal_code | string | Registrant Postal Code | |
-| registrant_country | string | Registrant Country | |
-| registrant_phone | string | Registrant Phone | |
-| registrant_email | string | Registrant Email | |
-| registry_admin_id | string | Registry Admin ID | |
-| name_server | string | Name Server | |
-| dnssec | string | DNSSEC | |
-
-### **Extensibility of Data Model**
-
-The aforementioned data model can be extended to accommodate custom attributes
-by embedding key-value pairs within the log/alert/context entries. Each model
-will support an additional attribute by the name of additional_attrs whose value
-would be a JSON string. This JSON string will contain a Map (and only a Map) of
-additional attributes that can’t be expressed in the specified model
-description. Regardless of the type of these additional attributes, they will
-always be interpreted as String. It’s up to the user, to translate them to
-appropriate types, if necessary, in the analytics layer. It is also the user’s
-responsibility to populate the aforementioned attribute as a Map, by presumably
-parsing out these attributes from the original message. For example, if a user
-wanted to extend the user context model to include a string attribute for “Desk
-Location” and “City”, the following string would be set for additional_attrs:
-
-| **Attribute Key** | **Attribute Value** |
-|-------------------|-------------------------------------------------|
-| additional_attrs | {"dsk_location":"B3-F2-W3", "city":"Palo Alto"} |
-
-Something similar can be done for endpoint context model, security event
-log/alert model and other entities.
-
-**Note:** This [UDF library](https://github.com/klout/brickhouse) can be used
-for converting to/from JSON.
-
-**Model Relationships**
------------------------
-
-The relationships between the data model entities are illustrated below.
-
-![enter image description here](https://lh3.googleusercontent.com/-SxEubiTPzFE/WHVo0uxgJtI/AAAAAAAAAt8/3su9v3h0MsovJ0Mhy08EbuFTvRvKEoIwQCLcB/s0/ODMimage2.jpg)
-
-**Data Ingestion Framework**
-----------------------------
-
-One of the challenges in populating the data model is the large number of
-products and technologies that organizations are currently using to manage
-security event logs/alerts, user and endpoint information. There are literally
-dozens of vendors in each category that offer technologies that could be used to
-populate the model. The labor required to transform the data and map the
-attributes to the data model is extensive when you consider how many
-technologies are in the mix at each organization (and across organizations). One
-way to address this challenge is with a Data Ingestion Framework that provides a
-configuration-based mechanism to perform the transformations and mappings. A
-configuration-based capability will allow the ingest pipelines to become
-portable and reusable across the community. For example, if I create an ingest
-pipeline for Centrify to populate the user context model, it can be shared with
-other users of Centrify who can immediately realize the benefit. Such a
-framework could allow the community to quickly build the necessary pipelines for
-the dozens (and hundreds) of technologies being used in the market. Without a
-standard ingest framework, each pipeline is built independently, requiring more
-labor, providing no standardization and little portability. It’s also important
-that the data ingestion framework support the ability to both capture the “raw”
-event and create a meta event that represents the normalized event and maps the
-attributes to the defined data model. This will ensure both stream and batch
-processing use cases are supported.
-
-Streamsets is an ingest framework that provides the needed functionality
-outlined above. Sample Streamsets ingest pipelines for populating the ODM with
-common data sources will be published to the Spot Github repo.
-
-**Data Formats**
-----------------
-
-**Avro**
---------
-
-Avro is the recommended data format due to its schema representation,
-compatibility checks, and interoperability with Hadoop. Avro supports a pure
-JSON representation for readability and ease of use but also a binary
-representation of the data for efficient storage. Avro is the optimal format for
-streaming-based analytic use cases. A sample event and corresponding schema
-representation are detailed below.
-
-**Event**
-
-{
-
-"eventtime":1469562994,
-
-"src_ip4":”192.168.1.1”,
-
-“src_host”:”test1.clouera.com”,
-
-“src_port”:1029, “dst_ip4”:”192.168.21.22”,
-
-“dst_host”:”test3.companyA.com”,
-
-“dst_port”:443,
-
-“dvc_type”:”sshd”,
-
-“category”:”auth”,
-
-“aproto”:”sshd”,
-
-“msg”:”user:mhicks successfully logged in to test3.companyA.com from
-192.168.1.1”,
-
-“username”:”mhicks”,
-
-“Severity”:3,
-
-}
-
-
-
-**Schema**
-
-{
-
-"type": "record",
-
-"doc":"This event records SSHD activity",
-
-"name": "auth",
-
-"fields":{
-
-{"name":"eventtime", "type":"long", "doc":"Stop time of event""},
-
-{"name":"src_ip4", "type":"long", "doc":”Source IP Address"},
-
-{"name":"src_host", "type":"string",”doc”:”Source hostname},
-
-{"name":"src_port", "type":"int",”doc”:”Source port”},
-
-{"name":"dst_ip4", "type":"long", "doc"::”Destination IP Address"},
-
-{"name":"dst_host", "type":"string", "doc":”Destination IP Address"},
-
-{"name":"dst_port", "type":"int", ”doc”:”Destination port”},
-
-{"name":"dvc_type", "type":"string", “doc”:”Source device type”},
-
-{"name":"category", "type":"string",”doc”:”category/type of event message”},
-
-{"name":"aproto", "type":"string",”doc”:”Application or network protocol”},
-
-{"name":"msg", "type":"string",”doc”:”event message”},
-
-{"name":"username", "type":"string",”doc”:”username”},
-
-{"name":"severity", "type":"int",”doc”:”severity of event on scale of 1-10”},
-
-}
-
-
-
-**JSON**
---------
-
-JSON is commonly used as a data-interchange format due to it’s ease of use and
-familiarity within the development community. The corresponding JSON object for
-the sample event described previously is noted below.
-
-{
-
-“eventtime”:1469562994,
-
-“src_ip4”:”192.168.1.1”,
-
-“src_host”:”test1.clouera.com”,
-
-“src_port”:1029,
-
-“dst_ip4”:”192.168.21.22”,
-
-“dst_host”:”test3.companyA.com”,
-
-“dst_port”:443,
-
-“aproto”:”sshd”,
-
-“msg”:”user:mhicks successfully logged in to test3.companyA.com from
-192.168.1.1”,
-
-“username”:”mhicks”,
-
-}
-
-**Parquet**
------------
-
-Parquet is a columnar storage format that offers the benefits of compression and
-efficient columnar data representation and is optimal for batch analytic use
-cases. More information on parquet can be found here:
-https://parquet.apache.org/documentation/latest/ It should be noted that
-conversion from Avro to Parquet is supported. This allows for data collected and
-analyzed for stream-based use cases to be easily converted to Parquet for
-longer-term batch analytics.
-
-**Example - Advanced Threat Modeling**
---------------------------------------
-
-In this example, the ODM is leveraged to build an “event” table for a threat
-model that uses attributes native to the ODM and derived attributes, which are
-calculations based on the aggregate data stored in the model. In this context,
-an “event” table is defined by the attributes to be evaluated for predictive
-power in identifying threats and the actual attribute values (i.e rows in the
-table). In the example below, the event table is composed of the following
-attributes, which are then leveraged to identify threats via a Risk Score
-analytic model:
-
-- “src_ipv4” - This attribute is native to the security event log component of
- the ODM and represents the source IP address of the corresponding table row
-
-- “os” - This attribute is native to the endpoint context component of the ODM
- and represents the operating system of the endpoint system in the table row
-
-- SUM (in_bytes + out_bytes) for the last 7 days - “in_bytes” and “out_bytes”
- are native to the security event log component of the ODM. This derived
- attribute represents a summation of bytes between the source address and
- destination domain over the last 7 days
-
-- “dst_domain” - This attribute is native to the security event log component
- of the ODM and represents the destination domain
-
-- Days since “creation_date” - “creation_date” is native to the network
- context component of the ODM and represents the date the referenced domain
- was registered. This derived attribute calculates the days since the domain
- was created/registered.
-
-| **src_ipv4** | **OS** | **dst domain** | **Days since “creation_date”** | **SUM (in_bytes + out_bytes)** | **Risk Score (1-100)** |
-|--------------|-----------|----------------|--------------------
<TRUNCATED>