Posted to commits@spot.apache.org by na...@apache.org on 2017/09/27 20:48:53 UTC
[1/3] [incubator-spot] Git Push Summary
Repository: incubator-spot
Updated Branches:
refs/heads/SPOT-181_ODM 34df9ad9b -> 392e2a903
[2/3] incubator-spot git commit: reinstating ODM document
Posted by na...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-spot/blob/392e2a90/docs/open-data-model.md
----------------------------------------------------------------------
diff --git a/docs/open-data-model.md b/docs/open-data-model.md
new file mode 100644
index 0000000..292d586
--- /dev/null
+++ b/docs/open-data-model.md
@@ -0,0 +1,892 @@
+**Open Data Model (ODM)**
+=========================
+
+**Overview**
+------------
+
+This document describes a strategy for creating an open data model (ODM) for
+Apache Spot (incubating), formerly known as “Open Network Insight (ONI)”, in
+support of cyber security analytic use cases. It also describes the use cases
+that Apache Spot (incubating), running on the Cloudera platform, is uniquely
+capable of addressing, along with the data model itself.
+
+**Apache Spot (incubating) Open Data Model Strategy**
+-----------------------------------------------------
+
+The Apache Spot (incubating) Open Data Model (ODM) strategy aims to extend
+Apache Spot (incubating) capabilities to support a broader set of cyber security
+use cases than initially supported. The primary use case initially supported by
+Apache Spot (incubating) is Network Traffic Analysis for network flows
+(NetFlow, sFlow, etc.), DNS and Proxy: primarily the identification of threats
+through anomalous event detection using both supervised and unsupervised machine
+learning.
+
+In order to support a broader set of use cases, Spot must be extended to collect
+and analyze other common “event-oriented” data sources analyzed for cyber
+threats, including but not limited to the following log types:
+
+- Proxy
+
+- Web server
+
+- Operating system
+
+- Firewall
+
+- Intrusion Prevention/Detection (IDS/IPS)
+
+- Data Loss Prevention
+
+- Active Directory / Identity Management
+
+- User/Entity Behavior Analysis
+
+- Endpoint Protection/Asset Management
+
+- Network Metadata/Session and PCAP files
+
+- Network Access Control
+
+- Mail
+
+- VPN
+
+- etc.
+
+One of the biggest challenges organizations face today in combating cyber
+threats is collecting and normalizing data from the myriad of security event
+data sources (hundreds) in order to build the needed analytics. This often
+results in the analytics being dependent upon the specific technologies used by
+an organization to detect threats and prevents the needed flexibility and
+agility to keep up with these ever-increasing (and complex) threats. Technology
+lock-in is sometimes a byproduct of today’s status quo, as it’s extremely costly
+to add new technologies (or replace existing ones) because of the downstream
+analytic dependencies.
+
+To achieve the goal of extending Apache Spot (incubating) to support additional
+use cases, it is necessary to create an open data model for the most relevant
+security event and contextual data sources: security event logs or alerts,
+network context, user details, and information that comes from the endpoints or
+from any other console used to manage the security and administration of those
+endpoints. The presence of an open data model, which can be applied
+“on-read” or “on-write”, in batch or stream, will allow for the separation of
+security analytics from the specific data sources on which they are built. This
+“separation of duties” will enable organizations to build analytics that are not
+dependent upon specific technologies, and will provide the flexibility to change
+underlying data sources, and to segment this information, without impacting the
+analytics. This will also afford security vendors the opportunity to build
+additional products on top of the Open Data Model to drive new revenue streams
+and to design new ways to detect threats, including advanced persistent threats
+(APTs).
+
+**Apache Spot (incubating) Enabled Use Cases**
+----------------------------------------------
+
+Spot on the Cloudera platform is uniquely positioned to help address the
+following cyber security use cases, which are not effectively addressed by
+legacy technologies:
+
+**- Detection of known & unknown threats leveraging machine learning and
+advanced analytic modeling**
+
+Current technologies are limited in the analytics they can apply to detect
+threats. These limitations stem from the inability to collect all the data
+sources needed to effectively identify threats (structured, unstructured, etc.)
+and inability to process the massive volumes of data needed to do so (billions
+of events per day). Legacy technologies are typically focused on, and limited
+to, rules-based and signature detection. They are somewhat “effective” at
+detecting known threats but struggle with new threats.
+
+Spot addresses these gaps through its ability to collect any data type of any
+volume. Coupled with the various analytic frameworks that are provided
+(including machine learning), Spot enables a whole new class of analytics that
+can scale to today’s demands. The topic model used by Spot to detect anomalous
+network traffic is one example of where the Spot platform excels.
+
+**- Reduction of mean time to incident detection & resolution (MTTR)**
+
+One of the challenges organizations face today is detecting threats early enough
+to minimize adverse impacts. This stems from the limitations previously
+discussed with regard to limited analytics. It can also be attributed to the
+fact that investigative queries often take hours or days to return
+results. Legacy technologies cannot offer a central data store for
+facilitating such investigations due to their inability to store and serve the
+massive amounts of data involved. This cripples incident investigations and
+results in MTTRs of many weeks or months. Meanwhile, the adverse impacts of the
+breach are magnified, making the threat harder to eradicate.
+
+Apache Spot (incubating) addresses these gaps by providing the capability for a
+central data store that houses ALL the data needed to facilitate an
+investigation, returning investigative query results in seconds and minutes (vs.
+hours and days). Spot can effectively reduce incident MTTR and reduce adverse
+impacts of a breach.
+
+**- Threat Hunting**
+
+It has become necessary for organizations to “hunt” for active threats, as
+traditional passive threat detection approaches are not sufficient. “Hunting”
+involves performing ad-hoc searches and queries over vast amounts of data
+representing many weeks and months’ worth of events, as well as applying ad-hoc
+or tuned algorithms to find the needle in the haystack. Traditional systems do
+not perform well for these types of activities, as the query results sometimes
+take hours or days to be retrieved. These traditional systems also lack the
+analytic flexibility to construct the necessary algorithms and logic.
+
+Apache Spot (incubating) addresses these gaps in the same ways it addresses
+the others: by providing a central data store with the needed analytic
+frameworks that scale to the needed workloads.
+
+**Data Model**
+--------------
+
+In order to provide a framework for effectively analyzing data for cyber
+threats, it is necessary to collect and analyze standard security event
+logs/alerts and contextual data regarding the entities referenced in these
+logs/alerts. The most common entities include network, user and endpoint, but
+there are others such as file.
+
+In the diagram below, the raw event tells us that user “jsmith” successfully
+logged in to an Oracle database from the IP address 10.1.1.3. Based on the raw
+event only, we don’t know if this event is a legitimate threat or not. After
+injecting user and endpoint context, the enriched event tells us this event is a
+potential threat that requires further investigation.
+
+![](https://lh3.googleusercontent.com/-Q8TasmY-vRQ/WHVnoXAK44I/AAAAAAAAAtw/XBDy3PC98k800iaWpNIzAYoQ8S9zc5NBQCLcB/s0/ODMimage1.jpg)
+
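The enrichment step described above can be sketched as a lookup against context data keyed by username and IP address. This is a minimal illustration rather than Spot's implementation: the `enrich` helper, the in-memory context dictionaries, the `src_ip` key, and every context value shown (job title, department, host, criticality) are hypothetical. Only `jsmith` and `10.1.1.3` come from the example.

```python
# Hypothetical in-memory context stores; a real deployment would read these
# from the user and endpoint context models rather than from literals.
USER_CONTEXT = {
    "jsmith": {"title": "Sales Associate", "departm": "Sales"},
}
ENDPOINT_CONTEXT = {
    "10.1.1.3": {"host": "kiosk1.companya.com", "criticality": "Low"},
}


def enrich(event, users, endpoints):
    """Return a copy of the raw event with user and endpoint context attached."""
    enriched = dict(event)
    # Missing context yields an empty dict instead of raising an error.
    enriched["user_context"] = users.get(event.get("user_name"), {})
    enriched["src_context"] = endpoints.get(event.get("src_ip"), {})
    return enriched


# Raw event from the example: a successful database login by "jsmith".
raw_event = {
    "user_name": "jsmith",
    "src_ip": "10.1.1.3",  # dotted form kept here for readability (assumption)
    "app": "Oracle",
    "action": "login",
    "code": "success",
}

enriched_event = enrich(raw_event, USER_CONTEXT, ENDPOINT_CONTEXT)
print(enriched_event["user_context"].get("title"))
print(enriched_event["src_context"].get("criticality"))
```

With context attached, an analytic can ask questions the raw event cannot answer, such as whether this user's role normally requires database access from that endpoint.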
+Based on the need to collect and analyze both security event logs/alerts
+and contextual data, support for the following types of security information is
+planned for inclusion in the Spot Open Data Model:
+
+- Security event logs/alerts: This data type includes event logs from common
+ data sources used to detect threats and includes network flows, operating
+ system logs, IPS/IDS logs, firewall logs, proxy logs, web logs, DLP logs,
+ etc.
+
+- Network context data: This data type includes information about the network,
+ which can be gleaned from Whois servers, asset databases and other similar
+ data sources.
+
+- User context data: This data type includes information from user and identity
+ management systems including Active Directory, Centrify, and other identity
+ and access management systems.
+
+- Endpoint context data: This data includes information about endpoint systems
+ (servers, workstations, routers, switches, etc.) and can be sourced from
+ asset management systems, vulnerability scanners, and endpoint
+ management/detection/response systems such as Webroot, Tanium, Sophos,
+ Endgame, CarbonBlack, Intel Security ePO and others.
+
+- File context data **(ROADMAP ITEM)**: This data includes contextual
+ information about files and can be sourced from systems such as FireEye,
+ Application Control, Intel Security McAfee Threat Intelligence Exchange
+ (TIE).
+
+- Threat intelligence context data **(ROADMAP ITEM)**: This data includes
+ contextual information about URLs, domains, websites, files and others.
+
+**Naming Convention**
+---------------------
+
+A naming convention is needed for the Open Data Model to represent common
+attributes across vendor products and technologies. The naming convention is
+described below.
+
+**Prefixes**
+------------
+
+| Prefix | Description |
+|----------|-----------------------------------------------------------------------------------------------------------------------------------|
+| src      | Corresponds to the “source” fields within a given event (e.g. source address)                                                     |
+| dst      | Corresponds to the “destination” fields within a given event (e.g. destination address)                                           |
+| dvc      | Corresponds to the “device” applicable fields within a given event (e.g. device address) and represent where the event originated |
+| fwd      | Forwarded from device                                                                                                             |
+| request  | Corresponds to requested values (vs. those returned, e.g. “requested URI”)                                                        |
+| response | Corresponds to response value (vs. those requested)                                                                               |
+| file     | Corresponds to the “file” fields within a given event (e.g. file type)                                                            |
+| user     | Corresponds to user attributes (e.g. name, id, etc.)                                                                              |
+| xlate    | Corresponds to translated values within a given event (e.g. src_xlate_ip for “translated source ip address”)                      |
+| in | Ingress |
+| out | Egress |
+| new | New value |
+| orig | Original value |
+| app | Corresponds to values associated with application events |
+
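As an illustration of how the prefixes combine into attribute names, the sketch below renames the fields of a vendor-specific event to ODM-style names. The vendor field names, the mapping table and the `to_odm_names` helper are hypothetical and serve only to show the convention. Routing unmapped fields to `additional_attrs` is likewise an assumption about how the custom-attribute field might be used.

```python
# Hypothetical vendor field names mapped to ODM attribute names built from the
# prefixes in the table above (src, dst, xlate, user).
VENDOR_TO_ODM = {
    "SourceAddress": "src_ip4",
    "SourcePort": "src_port",
    "DestinationAddress": "dst_ip4",
    "DestinationPort": "dst_port",
    "NatSourceAddress": "src_xlate_ip",
    "Account": "user_name",
}


def to_odm_names(vendor_event, mapping):
    """Rename known vendor fields; collect unknown ones under additional_attrs."""
    odm_event = {}
    extras = {}
    for field, value in vendor_event.items():
        if field in mapping:
            odm_event[mapping[field]] = value
        else:
            extras[field] = value
    if extras:
        odm_event["additional_attrs"] = extras
    return odm_event


vendor_event = {
    "SourcePort": 1025,
    "DestinationPort": 443,
    "Account": "jsmith",
    "Building": "729",  # no ODM equivalent in this sketch
}
odm_event = to_odm_names(vendor_event, VENDOR_TO_ODM)
print(odm_event)
```

Analytics written against names such as `src_port` and `user_name` then keep working when the underlying vendor product changes, because only the mapping table needs updating.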
+**Security Event Log/Alert Data Model**
+---------------------------------------
+
+The data model for security event logs/alerts is detailed in the table below.
+The attributes are categorized as follows:
+
+- Common - attributes that are common across many device types
+
+- Device - attributes that are applicable to the device that generated the
+ event
+
+- File - attributes that are applicable to file objects referenced in the event
+
+- Endpoint - attributes that are applicable to the endpoints referenced in the
+ event
+
+- User - attributes that are applicable to the user referenced in the event
+
+- Proxy - attributes that are applicable to proxy events
+
+- Protocol
+
+- DNS - attributes that are specific to DNS events
+
+- HTTP - attributes that are specific to HTTP events
+
+- SMTP, SSH, TLS, DHCP, IRC, SNMP and FTP
+
+Note: The model will evolve to include reserved attributes for additional device
+types that are not currently represented. The model can currently be extended to
+support ANY attribute for ANY device type by following the guidance outlined in
+the section titled [“Extensibility of Data Model”](#extensibility).
+
+Note: Attributes shown in **bold** are listed in the model multiple times to
+demonstrate attribute coverage for a particular entity (endpoint, user, network,
+etc.) or log type (Proxy, DNS, etc.).
+
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+|--------------|---------------------------|-------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------|
+| **Common** | eventtime | long | timestamp of event (UTC) | 1472653952 |
+| | duration | int | Time duration (milliseconds) | 2345 |
+| | eventid | string | Unique identifier for event | x:2388 |
+| | org | string | Organization | “HR” or “Finance” or “CustomerA” |
+| | type | string | Type information | “Informational”, “image/gif” |
+| | nproto | string | Network protocol of event | TCP, UDP, ICMP |
+| | aproto | string | Application protocol of event | HTTP, NFS, FTP |
+| | msg | string | Message (details of action taken on object) | Some long string |
+| | mac | string | MAC address | 94:94:26:3:86:16 |
+| | severity | string | Severity of event | High, 10, 1 |
+| | raw | string | Raw text message of entire event | Complete copy of log entry |
+| | risk | Floating point | Risk score | 95.67 |
+| | code | string | Response or error code | 404 |
+| | category | string | Event category | /Application/Start |
+| | qry | string | Query (DNS query, URI query, SQL query, etc.) | Select \* from "table" |
+| | service | string | (i.e. service name, type of service) | sshd |
+| | state | string | State of object | Running, Paused, stopped |
+| | in_bytes | int | Bytes in | 1025 |
+| | out_bytes | int | Bytes out | 9344 |
+| | additional_attrs | String (JSON Map) | Custom event attributes | "building":"729","cube":"401" |
+| | dvc_time | long | UTC timestamp from device where event/alert originates or is received | 1472653952 |
+|              | dvc_ip4/dvc_ip6           | long              | IP address of device                                                  | Integer representation of 10.1.1.1                                                  |
+|              | dvc_host                  | string            | Hostname of device                                                    | Integer representation of 10.1.1.1                                                  |
+| | dvc_type | string | Device type that generated the log | Unix, Windows, Sonicwall |
+| | dvc_vendor | string | Vendor | Microsoft, Fireeye, Intel Security |
+| | dvc_version | string | Version | 5.4 |
+| | fwd_ip4/fwd_ip6 | long | Forwarded from device | Integer representation of 10.1.1.1 |
+| | version | string | Version | “3.2.2” |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **Network** | src_ip4/src_ip6 | bigint | Source ip address of event | Integer representation of 10.1.1.1 |
+| | src_host | string | Source FQDN of event | test.companyA.com |
+| | src_domain | string | Domain name of source address | companyA.com |
+| | src_port | int | Source port of event | 1025 |
+| | src_country_code | string | Source country code | cn |
+| | src_country_name | string | Source country name | China |
+| | src_region | string | Source region | string |
+|              | src_city                  | string            | Source city                                                           | Shanghai                                                                            |
+| | src_lat | int | Source latitude | |
+| | src_long | int | Source longitude | |
+|              | dst_ip4/dst_ip6           | bigint            | Destination ip address of event                                       | Integer representation of 10.1.1.1                                                  |
+| | dst_host | string | Destination FQDN of event | test.companyA.com |
+| | dst_domain | string | Domain name of destination address | companyA.com |
+| | dst_port | int | Destination port of event | 80 |
+|              | dst_country_code          | string            | Destination country code                                              | cn                                                                                  |
+|              | dst_country_name          | string            | Destination country name                                              | China                                                                               |
+|              | dst_region                | string            | Destination region                                                    | string                                                                              |
+|              | dst_city                  | string            | Destination city                                                      | Shanghai                                                                            |
+|              | dst_lat                   | int               | Destination latitude                                                  |                                                                                     |
+|              | dst_long                  | int               | Destination longitude                                                 |                                                                                     |
+| | asn | int | Autonomous system number | 33 |
+| | **in_bytes** | int | Bytes in | 987 |
+| | **out_bytes** | int | Bytes out | 1222 |
+| | direction | string | Direction | In, inbound, outbound, ingress, egress |
+| | flags | string | TCP flags | .AP.SF |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **File** | file_name | string | Filename from event | output.csv |
+| | file_path | string | File path | /root/output.csv |
+| | file_atime | bigint | Timestamp (UTC) of file access | 1472653952 |
+| | file_acls | string | File permissions | rwx-rwx-rwx |
+| | file_type | string | Type of file | “.doc” |
+| | file_size | int | Size of file in bytes | 1244 |
+| | file_desc | string | Description of file | Project Plan for Project xyz |
+| | file_hash | string | Hash of file | |
+|              | file_hash_type            | string            | Type of hash                                                          | MD5, SHA1, SHA256                                                                   |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **Endpoint** | object | string | File/Process/Registry | File, Registry, Process |
+| | action | string | Action taken on object (open/delete/edit) | Open, Edit |
+| | **msg** | string | Message (details of action taken on object) | Some long string |
+| | app | string | Application | Microsoft Powerpoint |
+| | location | string | Location | Atlanta, GA |
+| | proc | string | Process | SSHD |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **User** | user_name | string | username from event | mhicks |
+| | email | string | Email address | test\@companyA.com |
+| | user_id | string | userid | 234456 |
+| | user_loc | string | location | Herndon, VA |
+| | user_desc | string | Description of user | |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **DNS** | dns_class | string | DNS class | 1 |
+| | dns_length | int | DNS frame length | 188 |
+| | **dns_qry** | string | Requested DNS query | test.test.com |
+| | **dns_code** | string | Response code | 0x00000001 |
+| | dns_response_qry | string | Response to DNS Query | 178.2.1.99 |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **Proxy** | **category** | string | Event category | SG-HTTP-SERVICE |
+| | browser | string | Web browser | Internet Explorer |
+| | **code** | string | Error or response code | 404 |
+| | **in_bytes** | int | Bytes in | 1025 |
+| | **out_bytes** | int | Bytes out | 1288 |
+| | referrer | string | Referrer | www.usatoday.com |
+| | **request_uri** | string | Requested URI | /wcm/assets/images/imagefileicon.gif |
+| | filter_rule | string | Applied filter or rule | Internet, Rule 6 |
+| | filter_result | string | Result of applied filter or rule | Proxied, Blocked |
+| | **qry** | string | URI query | ?func=S_senseHTML&Page=a26815a313504697a126279 |
+| | **action** | string | Action taken on object | TCP_HIT, TCP_MISS, TCP_TUNNELED |
+| | method | string | HTTP method | GET, CONNECT, POST |
+| | **type** | string | Type of request | image/gif |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **HTTP** | request_method | string | HTTP method | GET, CONNECT, POST |
+| | **request_uri** | string | Requested URI | /wcm/assets/images/imagefileicon.gif |
+| | request_body_len | int | Length of request body | 98 |
+| | request_user_name | string | username from event | mhicks |
+| | request_password | string | Password from event | abc123 |
+| | request_proxied | string | | |
+| | request_headers | MAP | HTTP request headers | request_headers[‘HOST’] request_headers[‘USER-AGENT’] request_headers[‘ACCEPT’] |
+| | response_status_code | int | HTTP response status code | 404 |
+| | response_status_msg | string | HTTP response status message | “Not found” |
+| | response_body_len | int | Length of response body | 98 |
+| | response_info_code | int | HTTP response info code | 100 |
+| | response_info_msg | string | HTTP response info message | “Some string” |
+| | response_resp_fuids | string | Response FUIDS | |
+| | response_mime_types | string | Mime types | “cgi,bat,exe” |
+|              | response_headers          | MAP               | Response headers                                                      | response_headers[‘SERVER’] response_headers[‘SET-COOKIE’] response_headers[‘DATE’]  |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **SMTP** | trans_depth | int | Depth of email into SMTP exchange | Coming soon |
+| | headers_helo | string | Helo header | Coming soon |
+| | headers_mailfrom | string | Mailfrom header | Coming soon |
+| | headers_rcptto | string | Rcptto header | Coming soon |
+| | headers_date | string | Header date | Coming soon |
+| | headers_from | string | From header | Coming soon |
+| | headers_to | string | To header | Coming soon |
+| | headers_reply_to | string | Reply to header | Coming soon |
+| | headers_msg_id | string | Message ID | Coming soon |
+| | headers_in_reply_to | string | In reply to header | Coming soon |
+| | headers_subject | string | Subject | Coming soon |
+| | headers_x_originating_ip4 | bigint | Originating IP address | Coming soon |
+| | headers_first_received | string | First to receive message | Coming soon |
+| | headers_second_received | string | Second to receive message | Coming soon |
+| | last_reply | string | Last reply in message chain | Coming soon |
+| | path | string | Path of message | Coming soon |
+| | user_agent | string | User agent | Coming soon |
+| | tls | boolean | Indication of TLS use | Coming soon |
+| | is_webmail | boolean | Indication of webmail | Coming soon |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **FTP** | **user_name** | string | Username | Coming soon |
+| | password | string | Password | Coming soon |
+| | command | string | FTP command | Coming soon |
+| | arg | string | Argument | Coming soon |
+| | mime_type | string | Mime type | Coming soon |
+| | file_size | int | File size | Coming soon |
+| | reply_code | int | Reply code | Coming soon |
+| | reply_msg | string | Reply message | Coming soon |
+| | data_channel_passive | boolean | Passive data channel? | Coming soon |
+| | data_channel_rsp_p | string | | Coming soon |
+| | cwd | string | Current working directory | Coming soon |
+| | cmdarg_ts | float | | Coming soon |
+| | cmdarg_cmd | string | Command | Coming soon |
+| | cmdarg_arg | string | Command argument | Coming soon |
+| | cmdarg_seq | int | Sequence | Coming soon |
+| | pending_commands | string | Pending commands | Coming soon |
+| | is_passive | boolean | Passive mode enabled | Coming soon |
+| | fuid | string | Coming soon | Coming soon |
+| | last_auth_requested | string | Coming soon | Coming soon |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **SNMP** | **version** | string | Coming soon | Coming soon |
+| | community | string | Coming soon | Coming soon |
+| | get_requests | int | Coming soon | Coming soon |
+| | get_bulk_requests | int | Coming soon | Coming soon |
+| | get_responses | int | Coming soon | Coming soon |
+| | set_requests | int | Coming soon | Coming soon |
+| | display_string | string | Coming soon | Coming soon |
+| | up_since | float | Coming soon | Coming soon |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **TLS** | **version** | string | Coming soon | Coming soon |
+| | cipher | string | Coming soon | Coming soon |
+| | curve | string | Coming soon | Coming soon |
+| | server_name | string | Coming soon | Coming soon |
+| | resumed | boolean | Coming soon | Coming soon |
+| | next_protocol | string | Coming soon | Coming soon |
+| | established | boolean | Coming soon | Coming soon |
+| | cert_chain_fuids | string | Coming soon | Coming soon |
+| | client_cert_chain_fuids | string | Coming soon | Coming soon |
+| | subject | string | Coming soon | Coming soon |
+| | issuer | string | Coming soon | Coming soon |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **SSH** | **version** | string | Coming soon | Coming soon |
+| | auth_success | boolean | Coming soon | Coming soon |
+| | client | string | Coming soon | Coming soon |
+| | server | string | Coming soon | Coming soon |
+| | cipher_algorithm | string | Coming soon | Coming soon |
+| | mac_algorithm | string | Coming soon | Coming soon |
+| | compression_algorithm | string | Coming soon | Coming soon |
+| | key_exchange_algorithm | string | Coming soon | Coming soon |
+| | host_key_algorithm | string | Coming soon | Coming soon |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **DHCP** | assigned_ip4 | bigint | Coming soon | Coming soon |
+| | mac | string | Coming soon | Coming soon |
+| | lease_time | double | Coming soon | Coming soon |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **IRC** | user | string | Coming soon | Coming soon |
+| | nickname | string | Coming soon | Coming soon |
+| | command | string | Coming soon | Coming soon |
+| | value | string | Coming soon | Coming soon |
+| | additional_data | string | Coming soon | Coming soon |
+| **Category** | **Attribute** | **Data Type** | **Description** | **Sample Values** |
+| **Flow** | in_packets | int | Coming soon | Coming soon |
+| | out_packets | int | Coming soon | Coming soon |
+| | **in_bytes** | int | Coming soon | Coming soon |
+| | **out_bytes** | int | Coming soon | Coming soon |
+| | conn_state | string | Coming soon | Coming soon |
+| | history | string | Coming soon | Coming soon |
+| | duration | float | Coming soon | Coming soon |
+| | src_os | string | Coming soon | Coming soon |
+| | dst_os | string | Coming soon | Coming soon |
+
+Note: It is not necessary to populate all of the attributes within the model.
+For attributes not populated in a single security event log/alert, contextual
+data may not be available. For example, the sample event below can be enriched
+with contextual data about the referenced endpoints (10.1.1.1 and
+192.168.10.10), but not a user, because username is not populated.
+
+> **date,time,source_ip,source_port,protocol,destination_ip,destination_port,bytes
+> 12/12/2015,23:14:56,10.1.1.1,1025,tcp,192.168.10.10,443,1183**
+
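To make the note above concrete, the following sketch parses the sample event into a handful of ODM attributes. It is an illustration under stated assumptions, not part of Spot: the date is read as MM/DD/YYYY, the time is taken to be UTC, the single `bytes` column is treated as `in_bytes`, and the helper names `ip4_to_int` and `to_odm` are hypothetical. IP addresses are converted to the integer representation used in the model's sample values.

```python
import csv
import io
import ipaddress
from datetime import datetime, timezone

# The sample event from the text, including its header row.
SAMPLE = (
    "date,time,source_ip,source_port,protocol,destination_ip,destination_port,bytes\n"
    "12/12/2015,23:14:56,10.1.1.1,1025,tcp,192.168.10.10,443,1183\n"
)


def ip4_to_int(address):
    """Convert a dotted IPv4 address to its integer representation."""
    return int(ipaddress.IPv4Address(address))


def to_odm(row):
    """Map one parsed CSV row to ODM attribute names and types."""
    # Assumption: MM/DD/YYYY dates and UTC times.
    ts = datetime.strptime(row["date"] + " " + row["time"], "%m/%d/%Y %H:%M:%S")
    ts = ts.replace(tzinfo=timezone.utc)
    return {
        "eventtime": int(ts.timestamp()),
        "src_ip4": ip4_to_int(row["source_ip"]),
        "src_port": int(row["source_port"]),
        "nproto": row["protocol"].upper(),
        "dst_ip4": ip4_to_int(row["destination_ip"]),
        "dst_port": int(row["destination_port"]),
        # Assumption: the single "bytes" column is treated as inbound bytes.
        "in_bytes": int(row["bytes"]),
    }


records = [to_odm(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(records[0])
```

The resulting record has no `user_name` attribute, so it can be joined against endpoint context by `src_ip4` and `dst_ip4` but not against user context.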
+**Context Models**
+==================
+
+The recommended approach for populating the context models (user, endpoint,
+network, etc.) involves consuming information from the systems most capable of
+providing the needed context. Populating the user context model is best
+accomplished by leveraging user/identity management systems such as Active
+Directory or Centrify and populating the model with details such as the user’s
+full name, job title, phone number, manager’s name, physical address,
+entitlements, etc. Similarly, an endpoint model can be populated by consuming
+information from endpoint/asset management systems (Tanium, Webroot, etc.),
+which provide information such as the services running on the system, system
+owner, business context, etc.
+
+**User Context Model**
+----------------------
+
+The data model for user context information is as follows:
+
+| **Attribute** | **Data Type** | **Description** | **Sample Values** |
+|------------------|------------------------------------------------------|--------------------------------------------------------------|-------------------------------------|
+| dvc_time | bigint | Timestamp from when the user context information is obtained | 1472653952 |
+| created | bigint | Timestamp from when user was created | 1472653952 |
+| changed          | bigint                                               | Timestamp from when user was updated                         | 1472653952                          |
+| lastlogon | bigint | Timestamp from when user last logged on | 1472653952 |
+| logoncount | int | Number of times account has logged on | 232 |
+| lastreset        | bigint                                               | Timestamp from when user last reset password                 | 1472653952                          |
+| expiration | bigint | Date/time when user expires | 1472653952 |
+| userid | string | Unique user id | 1234 |
+| username | string | Username in event log/alert | jsmith |
+| name_first | string | First name | John |
+| name_middle | string | Middle name | Henry |
+| name_last | string | Last name | Smith |
+| name_mgr | string | Manager’s name | Ronald Reagan |
+| phone | string | Phone number | 703-555-1212 |
+| email | string | Email address | jsmith\@company.com |
+| code | string | Job code | 3455 |
+| loc | string | Location | US |
+| departm | string | Department | IT |
+| dn | | Distinguished name | "CN=scm-admin-mej-test2-adk,OU=app- |
+| ou | string | Organizational unit | EAST |
+| empid | string | Employee ID | 12345 |
+| title | string | Job Title | Director of IT |
+| groups | string (comma separated list, no spaces after comma) | Groups to which the user belongs | “Domain Admins”, “Domain Users” |
+| dvc_type | string | Device type that generated the user context data | Active Directory |
+| dvc_vendor | string | Vendor | Microsoft |
+| dvc_version | string | Version | 8.1.2 |
+| additional_attrs | string | Additional attributes of user | Key value pairs |
+
+**Endpoint Context Model**
+--------------------------
+
+The data model for endpoint context information is as follows:
+
+| **Abbreviation** | **Data Type** | **Description** | **Sample Values** |
+|------------------|--------------------------------------------------|------------------------------------------------------------------|------------------------------------------------------|
+| dvc_time | bigint | Timestamp from when the endpoint context information is obtained | 1472653952 |
+| ip4              | bigint                                           | IP address of endpoint                                           | Integer representation of 10.1.1.1                   |
+| ip6              | bigint                                           | IP address of endpoint                                           | Integer representation of 10.1.1.1                   |
+| os | string | Operating system | Redhat Linux 6.5.1 |
+| os_version | string | Version of OS | 5.4 |
+| os_sp | string | Service pack | SP 2.3.4.55 |
+| tz | string | timezone | EST |
+| hotfixes | string | Applied hotfixes | 993.2 |
+| disks | string | Available disks | \\Device\\HarddiskVolume1, \\Device\\HarddiskVolume2 |
+| removables | string | Removable media devices | USB Key |
+| nics | string | Network interfaces | fe10::28f4:1a47:658b:d6e8, fe82::28f4:1a47:658b:d6e8 |
+| drivers | string | Installed kernel drivers | ntoskrnl.exe, hal.dll |
+| users | string | Local user accounts | administrator, jsmith |
+| host | string | Hostname of endpoint | tes1.companya.com |
+| mac | string | MAC address of endpoint | fe10::28f4:1a47:658b:d6e8 |
+| owner | string | Endpoint owner (name) | John Smith |
+| vulns | string (comma separated, no spaces after commas) | Vulnerability identifiers (CVE identifier) | CVE-123, CVE-456 |
+| loc | string | Location | US |
+| departm | string | Department name | IT |
+| company | string | Company name | CompanyA |
+| regs | string (comma-separated) | Applicable regulations | HIPAA, SOX |
+| svcs | string (comma-separated) | Services running on system | Cisco Systems, Inc. VPN Service, Adobe LM Service |
+| procs | string | Processes | svchost.exe, sppsvc.exe |
+| criticality | string | Criticality of device | Very High |
+| apps | string (comma-separated) | Applications running on system | Microsoft Word, Chrome |
+| desc | string | Endpoint descriptor | Some string |
+| dvc_type | string | Device type that generated the log | Microsoft Windows 7 |
+| dvc_vendor | string | Vendor | Endgame |
+| dvc_version | string | Version | 2.1 |
+| architecture | string | CPU architecture | x86 |
+| uuid | string | Universally unique identifier | a59ba71e-18b0-f762-2f02-0deaf95076c6 |
+| memtotal | int | Total memory (bytes) | 844564433 |
+| additional_attrs | string | Additional attributes | Key value pairs |
+
+**VPN Context Model**
+---------------------
+
+The data model for VPN context information is based on the VPN logs as follows:
+
+| **Abbreviation** | **Data Type** | **Description** | **Sample Values** |
+|------------------|-------------------------|----------------------------------------------------------------------------|------------------------------------------------------|
+| dvc_time | bigint | Timestamp from when the endpoint context information is obtained | 1472653952 |
+| ip4 | bigint | IP address of VPN box | Integer representation of 10.1.1.1 |
+| ip6 | bigint | IP address of VPN box | Integer representation of 10.1.1.1 |
+| vpn_vendor | string | Vendor VPN | Cisco |
+| vpn_version | string | Version VPN | 3.0 |
+| vpn_sp | string | VPN Service pack | 5 |
+| tz | string | VPN timezone | EST |
+| vpn_hotfixes | string | VPN Applied hotfixes | 1134 |
+| vpn_nics | string | Network interfaces | fe10::28f4:1a47:658b:d6e8, fe82::28f4:1a47:658b:d6e8 |
+| vpn_host | string | VPN Country Code | MX |
+| vpn_country_name | string | VPN Country Name | Mexico |
+| vpn_ip | string | | Integer representation of 10.1.1.2 |
+| vpn_encrypt | string | VPN encryption protocol | IPSEC |
+| vpn_username | string | VPN user account | jsmith |
+| vpn_user_ip | string | VPN User IP address | Integer representation of 10.1.1.2 |
+| vpn_user_cc | string | VPN Country Code | US |
+| vpn_user_cn | string | VPN Country Name | United States |
+| vpn_user_auth | string | VPN user authorization / role | Admin, normal user, etc |
+| vpn_account_vip | string | Criticality of the VPN account | Medium, High |
+| vpn_uuid | string | Universally unique identifier | a59ba71e-18b0-f762-2f02-0deaf95076c6 |
+| uuids | string | Universally unique identifier(s) taken from the endpoint context, if matched | a59ba71e-18b0-f762-2f02-0deaf95xmexzA |
+| additional_attrs | string | Additional attributes | Key value pairs |
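+
+Several sample values above are given as the "integer representation" of an IP
+address. A minimal sketch of that conversion, using Python's standard
+`ipaddress` module, is shown below. The helper names `ip_to_int` and
+`int_to_ip4` are illustrative and not part of the ODM.
+
```python
import ipaddress


def ip_to_int(address: str) -> int:
    """Return the integer form of an IPv4 or IPv6 address string."""
    return int(ipaddress.ip_address(address))


def int_to_ip4(value: int) -> str:
    """Return the dotted-quad form of an IPv4 address stored as an integer."""
    return str(ipaddress.IPv4Address(value))


print(ip_to_int("10.1.1.1"))   # 167837953
print(int_to_ip4(167837953))   # 10.1.1.1
```
+
+Note that an IPv6 address is 128 bits wide, so its integer form does not
+generally fit in a 64-bit bigint column.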
+
+**Network Context Model**
+-------------------------
+
+The data model for network context information is based on “whois” information
+as follows:
+
+| **Attribute** | **Data Type** | **Description** | **Sample Values** |
+|----------------------------------------|---------------|----------------------------------------|-------------------|
+| domain_name | string | Domain name | |
+| registry_domain_id | string | Registry Domain ID | |
+| registrar_whois_server | string | Registrar WHOIS Server | |
+| registrar_url | string | Registrar URL | |
+| update_date | bigint | UTC timestamp | |
+| creation_date | bigint | Creation Date | |
+| registrar_registration_expiration_date | bigint | Registrar Registration Expiration Date | |
+| registrar | string | Registrar | |
+| registrar_iana_id | string | Registrar IANA ID | |
+| registrar_abuse_contact_email | string | Registrar Abuse Contact Email | |
+| registrar_abuse_contact_phone | string | Registrar Abuse Contact Phone | |
+| domain_status | string | Domain Status | |
+| registry_registrant_id | string | Registry Registrant ID | |
+| registrant_name | string | Registrant Name | |
+| registrant_organization | string | Registrant Organization | |
+| registrant_street | string | Registrant Street | |
+| registrant_city | string | Registrant City | |
+| registrant_state_province | string | Registrant State/Province | |
+| registrant_postal_code | string | Registrant Postal Code | |
+| registrant_country | string | Registrant Country | |
+| registrant_phone | string | Registrant Phone | |
+| registrant_email | string | Registrant Email | |
+| registry_admin_id | string | Registry Admin ID | |
+| name_server | string | Name Server | |
+| dnssec | string | DNSSEC | |
+
+### **Extensibility of Data Model**
+
+The data model can be extended to accommodate custom attributes by embedding
+key-value pairs within the log/alert/context entries. Each model supports an
+additional attribute named additional_attrs whose value is a JSON string. This
+JSON string contains a Map (and only a Map) of additional attributes that
+cannot be expressed in the specified model description. Regardless of their
+original type, these additional attributes are always interpreted as String;
+it is up to the user to translate them to appropriate types, if necessary, in
+the analytics layer. It is also the user’s responsibility to populate
+additional_attrs as a Map, presumably by parsing these attributes out of the
+original message. For example, if a user wanted to extend the user context
+model to include string attributes for “Desk Location” and “City”, the
+following string would be set for additional_attrs:
+
+| **Attribute Key** | **Attribute Value** |
+|-------------------|-------------------------------------------------|
+| additional_attrs | {"dsk_location":"B3-F2-W3", "city":"Palo Alto"} |
+
+The same approach applies to the endpoint context model, the security event
+log/alert model, and other entities.
+
+**Note:** This [UDF library](https://github.com/klout/brickhouse) can be used
+for converting to/from JSON.
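+
+As a rough illustration of the rules above (a single JSON Map, with every
+value treated as a String), the following Python sketch reads an
+additional_attrs value. The function name `parse_additional_attrs` is
+hypothetical; only the attribute name and the sample string come from this
+document.
+
```python
import json


def parse_additional_attrs(raw):
    """Parse an additional_attrs JSON string into a dict of string values."""
    if not raw:
        return {}
    parsed = json.loads(raw)
    # The model allows a Map and only a Map at the top level.
    if not isinstance(parsed, dict):
        raise ValueError("additional_attrs must be a JSON object (Map)")
    result = {}
    for key, value in parsed.items():
        # Values are always interpreted as strings; non-string values are
        # kept as their JSON text so the analytics layer can convert them.
        result[str(key)] = value if isinstance(value, str) else json.dumps(value)
    return result


attrs = parse_additional_attrs('{"dsk_location":"B3-F2-W3", "city":"Palo Alto"}')
print(attrs["dsk_location"])  # B3-F2-W3
```
+
+Rejecting non-Map input early keeps malformed entries from reaching the
+analytics layer, where type translation is left to the user.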
+
+**Model Relationships**
+-----------------------
+
+The relationships between the data model entities are illustrated below.
+
+![enter image description here](https://lh3.googleusercontent.com/-SxEubiTPzFE/WHVo0uxgJtI/AAAAAAAAAt8/3su9v3h0MsovJ0Mhy08EbuFTvRvKEoIwQCLcB/s0/ODMimage2.jpg)
+
+**Data Ingestion Framework**
+----------------------------
+
+One of the challenges in populating the data model is the large number of
+products and technologies that organizations currently use to manage security
+event logs/alerts, user and endpoint information. Dozens of vendors in each
+category offer technologies that could be used to populate the model. The labor
+required to transform the data and map the attributes to the data model is
+extensive when you consider how many technologies are in the mix at each
+organization (and across organizations).
+
+One way to address this challenge is with a Data Ingestion Framework that
+provides a configuration-based mechanism to perform the transformations and
+mappings. A configuration-based capability allows ingest pipelines to become
+portable and reusable across the community. For example, an ingest pipeline
+built for Centrify to populate the user context model can be shared with other
+Centrify users, who can immediately realize the benefit. Such a framework could
+allow the community to quickly build the necessary pipelines for the dozens (or
+hundreds) of technologies being used in the market. Without a standard ingest
+framework, each pipeline is built independently, requiring more labor and
+providing no standardization and little portability.
+
+It is also important that the data ingestion framework be able to both capture
+the “raw” event and create a meta event that represents the normalized event
+and maps the attributes to the defined data model. This ensures that both
+stream and batch processing use cases are supported.
+
+StreamSets is an ingest framework that provides the functionality outlined
+above. Sample StreamSets ingest pipelines for populating the ODM with common
+data sources will be published to the Spot GitHub repo.
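+
+Independently of any particular tool, the idea of a configuration-based mapping
+that keeps the raw event alongside a normalized meta event can be sketched in a
+few lines of Python. This is not StreamSets configuration. The source field
+names (`timestamp`, `source_address`, `user`) and the `MAPPING` structure are
+hypothetical; only the target attribute names come from the ODM.
+
```python
import json

# Hypothetical mapping: vendor field name -> (ODM attribute, converter).
MAPPING = {
    "timestamp": ("eventtime", int),
    "source_address": ("src_ip4", str),
    "user": ("username", str),
}


def to_meta_event(raw_line, mapping):
    """Return the raw event together with its normalized meta event."""
    source = json.loads(raw_line)
    meta = {}
    extras = {}
    for key, value in source.items():
        if key in mapping:
            odm_name, convert = mapping[key]
            meta[odm_name] = convert(value)
        else:
            # Unmapped fields are preserved as strings in additional_attrs.
            extras[key] = str(value)
    if extras:
        meta["additional_attrs"] = json.dumps(extras, sort_keys=True)
    return {"raw": raw_line, "meta": meta}


line = '{"timestamp": "1469562994", "user": "mhicks", "desk": "B3"}'
print(to_meta_event(line, MAPPING)["meta"])
```
+
+Because the mapping is plain data, supporting another product means supplying
+a different mapping rather than writing a new pipeline.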
+
+**Data Formats**
+----------------
+
+**Avro**
+--------
+
+Avro is the recommended data format due to its schema representation,
+compatibility checks, and interoperability with Hadoop. Avro supports a pure
+JSON representation for readability and ease of use but also a binary
+representation of the data for efficient storage. Avro is the optimal format for
+streaming-based analytic use cases. A sample event and corresponding schema
+representation are detailed below.
+
+**Event**
+
+{
+
+"eventtime":1469562994,
+
+"src_ip4":"192.168.1.1",
+
+"src_host":"test1.cloudera.com",
+
+"src_port":1029,
+
+"dst_ip4":"192.168.21.22",
+
+"dst_host":"test3.companyA.com",
+
+"dst_port":443,
+
+"dvc_type":"sshd",
+
+"category":"auth",
+
+"aproto":"sshd",
+
+"msg":"user:mhicks successfully logged in to test3.companyA.com from 192.168.1.1",
+
+"username":"mhicks",
+
+"severity":3
+
+}
+
+
+
+**Schema**
+
+{
+
+"type": "record",
+
+"doc":"This event records SSHD activity",
+
+"name": "auth",
+
+"fields":[
+
+{"name":"eventtime", "type":"long", "doc":"Stop time of event"},
+
+{"name":"src_ip4", "type":"long", "doc":"Source IP Address"},
+
+{"name":"src_host", "type":"string", "doc":"Source hostname"},
+
+{"name":"src_port", "type":"int", "doc":"Source port"},
+
+{"name":"dst_ip4", "type":"long", "doc":"Destination IP Address"},
+
+{"name":"dst_host", "type":"string", "doc":"Destination hostname"},
+
+{"name":"dst_port", "type":"int", "doc":"Destination port"},
+
+{"name":"dvc_type", "type":"string", "doc":"Source device type"},
+
+{"name":"category", "type":"string", "doc":"category/type of event message"},
+
+{"name":"aproto", "type":"string", "doc":"Application or network protocol"},
+
+{"name":"msg", "type":"string", "doc":"event message"},
+
+{"name":"username", "type":"string", "doc":"username"},
+
+{"name":"severity", "type":"int", "doc":"severity of event on scale of 1-10"}
+
+]
+
+}
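+
+To illustrate how the schema constrains an event, the sketch below checks a few
+of the fields against their declared primitive types using only the Python
+standard library. It is not a substitute for an Avro library, and the names
+`FIELDS`, `AVRO_TO_PYTHON` and `type_errors` are illustrative. It also shows
+that an IP address written as a dotted-quad string, as in the sample event,
+does not satisfy a field declared as long.
+
```python
# A subset of the schema fields, as (name, Avro primitive type).
FIELDS = [
    ("eventtime", "long"),
    ("src_ip4", "long"),
    ("dst_host", "string"),
    ("dst_port", "int"),
    ("severity", "int"),
]

# Simplified mapping from Avro primitive types to Python types.
AVRO_TO_PYTHON = {"long": int, "int": int, "string": str}


def type_errors(event, fields):
    """Return a list of human-readable type problems found in the event."""
    errors = []
    for name, avro_type in fields:
        if name not in event:
            errors.append(f"{name}: missing")
            continue
        value = event[name]
        expected = AVRO_TO_PYTHON[avro_type]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{name}: expected {avro_type}, got {type(value).__name__}"
            )
    return errors


event = {
    "eventtime": 1469562994,
    "src_ip4": "192.168.1.1",
    "dst_host": "test3.companyA.com",
    "dst_port": 443,
    "severity": 3,
}
print(type_errors(event, FIELDS))  # ['src_ip4: expected long, got str']

event["src_ip4"] = 3232235777  # integer form of 192.168.1.1
print(type_errors(event, FIELDS))  # []
```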
+
+
+
+**JSON**
+--------
+
+JSON is commonly used as a data-interchange format due to its ease of use and
+familiarity within the development community. The corresponding JSON object for
+the sample event described previously is noted below.
+
+{
+
+"eventtime":1469562994,
+
+"src_ip4":"192.168.1.1",
+
+"src_host":"test1.cloudera.com",
+
+"src_port":1029,
+
+"dst_ip4":"192.168.21.22",
+
+"dst_host":"test3.companyA.com",
+
+"dst_port":443,
+
+"aproto":"sshd",
+
+"msg":"user:mhicks successfully logged in to test3.companyA.com from 192.168.1.1",
+
+"username":"mhicks"
+
+}
+
+**Parquet**
+-----------
+
+Parquet is a columnar storage format that offers the benefits of compression and
+efficient columnar data representation and is optimal for batch analytic use
+cases. See https://parquet.apache.org/documentation/latest/ for more
+information on Parquet. Conversion from Avro to Parquet is supported, which
+allows data collected and analyzed for stream-based use cases to be easily
+converted to Parquet for longer-term batch analytics.
+
+**Example - Advanced Threat Modeling**
+--------------------------------------
+
+In this example, the ODM is leveraged to build an “event” table for a threat
+model that uses attributes native to the ODM and derived attributes, which are
+calculations based on the aggregate data stored in the model. In this context,
+an “event” table is defined by the attributes to be evaluated for predictive
+power in identifying threats and the actual attribute values (i.e., rows in the
+table). In the example below, the event table is composed of the following
+attributes, which are then leveraged to identify threats via a Risk Score
+analytic model:
+
+- “src_ipv4” - This attribute is native to the security event log component of
+ the ODM and represents the source IP address of the corresponding table row
+
+- “os” - This attribute is native to the endpoint context component of the ODM
+ and represents the operating system of the endpoint system in the table row
+
+- SUM (in_bytes + out_bytes) for the last 7 days - “in_bytes” and “out_bytes”
+ are native to the security event log component of the ODM. This derived
+ attribute represents a summation of bytes between the source address and
+ destination domain over the last 7 days
+
+- “dst_domain” - This attribute is native to the security event log component
+ of the ODM and represents the destination domain
+
+- Days since “creation_date” - “creation_date” is native to the network
+ context component of the ODM and represents the date the referenced domain
+ was registered. This derived attribute calculates the days since the domain
+ was created/registered.
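+
+The two derived attributes in the list above can be sketched in Python as
+follows. This is only an illustration under stated assumptions: events are
+plain dictionaries, `eventtime` and `creation_date` are UTC epoch seconds, and
+the function names `days_since` and `bytes_last_7_days` are hypothetical.
+
```python
from datetime import datetime, timedelta, timezone


def days_since(creation_date, now):
    """Whole days between a UTC epoch-seconds creation_date and now."""
    created = datetime.fromtimestamp(creation_date, tz=timezone.utc)
    return (now - created).days


def bytes_last_7_days(events, src_ipv4, dst_domain, now):
    """Sum in_bytes + out_bytes for one source/domain pair over 7 days."""
    cutoff = (now - timedelta(days=7)).timestamp()
    return sum(
        e["in_bytes"] + e["out_bytes"]
        for e in events
        if e["src_ipv4"] == src_ipv4
        and e["dst_domain"] == dst_domain
        and e["eventtime"] >= cutoff
    )


now = datetime(2017, 9, 27, tzinfo=timezone.utc)
events = [
    # Inside the 7-day window: counted.
    {"src_ipv4": "10.1.1.1", "dst_domain": "example.com",
     "in_bytes": 100, "out_bytes": 50,
     "eventtime": (now - timedelta(days=1)).timestamp()},
    # Outside the 7-day window: ignored.
    {"src_ipv4": "10.1.1.1", "dst_domain": "example.com",
     "in_bytes": 200, "out_bytes": 25,
     "eventtime": (now - timedelta(days=8)).timestamp()},
]
created = datetime(2017, 9, 17, tzinfo=timezone.utc).timestamp()

print(bytes_last_7_days(events, "10.1.1.1", "example.com", now))  # 150
print(days_since(created, now))  # 10
```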
+
+| **src_ipv4** | **OS** | **dst domain** | **Days since “creation_date”** | **SUM (in_bytes + out_bytes)** | **Risk Score (1-100)** |
+|--------------|-----------|----------------|------------------------
<TRUNCATED>
[3/3] incubator-spot git commit: reinstating ODM document
Posted by na...@apache.org.
reinstating ODM document
Project: http://git-wip-us.apache.org/repos/asf/incubator-spot/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spot/commit/392e2a90
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spot/tree/392e2a90
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spot/diff/392e2a90
Branch: refs/heads/SPOT-181_ODM
Commit: 392e2a903df456d64b4a572da5c88039b464e9d5
Parents: 34df9ad
Author: natedogs911 <na...@gmail.com>
Authored: Wed Sep 27 13:48:26 2017 -0700
Committer: natedogs911 <na...@gmail.com>
Committed: Wed Sep 27 13:48:26 2017 -0700
----------------------------------------------------------------------
docs/open-data-model.md | 892 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 892 insertions(+)
----------------------------------------------------------------------