Proceedings Volume 6570

Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2007

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 9 April 2007
Contents: 6 Sessions, 23 Papers, 0 Presentations
Conference: Defense and Security Symposium 2007
Volume Number: 6570

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 6570
  • Intrusion/Intruder Detection
  • Data Mining
  • Applications
  • Miscellaneous Topics
  • Poster Session
Front Matter: Volume 6570
Front Matter: Volume 6570
This PDF file contains the front matter associated with SPIE Proceedings Volume 6570, including the Title Page, Copyright information, Table of Contents, Introduction, and the Conference Committee listing.
Intrusion/Intruder Detection
Bot armies as threats to network security
Sheila B. Banks, Martin R. Stytz
"Botnets", or "bot armies", are large groups of remotely controlled malicious software. Bot armies pose one of the most serious security threats to all networks. Botnets, remotely controlled and operated by botmasters or botherders, can launch massive denial of service attacks, multiple penetration attacks, or any other malicious network activity on a massive scale. While bot army activity has, in the past, been limited to fraud, blackmail, and other forms of criminal activity, their potential for causing large-scale damage to the entire Internet; for launching large-scale, coordinated attacks on government computers and networks; and for large-scale, coordinated data gathering from thousands of users and computers on any network has been underestimated. This paper does not discuss how to build bots but rather the threats they pose. In a "botnet" or "bot army", computers can be used to spread spam, launch denial-of-service attacks against Web sites, conduct fraudulent activities, and prevent authorized network traffic from traversing the network. In this paper we discuss botnets and the technologies that underlie this threat to network and computer security. The first section motivates the need for improved protection against botnets and their technologies, and for further research about botnets. The second section contains background information about bot armies and their key underlying technologies. The third section presents a discussion of the types of attacks that botnets can conduct and potential defenses against them. The fourth section contains a summary and suggestions for future research and development.
Defending against Internet worms using a phase space method from chaos theory
Jing Hu, Jianbo Gao, Nageswara S. Rao
Enterprise networks are facing ever-increasing security threats from Distributed Denial of Service (DDoS) attacks, worms, viruses, intrusions, Trojans, port scans, and network misuse, and thus effective monitoring approaches to quickly detect these activities are greatly needed. In this paper, we employ chaos theory and propose a phase space method to detect Internet worms. An Internet worm is a self-propagating program that automatically replicates itself to vulnerable systems and spreads across the Internet. Most deployed worm-detection systems are signature-based: they look for specific byte sequences (called attack signatures) that are known to appear in the attack traffic. Conventionally, the signatures are manually identified by human experts through careful analysis of the byte sequences from captured attack traffic. We propose to embed the traffic sequence into a high-dimensional phase space using chaos theory. We have observed that the signature sequence of a specific worm will occupy specific regions in the phase space, which may be appropriately called the invariant subspace of the worm. The invariant subspace of the worm separates itself widely from the subspace of the normal traffic. This separation allows us to construct three simple metrics, each of which completely separates 100 normal traffic streams from 200 worm traffic streams, without training in the conventional sense. Therefore, the method is at least as accurate as any existing method. More importantly, our method is much faster than existing methods, such as those based on expectation maximization and hidden Markov models.
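The time-delay (phase space) embedding at the heart of this approach can be sketched in a few lines. The embedding dimension, delay, and toy byte sequence below are illustrative assumptions; the paper's separation metrics are not reproduced.

```python
def delay_embed(seq, dim, tau):
    """Embed a scalar sequence into a dim-dimensional phase space using
    time-delay vectors x_i = (s_i, s_{i+tau}, ..., s_{i+(dim-1)*tau})."""
    n = len(seq) - (dim - 1) * tau
    return [tuple(seq[i + j * tau] for j in range(dim)) for i in range(n)]

# Hypothetical traffic byte sequence; real worm payloads would be used instead.
traffic = [0x90, 0x90, 0x31, 0xC0, 0x50, 0x68, 0x2F, 0x2F]
points = delay_embed(traffic, dim=3, tau=1)  # 6 points in 3-D phase space
```

Worm signatures would then be detected by checking which region of this space the embedded points occupy.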
Analysis and visualization of large complex attack graphs for network security
In this paper, we have proposed a comprehensive and innovative approach for analysis and visualization of large complex multi-step cyber attack graphs. As an automated tool for cyber attack detection, prediction, and visualization, the newly proposed method transforms large quantities of network security data into real-time actionable intelligence, which can be used to (1) provide guidance on network hardening to prevent attacks, (2) perform real-time attack event correlation during active attacks, and (3) formulate post-attack responses. We show that it is possible to visualize the complex graphs, including all possible network attack paths while still keeping complexity manageable. The proposed analysis and visualization tool provides an efficient and effective solution for predicting potential attacks upon observed intrusion evidence, as well as interactive multi-resolution views such that an analyst can first obtain high-level overviews quickly, and then drill down to specific details.
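The path enumeration that such tools must keep tractable can be illustrated with a depth-first search over a small, entirely hypothetical exploit-dependency graph (node names and edges below are invented; real attack graphs require the aggregation the abstract describes to stay manageable):

```python
def attack_paths(graph, start, goal, path=None):
    """Enumerate all attack paths from start to goal by depth-first search
    over a dict mapping each host to the hosts reachable from it."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # an attacker need not revisit a compromised host
            paths += attack_paths(graph, nxt, goal, path)
    return paths

net = {"internet": ["web"], "web": ["db", "fileserver"],
       "fileserver": ["db"], "db": []}
routes = attack_paths(net, "internet", "db")
```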
Summary of results on optimal camera placement for boundary monitoring
We consider the problem of placing cameras so that every point on a perimeter, which is not necessarily planar, is covered by at least one camera while using the smallest number of cameras. This is accomplished by aligning the edges of the cameras' fields of view with points on the boundary under surveillance. Visibility concerns are taken into consideration, where features such as mountains must not be allowed to come between a camera and a boundary point that would otherwise be in the camera's field of view. We provide a general algorithm that determines optimal camera placements and orientations. Additionally, we consider double coverings, where every boundary point is seen by at least two cameras, with selected boundary points and cameras situated such that the average calibration error between adjacent cameras is minimized. We describe an iterative algorithm that accomplishes these tasks. We also consider a joint optimization algorithm, which strikes a balance between minimizing calibration error and the number of cameras required to cover the boundary.
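A heavily simplified version of the covering problem, with the boundary flattened to a line and visibility constraints ignored, can be solved by a standard greedy sweep. The field-of-view width and sample points below are hypothetical; the paper's algorithm handles non-planar perimeters and occlusion.

```python
def min_cameras(points, fov):
    """Greedy 1-D sketch: each camera covers a window of width `fov` along a
    flattened boundary; place a camera at the first uncovered point and skip
    every point that falls inside its window."""
    count, covered_to = 0, float("-inf")
    for p in sorted(points):
        if p > covered_to:
            count += 1
            covered_to = p + fov
    return count

boundary = [0.0, 1.2, 2.5, 2.9, 6.0, 6.1]  # sampled boundary positions
needed = min_cameras(boundary, fov=2.0)
```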
Evaluation of data mining techniques for suspicious network activity classification using honeypots data
André Grégio, Rafael Santos, Antonio Montes
As the amount and types of remote network services increase, the analysis of their logs has become a very difficult and time-consuming task. There are several ways to filter relevant information and provide a reduced log set for analysis, such as whitelisting and intrusion detection tools, but all of them require too much fine-tuning work and human expertise. Nowadays, researchers are evaluating data mining approaches for intrusion detection in network logs, using techniques such as genetic algorithms, neural networks, clustering algorithms, etc. Some of those techniques yield good results, yet require a very large number of attributes gathered from network traffic to detect useful information. In this work we apply and evaluate some data mining techniques (K-Nearest Neighbors, Artificial Neural Networks, and Decision Trees) on a reduced number of attributes from log data sets acquired from a real network and a honeypot, in order to classify traffic logs as normal or suspicious. The results obtained allow us to identify unlabeled logs and to describe which attributes were used for the decision. This approach provides a very reduced amount of logs to the network administrator, improving the analysis task and aiding in discovering new kinds of attacks against their networks.
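Of the techniques evaluated, K-Nearest Neighbors is the simplest to sketch. The toy log attributes and labels below are invented for illustration, not taken from the paper's honeypot data.

```python
def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.
    `train` is a list of (feature_vector, label) pairs; Euclidean distance."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

# Hypothetical log records: (packets/s, distinct ports, payload entropy) -> label.
logs = [((2, 1, 0.3), "normal"), ((3, 2, 0.4), "normal"),
        ((90, 40, 0.9), "suspicious"), ((80, 55, 0.8), "suspicious"),
        ((4, 1, 0.2), "normal")]
verdict = knn_classify(logs, (85, 50, 0.85))
```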
Selection of intrusion detection system threshold bounds for effective sensor fusion
Ciza Thomas, Narayanaswamy Balakrishnan
The motivation behind the fusion of Intrusion Detection Systems was the realization that with the increasing traffic and increasing complexity of attacks, none of the present-day stand-alone Intrusion Detection Systems can meet the high demand for a very high detection rate and an extremely low false positive rate. Multi-sensor fusion can be used to meet these requirements by a refinement of the combined response of different Intrusion Detection Systems. In this paper, we show the design technique of sensor fusion to best utilize the useful response from multiple sensors by an appropriate adjustment of the fusion threshold. The threshold is generally chosen according to past experience or by an expert system. In this paper, we show that the choice of threshold bounds according to the Chebyshev inequality principle performs better. This approach also helps to solve the problem of scalability and has the advantage of failsafe capability. This paper theoretically models the fusion of Intrusion Detection Systems for the purpose of proving the improvement in performance, supplemented with an empirical evaluation. The combination of complementary sensors is shown to detect more attacks than the individual components. Since the individual sensors chosen detect sufficiently different attacks, their results can be merged for improved performance. The combination is done in different ways: (i) taking all the alarms from each system and avoiding duplications, (ii) taking alarms from each system by fixing threshold bounds, and (iii) rule-based fusion with a priori knowledge of the individual sensor performance. A number of evaluation metrics are used, and the results indicate that there is an overall enhancement in the performance of the combined detector using sensor fusion incorporating the threshold bounds, and significantly better performance using simple rule-based fusion.
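A minimal sketch of deriving a threshold bound from the Chebyshev inequality: for any distribution, P(X >= mu + k*sigma) <= 1/k^2, so choosing k = sqrt(1/alpha) caps the rate at which normal scores exceed the bound at alpha. The anomaly scores below are hypothetical, and this is only one side of the paper's bound selection.

```python
def chebyshev_threshold(scores, alpha):
    """Upper threshold via Chebyshev: P(X >= mu + k*sigma) <= 1/k**2,
    so k = (1/alpha) ** 0.5 bounds the false-alarm rate by alpha."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = (sum((s - mu) ** 2 for s in scores) / n) ** 0.5
    k = (1.0 / alpha) ** 0.5
    return mu + k * sigma

# Hypothetical anomaly scores observed on normal traffic.
normal_scores = [0.1, 0.2, 0.15, 0.12, 0.18, 0.22, 0.11, 0.14]
t = chebyshev_threshold(normal_scores, alpha=0.05)  # alert above t
```

Because Chebyshev holds for any distribution, the bound is conservative but distribution-free, which matches the scalability and failsafe claims in the abstract.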
Data Mining
Mining unknown patterns in data when the features are correlated
In this paper, a previously introduced data mining technique, utilizing the Mean Field Bayesian Data Reduction Algorithm (BDRA), is extended for use in finding unknown data clusters in a fused multidimensional feature space. In extending the BDRA for this application, its built-in dimensionality reduction aspects are exploited to isolate and automatically mine all points contained in each unknown cluster. In previous work, this approach was shown to have performance comparable to the classifier that knows all cluster information when mining up to two features containing multiple unknown clusters. However, unlike results shown in previous work based on lower-dimensional feature spaces, the results in this paper are based on utilizing up to twenty fused features. This is due to improvements in the training algorithm, which now mines for candidate data clusters by processing all points in a quantized cell simultaneously, as opposed to the previous method, which processed all points sequentially. This improvement in processing has resulted in a substantial reduction in the run time of the algorithm. Finally, performance is illustrated and compared with simulated data containing multiple clusters, where the relevant feature space contains both correlated and uncorrelated classification information.
Image information mining from geospatial archives based on a combination of the wavelet transform and Fourier phase descriptor
Vijay P. Shah, Nicholas H. Younan, Surya S. Durbha, et al.
In general, reflectance and spatial patterns characterize geospatial data. Current semantic-enabled image retrieval frameworks for geospatial data extract primitive features based on color, texture (Spatial Gray Level Dependency, or SGLD, matrices), and shape from segmented homogeneous regions. However, this form of extracting textural information is computationally expensive. The state-of-the-art image mining system for multimedia image archives uses the wavelet transform for feature extraction to quickly and efficiently capture color and texture information. Since an image consists of three bands, color information is captured by converting the RGB space into HSV space. Thus, a new approach is required to capture the complete reflectance pattern, an important characteristic of geospatial data. This work proposes a new method to perform fast coarse image segmentation using descriptors obtained by combining the 2D wavelet transform along the spatial axes and the Fourier transform along the spectral axis to capture color and texture information for segmentation. These features are later used for region-based retrieval in Earth observation data archives. Compared to traditional techniques, results show that the proposed method provides good retrieval accuracy in terms of F-measure for land cover classes.
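The kind of cheap wavelet feature the abstract refers to can be illustrated with one level of a 1-D Haar transform. The paper's actual pipeline is a 2-D wavelet transform along the spatial axes combined with a Fourier transform along the spectral axis; this is only the simplest building block, with invented sample values.

```python
def haar_step(signal):
    """One level of the 1-D Haar wavelet transform: pairwise averages
    (coarse approximation) and pairwise differences (detail).
    Assumes an even-length input."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det

band = [9.0, 7.0, 3.0, 5.0]          # hypothetical pixel intensities
coarse, detail = haar_step(band)     # coarse texture + local variation
```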
Genetic program based data mining of fuzzy decision trees and methods of improving convergence and reducing bloat
James F. Smith III, Thanh Vu H. Nguyen
A data mining procedure for automatic determination of fuzzy decision tree structure using a genetic program (GP) is discussed. A GP is an algorithm that evolves other algorithms or mathematical expressions. Innovative methods for accelerating convergence of the data mining procedure and reducing bloat are given. In genetic programming, bloat refers to excessive tree growth. It has been observed that the trees in the evolving GP population will grow by a factor of three every 50 generations. When evolving mathematical expressions much of the bloat is due to the expressions not being in algebraically simplest form. So a bloat reduction method based on automated computer algebra has been introduced. The effectiveness of this procedure is discussed. Also, rules based on fuzzy logic have been introduced into the GP to accelerate convergence, reduce bloat and produce a solution more readily understood by the human user. These rules are discussed as well as other techniques for convergence improvement and bloat control. Comparisons between trees created using a genetic program and those constructed solely by interviewing experts are made. A new co-evolutionary method that improves the control logic evolved by the GP by having a genetic algorithm evolve pathological scenarios is discussed. The effect on the control logic is considered. Finally, additional methods that have been used to validate the data mining algorithm are referenced.
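The computer-algebra bloat-reduction step can be mimicked with a toy rewrite pass over expression trees. The two rules below (x*1 -> x, x+0 -> x) are a stand-in for full algebraic simplification, not the paper's actual system.

```python
def simplify(expr):
    """One bloat-reduction pass: rewrite ('*', e, 1) -> e and
    ('+', e, 0) -> e, bottom-up, mimicking putting a GP expression
    into a simpler algebraic form."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)
    if op == '*' and b == 1: return a
    if op == '*' and a == 1: return b
    if op == '+' and b == 0: return a
    if op == '+' and a == 0: return b
    return (op, a, b)

def size(expr):
    """Node count of an expression tree -- the quantity bloat inflates."""
    return 1 if not isinstance(expr, tuple) else 1 + size(expr[1]) + size(expr[2])

bloated = ('+', ('*', 'x', 1), ('*', ('+', 'y', 0), 1))  # x*1 + (y+0)*1
slim = simplify(bloated)
```

Applying such a pass between generations shrinks the trees without changing the functions they compute, which is exactly the benefit claimed for the automated-algebra method.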
Applications
Enabling distributed simulation multilevel security using virtual machine and virtual private network technology
Martin R. Stytz, Sheila B. Banks
Increasing the accuracy of the portrayal of all of the elements of a simulation environment has long been a prime goal of the modeling and simulation community; a goal that has remained far out of reach for many reasons. One of the greatest hurdles facing simulation developers in the effort to increase simulation accuracy is the need to segregate information across the entire simulation environment according to access restrictions in order to ensure the integrity, security, and reliability requirements imposed on the data. However, this need for segregation does not mean that those with the highest access permissions should be forced to use multiple computers and displays to integrate the information that they need, or that intelligent agents should be restricted in their access to the information that they need in order to adequately assist their human operators. In this paper, we present a potential solution to the problem of integrating and segregating data: the use of virtual machine and virtual private network technology to maintain segregation of data, control access, and control intercommunication.
Maximising information recovery from rank-order codes
B. Sen, S. Furber
The central nervous system encodes information in sequences of asynchronously generated voltage spikes, but the precise details of this encoding are not well understood. Thorpe proposed rank-order codes as an explanation of the observed speed of information processing in the human visual system. The work described in this paper is inspired by the performance of SpikeNET, a biologically inspired neural architecture using rank-order codes for information processing, and is based on the retinal model developed by VanRullen and Thorpe. This model mimics retinal information processing by passing an input image through a bank of Difference of Gaussian (DoG) filters and then encoding the resulting coefficients in rank-order. To test the effectiveness of this encoding in capturing the information content of an image, the rank-order representation is decoded to reconstruct an image that can be compared with the original. The reconstruction uses a look-up table to infer the filter coefficients from their rank in the encoded image. Since the DoG filters are approximately orthogonal functions, they are treated as their own inverses in the reconstruction process. We obtained a quantitative measure of the perceptually important information retained in the reconstructed image relative to the original using a slightly modified version of an objective metric proposed by Petrovic. It is observed that around 75% of the perceptually important information is retained in the reconstruction. In the present work we reconstruct the input using a pseudo-inverse of the DoG filter-bank with the aim of improving the reconstruction and thereby extracting more information from the rank-order encoded stimulus. We observe that there is an increase of 10 - 15% in the information retrieved from a reconstructed stimulus as a result of inverting the filter-bank.
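Rank-order encoding and look-up-table decoding can be sketched directly. The filter responses and LUT values below are invented; the paper derives its look-up table from the statistics of the DoG filter outputs.

```python
def rank_order_encode(coeffs):
    """Encode filter outputs by rank only: return filter indices ordered
    from largest to smallest coefficient (the magnitudes are discarded)."""
    return sorted(range(len(coeffs)), key=lambda i: -coeffs[i])

def rank_order_decode(order, lut):
    """Reconstruct coefficients from ranks using a look-up table mapping
    rank -> assumed coefficient value (this LUT is illustrative)."""
    coeffs = [0.0] * len(order)
    for rank, idx in enumerate(order):
        coeffs[idx] = lut[rank]
    return coeffs

dog_outputs = [0.2, 0.9, 0.5, 0.1]   # hypothetical DoG filter responses
order = rank_order_encode(dog_outputs)
lut = [1.0, 0.6, 0.3, 0.1]           # assumed value for each rank
recon = rank_order_decode(order, lut)
```

The reconstruction preserves the ordering of the responses but not their exact values, which is why the paper measures how much perceptually important information survives the round trip.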
Development of a model for assessing the impact of information assurance functionality on secure messaging system performance
An analytical performance model for a generic secure messaging system is formulated as a multi-class queuing network. The model includes assessment of the impact of security features, such as secret-key encryption/decryption, signature generation/verification, and certificate validation, on overall performance. Findings of a sensitivity analysis with respect to message rate, WAN transmission link, SSL encryption, message size, and distance between servers are also presented. Finally, how the model can be adopted for making performance-based architectural design choices is outlined.
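A single-class M/M/1 queue is the simplest building block of such a model: mean response time is W = 1/(mu - lambda), so any security feature that lowers the effective service rate mu raises W. The rates below are hypothetical, and the paper's multi-class network is considerably richer than this sketch.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue, W = 1 / (mu - lambda).
    Requires a stable queue (lambda < mu)."""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

# Hypothetical: signature generation adds work, cutting the service rate.
base = mm1_response_time(arrival_rate=50.0, service_rate=100.0)   # msgs/s
signed = mm1_response_time(arrival_rate=50.0, service_rate=80.0)  # with signing
```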
Cluster analysis of temporal trajectories of hospital laboratory examinations
This paper proposes a new approach to temporal trajectory analysis for clinical laboratory examinations. When we select m laboratory examinations, their temporal evolution for one patient can be viewed as a trajectory in m-dimensional space. A multiscale comparison technique can be applied for segmentation and calculation of structural similarities of such trajectories. Then, clustering can be applied to the calculated similarities for classification of these trajectories. The proposed method was evaluated on hepatitis datasets; the results show that the clustering captured several interesting patterns for severe chronic hepatitis.
Discovery of exacerbating cases in chronic hepatitis based on cluster analysis of time-series platelet count data
This paper reports the results of temporal analysis of platelet (PLT) data in a chronic hepatitis dataset. First, we briefly introduce a cluster analysis system for temporal data that we have developed. Second, we show the results of cluster analysis of PLT sequences. Third, we show the results of PLT value-based temporal analysis aimed at finding years for reaching F4, years elapsed between stages, and their relationships with virus types and fibrotic stages. The results of cluster analysis indicate that the temporal courses of PLT can be grouped into several patterns, each of which presents similarity in average PLT level and increase/decrease trends. The results of value-based analysis suggest that liver fibrosis may proceed faster in the exacerbating cases.
Supporting online learning with games
JingTao Yao, DongWon Kim, Joseph P. Herbert
This paper presents a study on a Web-based learning support system that is enhanced with two major subsystems: a Web-based learning game and a learning-oriented Web search. The Internet and the Web may be considered a first resource for students seeking information and help. However, much of the information available online is not related to the course contents or, in the worst case, is wrong. The search subsystem aims to provide students with precise, relevant, and adaptable documents about certain courses or classes. Therefore, students do not have to spend time verifying the relevance of documents to the class. The learning game subsystem stimulates students to study, enabling them to review their studies and to perform self-evaluation through a Web-based learning game such as a treasure hunt game. Through this challenging and entertaining learning and evaluation process, it is hoped that students will eventually understand and master the course concepts easily. The goal of developing such a system is to provide students with an efficient and effective learning environment.
Miscellaneous Topics
AutoCorrel II: a neural network event correlation approach
Maxwell G. Dondo, Peter Mason, Nathalie Japkowicz, et al.
As a follow-up to our earlier model, AutoCorrel I, we have implemented a two-stage event correlation approach with improved performance. Like AutoCorrel I, the new model correlates intrusion detection system (IDS) alerts to automate alert and incident management and reduce the workload on an IDS analyst. We achieve this correlation by clustering similar alerts, thus allowing the analyst to consider only a few clusters rather than hundreds or thousands of alerts. The first stage uses an artificial neural network (ANN)-based autoassociator (AA). The AA's objective is to attempt to reproduce each alert at its output. In the process, it uses an error metric, the reconstruction error (RE), between its input and output to cluster similar alerts. In order to improve the accuracy of the system we add another machine-learning stage which takes into account the RE as well as raw attribute information from the input alerts. This stage uses the Expectation-Maximisation (EM) clustering algorithm. The performance of this approach is tested with intrusion alerts generated by a Snort IDS on DARPA's 1999 IDS evaluation data as well as incidents.org alerts.
New metrics for blog mining
Brian Ulicny, Ken Baclawski, Amy Magnus
Blogs represent an important new arena for knowledge discovery in open source intelligence gathering. Bloggers are a vast network of human (and sometimes non-human) information sources monitoring important local and global events, and other blogs, for items of interest upon which they comment. Increasingly, issues erupt from the blog world and into the real world. In order to monitor blogging about important events, we must develop models and metrics that represent blogs correctly. The structure of blogs requires new techniques for evaluating such metrics as the relevance, specificity, credibility, and timeliness of blog entries. Techniques that have been developed for standard information retrieval purposes (e.g. Google's PageRank) are suboptimal when applied to blogs because of their high degree of exophoricity, quotation, brevity, and rapidity of update. In this paper, we offer new metrics for blog entry relevance, specificity, timeliness, and credibility that we are implementing in a blog search and analysis tool for international blogs. This tool utilizes new blog-specific metrics and techniques for extracting the necessary information from blog entries automatically, using shallow natural language processing techniques supported by background knowledge captured in domain-specific ontologies.
Adaptive Gram-Schmidt orthogonalization for the projection-slice synthetic discriminant function filter
Vahid R. Riasati, Denis Grishin
One of the primary reasons for using the PSDF is the inherent data dimension reduction that is achieved through the use of the projection-slice theorem (PST). The use of the PST allows for a powerful technique to segment informational content into lower-dimensional spans while simultaneously providing a complete and naturally linked data set. Shannon's formula for the maximum information capacity of a channel implies that the most efficient informational coding uses independent data samples, as this spreads the spectrum evenly across the channel band. It follows that independent data representing a correlated set contain maximal information if the "key" that represents the correlatedness of the data is known; otherwise the independent data are purely random. By using a novel Adaptive Gram-Schmidt (AGS) procedure, we form a method that identifies patterns in correlated data for the removal of the inter-dependence, thereby maximizing information content per data sample. In this work we subject the lower-dimensional data sets in the PSDF to the AGS to maximize the information content of the PSDF, and we share some of our findings, results, and deductions.
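For reference, classical (non-adaptive) Gram-Schmidt orthogonalization, which the AGS procedure builds on, can be sketched as follows; the adaptive pattern-identification step is not reproduced here, and the sample vectors are invented.

```python
def gram_schmidt(vectors):
    """Classical Gram-Schmidt: subtract from each vector its projections
    onto the previously orthogonalized ones; near-zero remainders
    (linearly dependent inputs) are dropped."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            coef = dot(w, u) / dot(u, u)
            w = [wi - coef * ui for wi, ui in zip(w, u)]
        if any(abs(x) > 1e-12 for x in w):
            basis.append(w)
    return basis

ortho = gram_schmidt([[1.0, 1.0], [1.0, 0.0]])
```

Removing the inter-dependence this way is what lets each retained sample carry independent, and hence maximal, information.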
Semantic search via concept annealing
Annealing, in metallurgy and materials science, is a heat treatment wherein the microstructure of a material is altered, causing changes in its properties such as strength and hardness. We define concept annealing as a lexical, syntactic, and semantic expansion capability (the removal of defects and the internal stresses that cause term- and phrase-based search failure) coupled with a directed contraction capability (semantically related terms, queries, and concepts nucleate and grow to replace those originally deformed by internal stresses). These two capabilities are tied together in a control loop mediated by the information retrieval precision and recall metrics coupled with intuition provided by the operator. The specific representations developed have been targeted at facilitating highly efficient and effective semantic indexing and searching. This new generation of Find capability enables additional processing (i.e. all-source tracking, relationship extraction, and total system resource management) at rates, precisions, and accuracies previously considered infeasible. In a recent experiment, an order-of-magnitude reduction in time to actionable intelligence and a nearly three-orders-of-magnitude reduction in false alarm rate were achieved.
Three-way aspect model (TWAPM) and co-training for image retrieval
Anca Doloc-Mihu, Vijay V. Raghavan
The goal of this work is to investigate the applicability of two probabilistic approaches, namely Three-Way Aspect Model (TWAPM) and Co-training, to retrieve images represented by multiple feature types from large image collections. We test these approaches in a learning context via relevance feedback from user. Our experiments show that Co-training may be a better choice than TWAPM for a Web-based Image Retrieval System.
A flexible self-learning model based on granular computing
Ting Wei, Yu Wu, Yinguo Li
Granular Computing (GrC) is an emerging theory that simulates the process by which the human brain understands and solves problems. Rough set theory is a tool for dealing with the uncertainty and vagueness aspects of a knowledge model. The SLMGrC algorithm introduces GrC to classical rough set algorithms and keeps the length of the rules relatively short, but it cannot process massive data sets. To solve this problem, based on an analysis of the hierarchical granular model of the information table, the Granular Distribution List (GDL) method is introduced to generate granules, and the granular computing algorithm (SLMGrC) is improved. A Sample Covered Factor (SCF) is also introduced to control the generation of rules when the algorithm produces conflicting rules. The improved algorithm can process massive data sets directly without affecting the validity of SLMGrC. Experiments demonstrate the validity and flexibility of our method.
Poster Session
Selecting materialized views using random algorithm
Lijuan Zhou, Zhongxiao Hao, Chi Liu
The data warehouse is a repository of information collected from multiple, possibly heterogeneous, autonomous distributed databases. The information stored at the data warehouse is in the form of views, referred to as materialized views. The selection of materialized views is one of the most important decisions in designing a data warehouse, as materialized views are stored in the data warehouse for the purpose of efficiently implementing on-line analytical processing queries. The first issue for the user to consider is query response time. In this paper, we therefore develop algorithms to select a set of views to materialize in a data warehouse in order to minimize the total view maintenance cost under the constraint of a given query response time; we call this the query-cost view-selection problem. First, the cost graph and cost model of the query-cost view-selection problem are presented. Second, methods for selecting materialized views using randomized algorithms are presented. The genetic algorithm is applied to the materialized-view selection problem, but as the genetic process evolves, legal solutions become more and more difficult to produce, so many solutions are eliminated and the time needed to produce solutions grows. Therefore, an improved algorithm is presented in this paper: a combination of simulated annealing and genetic algorithms for solving the query-cost view-selection problem. Finally, simulation experiments were conducted to test the effectiveness and efficiency of our algorithms. The experiments show that the given methods provide near-optimal solutions in limited time and work well in practical cases. Randomized algorithms will become invaluable tools for data warehouse evolution.
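The simulated-annealing half of the proposed hybrid can be sketched with a toy objective (per-view maintenance cost minus query savings, both invented); the paper's actual cost graph, response-time constraint, and genetic operators are more involved.

```python
import math
import random

def anneal_select(costs, savings, steps=2000, temp=1.0, cooling=0.995, seed=1):
    """Toy simulated-annealing sketch of materialized-view selection:
    flip one view in/out per step, always accept improvements, and accept
    worse states with probability exp(-delta/temp). `costs[i]` is the
    maintenance cost of view i; selecting it saves `savings[i]` of query
    cost (both hypothetical)."""
    rng = random.Random(seed)
    n = len(costs)
    state = [False] * n

    def total(s):  # objective to minimize: maintenance cost minus savings
        on_cost = sum(c for c, on in zip(costs, s) if on)
        on_save = sum(b for b, on in zip(savings, s) if on)
        return on_cost - on_save

    cur = total(state)
    for _ in range(steps):
        i = rng.randrange(n)
        state[i] = not state[i]
        new = total(state)
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
        else:
            state[i] = not state[i]  # reject: undo the flip
        temp *= cooling
    return state, cur

views, objective = anneal_select(costs=[3, 5, 2, 8], savings=[10, 4, 1, 9])
```

With this separable objective the annealer settles on exactly the views whose savings exceed their maintenance cost, which is the behavior the hybrid approach aims to reach faster than the genetic algorithm alone.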