8 results for correlation-based feature selection
in Digital Commons at Florida International University
Abstract:
With the rapid growth of the Internet, computer attacks are increasing at a fast pace and can easily cause millions of dollars in damage to an organization. Detecting these attacks is an important issue in computer security. There are many types of attacks, and they fall into four main categories: Denial of Service (DoS) attacks, Probe attacks, User to Root (U2R) attacks, and Remote to Local (R2L) attacks. Of these, DoS and Probe attacks appear with high frequency within a short period of time when they attack systems. They differ from normal traffic data and can be easily separated from normal activities. In contrast, U2R and R2L attacks are embedded in the data portions of the packets and normally involve only a single connection, which makes it difficult to achieve satisfactory detection accuracy for these two attack types. Therefore, we focus on studying the ambiguity problem between normal activities and U2R/R2L attacks. The goal is to build a detection system that can accurately and quickly detect these two attacks. In this dissertation, we design a two-phase intrusion detection approach. In the first phase, a correlation-based feature selection algorithm is proposed to speed up detection. Features with poor prediction ability for the signatures of attacks and features inter-correlated with one or more other features are considered redundant. Such features are removed, and only indispensable information about the original feature space remains. In the second phase, we develop an ensemble intrusion detection system to achieve accurate detection performance. The proposed method includes multiple feature-selecting intrusion detectors and a data mining intrusion detector. The former are a set of detectors, each of which uses a fuzzy clustering technique and belief theory to resolve the ambiguity problem.
The latter applies a data mining technique to automatically extract computer users’ normal behavior from training network traffic data. The final decision combines the outputs of the feature-selecting and data mining detectors. The experimental results indicate that our ensemble approach not only significantly reduces the detection time but also effectively detects U2R and R2L attacks that contain degrees of ambiguous information.
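The first-phase idea described above, dropping features that predict the label poorly and features that duplicate one another, can be sketched as a simple greedy filter. This is only an illustration under stated assumptions: the abstract does not give the actual algorithm, so the use of Pearson correlation, the two thresholds (`min_relevance`, `max_redundancy`), and the toy feature names are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, min_relevance=0.3, max_redundancy=0.9):
    """Return the names of features kept by a greedy correlation filter.

    columns: dict mapping feature name -> list of numeric values.
    labels:  list of numeric class labels (e.g. 0 = normal, 1 = attack).
    Both thresholds are illustrative assumptions, not values from the source.
    """
    # Relevance: absolute correlation between each feature and the label.
    relevance = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)

    kept = []
    for name in ranked:
        if relevance[name] < min_relevance:
            continue  # poor predictor of the label
        # Redundancy: skip a feature that is nearly a copy of one already kept.
        if any(abs(pearson(columns[name], columns[k])) > max_redundancy for k in kept):
            continue
        kept.append(name)
    return kept


# Toy data with hypothetical feature names, not taken from any real dataset.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "failed_logins": [0, 1, 0, 4, 5, 6],
    "failed_logins_x2": [0, 2, 0, 8, 10, 12],  # exact multiple of the first feature
    "packet_noise": [3, 1, 2, 2, 3, 1],        # uncorrelated with the label
}
kept = select_features(columns, labels)
print(kept)
```

In this toy input, one of the two login features is dropped as redundant and the noise feature is dropped as irrelevant, so fewer features need to be processed at detection time.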
Abstract:
Hazardous materials are substances that, if not regulated, can pose a threat to human populations and their environmental health, safety or property when transported in commerce. About 1.5 million tons of hazardous material shipments are transported by truck in the US annually, with a steady increase of approximately 5% per year. The objective of this study was to develop a routing tool for hazardous material transport that reduces environmental impacts and transportation difficulties while still finding paths that remain attractive to shipping carriers in terms of trucking cost. The study started with the identification of inhalation hazard impact zones and explosion protective areas around the locations of hypothetical hazardous material releases, considering different parameters (e.g., chemical characteristics, release quantities, and atmospheric conditions). Results showed that the consequences of these incidents can differ depending on the release quantity, the chemical, and atmospheric stability (a function of wind speed, meteorology, sky cover, time and location of the accident, etc.). The study was then extended with further evaluation criteria, because health risk is not the only concern in route selection. Transportation difficulties (i.e., road blockage and congestion) were incorporated as an important factor because of their indirect impact and cost on the users of transportation networks. Trucking costs were also considered one of the primary criteria in selecting hazardous material paths; otherwise the suggested routes would not have been convincing to shipping companies. The last criterion was the proximity of public places to the routes.
The approach evolved from a simple framework into a sophisticated and efficient GIS-based tool able to investigate the transportation network of any given study area and to generate the best routing options for cargo. The suggested tool uses a multi-criteria decision-making method, which considers the priorities of the decision makers in choosing the cargo routes. Comparison of the routing options on each criterion, and of the overall suitability of each path with regard to all the criteria (using a multi-criteria decision-making method), showed that tools similar to the one proposed in this study can give decision makers insight into hazardous material transport. The tool presents the probable consequences of choosing each path in an easily understandable way, as maps and tables, which makes the tradeoffs between costs and risks considerably simpler to assess; in some cases, slightly compromising on trucking cost may drastically decrease the probable health risk and/or traffic difficulties. This will not only benefit the community by making cities safer places to live, but can also benefit shipping companies by allowing them to advertise themselves as environmentally friendly carriers.
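The tradeoff between criteria described above can be illustrated with a weighted-sum ranking, one of the simplest multi-criteria decision-making schemes. The abstract does not say which method the tool uses, so the weighted sum, the min-max normalization, the weights, and the route values below are all assumptions made purely for illustration.

```python
def normalize(values):
    """Min-max scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_routes(routes, weights):
    """Rank routes by the weighted sum of normalized criteria (lower is better).

    routes:  dict route name -> dict criterion -> raw value, where a larger
             raw value is worse for every criterion.
    weights: dict criterion -> non-negative weight set by the decision maker.
    """
    names = list(routes)
    total_weight = sum(weights.values())
    scores = {name: 0.0 for name in names}
    for criterion, weight in weights.items():
        scaled = normalize([routes[name][criterion] for name in names])
        for name, value in zip(names, scaled):
            scores[name] += (weight / total_weight) * value
    # Best (lowest-score) route first.
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical routes and values, invented for this example only.
routes = {
    "A": {"health_risk": 120, "congestion": 30, "trucking_cost": 900, "public_places": 14},
    "B": {"health_risk": 40, "congestion": 45, "trucking_cost": 960, "public_places": 5},
    "C": {"health_risk": 80, "congestion": 20, "trucking_cost": 1100, "public_places": 9},
}
weights = {"health_risk": 0.4, "congestion": 0.2, "trucking_cost": 0.3, "public_places": 0.1}
ranking = rank_routes(routes, weights)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up numbers, the cheapest route (A) ranks last because of its health risk, while a slightly more expensive route (B) ranks first. Changing the weights changes the ranking, which is how the decision makers' priorities enter the result.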
Abstract:
Due to rapid advances in computing and sensing technologies, enormous amounts of data are being generated every day in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets and to discover hidden patterns. For both data mining and visualization to be effective, it is important to include visualization techniques in the mining process and to present the discovered patterns in a more comprehensive visual view. In this dissertation, four related problems are studied to explore the integration of data mining and data visualization: dimensionality reduction for visualizing high-dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration. In particular, we 1) propose an efficient feature selection method (ReliefF + mRMR) for preprocessing high-dimensional datasets; 2) present DClusterE, which integrates cluster validation with user interaction and provides rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems that involve users' efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework that organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.
Abstract:
Since the seminal works of Markowitz (1952), Sharpe (1964), and Lintner (1965), numerous studies on portfolio selection and performance measurement have been based upon the mean-variance framework. However, several researchers (e.g., Arditti (1967, 1971), Samuelson (1970), and Rubinstein (1973)) argue that the higher moments cannot be neglected unless there is reason to believe that: (i) the asset returns are normally distributed and the investor's utility function is quadratic, or (ii) the empirical evidence demonstrates that higher moments are irrelevant to the investor's decision. Based on the same argument, this dissertation investigates the impact of higher moments of return distributions on three issues concerning the 14 international stock markets. First, portfolio selection with skewness is determined using Polynomial Goal Programming, in which investor preferences for skewness can be incorporated. The empirical findings suggest that the return distributions of international stock markets are not normally distributed, and that the incorporation of skewness into an investor's portfolio decision causes a major change in the construction of the optimal portfolio. The evidence also indicates that an investor will trade expected return of the portfolio for skewness. Moreover, when short sales are allowed, investors are better off, as they attain higher expected return and skewness simultaneously. Second, the performance of international stock markets is evaluated using two types of performance measures: (i) the two-moment performance measures of Sharpe (1966) and Treynor (1965), and (ii) the higher-moment performance measures of Prakash and Bear (1986) and Stephens and Proffitt (1991). The empirical evidence indicates that higher moments of return distributions are significant and relevant to the investor's decision. Thus, the higher-moment performance measures should be more appropriate for evaluating the performance of international stock markets.
The evidence also indicates that various measures provide a vastly different performance ranking of the markets, albeit in the same direction. Finally, the inter-temporal stability of the international stock markets is investigated using the Parhizgari and Prakash (1989) algorithm for the Sen and Puri (1968) test, which accounts for non-normality of return distributions. The empirical findings indicate that there is strong evidence to support stability in international stock market movements. However, when the Anderson test, which assumes normality of return distributions, is employed, stability in the correlation structure is rejected. This suggests that the non-normality of the return distribution is an important factor that cannot be ignored in the investigation of inter-temporal stability of international stock markets.
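The point that a two-moment measure can miss information carried by higher moments can be made concrete with a small sketch. It computes a Sharpe-style reward-to-variability ratio and the sample skewness for two return series. The series are invented for illustration, the population forms of the standard deviation and skewness are an assumption, and the higher-moment measures cited in the abstract are not reproduced here.

```python
from statistics import mean, pstdev


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / pstdev(excess)


def skewness(returns):
    """Third standardized moment (population form); 0 for a symmetric sample."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return sum(((r - mu) / sigma) ** 3 for r in returns) / len(returns)


# Two hypothetical markets: market_b mirrors market_a around the same mean,
# so both share the same mean and variance but have opposite skewness.
market_a = [-0.02, -0.02, -0.02, 0.10]
market_b = [0.04, 0.04, 0.04, -0.08]

for name, series in (("A", market_a), ("B", market_b)):
    print(name, round(sharpe_ratio(series), 4), round(skewness(series), 4))
```

Both series receive the same two-moment score, yet one has occasional large gains and the other occasional large losses. An investor with a preference for positive skewness would not regard them as equivalent, which is the situation that motivates higher-moment measures.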
Abstract:
Computer networks produce tremendous amounts of event-based data that can be collected and managed to support an increasing number of new classes of pervasive applications. Examples of such applications are network monitoring and crisis management. Although the problem of distributed event-based management has been addressed in non-pervasive settings such as the Internet, the domain of pervasive networks has its own characteristics that make these results inapplicable. Many of these applications are based on time-series data in the form of time-ordered series of events. Such applications also need to handle large volumes of unexpected events, which are often modified on the fly and contain conflicting information, and to deal with rapidly changing contexts while producing results with low latency. Correlating events across contextual dimensions holds the key to expanding the capabilities and improving the performance of these applications. This dissertation addresses this critical challenge. It establishes an effective scheme for complex-event semantic correlation. The scheme examines epistemic uncertainty in computer networks by fusing event synchronization concepts with belief theory. Because of the distributed nature of event detection, time delays are considered. Events are no longer instantaneous; a duration is associated with them. Existing algorithms for synchronizing time are split into two classes, one of which is asserted to converge faster and hence to be better suited for pervasive network management. Besides the temporal dimension, the scheme considers imprecision and uncertainty when an event is detected. A belief value is therefore associated with the semantics and the detection of composite events. This belief value is generated by a consensus among participating entities in a computer network.
The scheme taps into the in-network processing capabilities of pervasive computer networks and can withstand missing or conflicting information gathered from multiple participating entities. Thus, this dissertation advances knowledge in the field of network management by facilitating the full utilization of the characteristics offered by pervasive, distributed and wireless technologies in contemporary and future computer networks.
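One common way to derive a single belief value from several partly conflicting observers is Dempster's rule of combination from Dempster-Shafer belief theory. The abstract mentions belief theory and consensus but does not say which combination rule is used, so the sketch below is an assumption, and the node reports and hypothesis names are hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses in each function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass on contradictory claims
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


def belief(mass, hypothesis):
    """Belief in a hypothesis: total mass of all its non-empty subsets."""
    return sum(m for s, m in mass.items() if s <= hypothesis)


# Hypothetical reports from two nodes about whether an event occurred.
EVENT = frozenset({"event"})
NO_EVENT = frozenset({"no_event"})
UNKNOWN = EVENT | NO_EVENT  # mass here means "cannot tell"

node_1 = {EVENT: 0.6, UNKNOWN: 0.4}
node_2 = {EVENT: 0.7, NO_EVENT: 0.1, UNKNOWN: 0.2}

fused = combine(node_1, node_2)
print(round(belief(fused, EVENT), 4))
```

Mass placed on `UNKNOWN` lets a node express missing information without voting against the event, and the conflict term absorbs contradictory reports before renormalization.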
Abstract:
Thanks to advanced technologies and social networks that allow data to be widely shared across the Internet, there is an explosion of pervasive multimedia data, generating high demand for multimedia services and applications in various areas so that people can easily access and manage multimedia data. In response to this demand, multimedia big data analysis has become an emerging hot topic in both industry and academia, ranging from basic infrastructure, management, search, and mining to security, privacy, and applications. Within the scope of this dissertation, a multimedia big data analysis framework is proposed for semantic information management and retrieval, with a focus on rare event detection in videos. The proposed framework is able to explore hidden semantic feature groups in multimedia data and incorporate temporal semantics, especially for video event detection. First, a hierarchical semantic data representation is presented to alleviate the semantic gap issue, and the Hidden Coherent Feature Group (HCFG) analysis method is proposed to capture the correlation between features and separate the original feature set into semantic groups, seamlessly integrating multimedia data in multiple modalities. Next, an Importance Factor based Temporal Multiple Correspondence Analysis (IF-TMCA) approach is presented for effective event detection. Specifically, the HCFG algorithm is integrated with the Hierarchical Information Gain Analysis (HIGA) method to generate the Importance Factor (IF) for producing the initial detection results. Then, the TMCA algorithm is proposed to efficiently incorporate temporal semantics for re-ranking and improving the final performance. Finally, a sampling-based ensemble learning mechanism is applied to further accommodate imbalanced datasets. In addition to the multimedia semantic representation and class imbalance problems, lack of organization is another critical issue for multimedia big data analysis.
In this framework, an affinity propagation-based summarization method is also proposed to transform the unorganized data into a better structure with clean and well-organized information. The whole framework has been thoroughly evaluated across multiple domains, such as soccer goal event detection and disaster information management.