301 resultados para stream mining
Resumo:
This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.
Resumo:
To date, most applications of algebraic analysis and attacks on stream ciphers are on those based on lin- ear feedback shift registers (LFSRs). In this paper, we extend algebraic analysis to non-LFSR based stream ciphers. Specifically, we perform an algebraic analysis on the RC4 family of stream ciphers, an example of stream ciphers based on dynamic tables, and inves- tigate its implications to potential algebraic attacks on the cipher. This is, to our knowledge, the first pa- per that evaluates the security of RC4 against alge- braic attacks through providing a full set of equations that describe the complex word manipulations in the system. For an arbitrary word size, we derive alge- braic representations for the three main operations used in RC4, namely state extraction, word addition and state permutation. Equations relating the inter- nal states and keystream of RC4 are then obtained from each component of the cipher based on these al- gebraic representations, and analysed in terms of their contributions to the security of RC4 against algebraic attacks. Interestingly, it is shown that each of the three main operations contained in the components has its own unique algebraic properties, and when their respective equations are combined, the resulting system becomes infeasible to solve. This results in a high level of security being achieved by RC4 against algebraic attacks. On the other hand, the removal of an operation from the cipher could compromise this security. Experiments on reduced versions of RC4 have been performed, which confirms the validity of our algebraic analysis and the conclusion that the full RC4 stream cipher seems to be immune to algebraic attacks at present.
Resumo:
The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by the increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia related complications. Stress electrocardiography/exercise testing is predictive of 10 year risk of CVD events and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to this data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). Because of the unbalanced class distribution in the data, majority class under-sampling and Kappa statistic together with misclassification rate and area under the ROC curve (AUC) are used for evaluation of models generated using different prediction algorithms. The performance based on models derived from feature reduced datasets reveal the filter method, Cfs subset evaluation, to be most consistently effective although Consistency derived subsets tended to slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This could be eliminated by consideration of the AUC or Kappa statistic as well by evaluation of subsets with under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62 with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection results in reduction in MR to 9.8 to 10.16 with time segmented summary data (dataset F) MR being 9.8 and raw time-series summary data (dataset A) being 9.92. However, for all time-series only based datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series alone datasets but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-cross validation method. For models based on Cfs selected time-series derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36 with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time segmented time-series variables and RF) being 9.09. The models based on counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (Dataset RF_ G) perform the least well with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28 while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are most comprehensible and clinically relevant. The predictive accuracy increase achieved by addition of risk factor variables to time-series variable based models is significant. The addition of time-series derived variables to models based on risk factor variables alone is associated with a trend to improved performance. Data mining of feature reduced, anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables when compared to use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The effect of the pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input. In the absence of risk factor input, the use of time-series variables after outlier removal and time series variables based on physiological variable values’ being outside the accepted normal range is associated with some improvement in model performance.
Resumo:
The Queensland Coal Industry Employees Health Scheme was implemented in 1993 to provide health surveillance for all Queensland coal industry workers. Tt1e government, mining employers and mining unions agreed that the scheme should operate for seven years. At the expiry of the scheme, an assessment of the contribution of health surveillance to meet coal industry needs would be an essential part of determining a future health surveillance program. This research project has analysed the data made available between 1993 and 1998. All current coal industry employees have had at least one health assessment. The project examined how the centralised nature of the Health Scheme benefits industry by identi~)jng key health issues and exploring their dimensions on a scale not possible by corporate based health surveillance programs. There is a body of evidence that indicates that health awareness - on the scale of the individual, the work group and the industry is not a part of the mining industry culture. There is also growing evidence that there is a need for this culture to change and that some change is in progress. One element of this changing culture is a growth in the interest by the individual and the community in information on health status and benchmarks that are reasonably attainable. This interest opens the way for health education which contains personal, community and occupational elements. An important element of such education is the data on mine site health status. This project examined the role of health surveillance in the coal mining industry as a tool for generating the necessary information to promote an interest in health awareness. The Health Scheme Database provides the material for the bulk of the analysis of this project. After a preliminary scan of the data set, more detailed analysis was undertaken on key health and related safety issues that include respiratory disorders, hearing loss and high blood pressure. The data set facilitates control for confounding factors such as age and smoking status. Mines can be benchmarked to identify those mines with effective health management and those with particular challenges. While the study has confirmed the very low prevalence of restrictive airway disease such as pneu"moconiosis, it has demonstrated a need to examine in detail the emergence of obstructive airway disease such as bronchitis and emphysema which may be a consequence of the increasing use of high dust longwall technology. The power of the Health Database's electronic data management is demonstrated by linking the health data to other data sets such as injury data that is collected by the Department of l\1mes and Energy. The analysis examines serious strain -sprain injuries and has identified a marked difference between the underground and open cut sectors of the industry. The analysis also considers productivity and OHS data to examine the extent to which there is correlation between any pairs ofJpese and previously analysed health parameters. This project has demonstrated that the current structure of the Coal Industry Employees Health Scheme has largely delivered to mines and effective health screening process. At the same time, the centralised nature of data collection and analysis has provided to the mines, the unions and the government substantial statistical cross-sectional data upon which strategies to more effectively manage health and relates safety issues can be based.
Resumo:
Stream ciphers are encryption algorithms used for ensuring the privacy of digital telecommunications. They have been widely used for encrypting military communications, satellite communications, pay TV encryption and for voice encryption of both fixed lined and wireless networks. The current multi year European project eSTREAM, which aims to select stream ciphers suitable for widespread adoptation, reflects the importance of this area of research. Stream ciphers consist of a keystream generator and an output function. Keystream generators produce a sequence that appears to be random, which is combined with the plaintext message using the output function. Most commonly, the output function is binary addition modulo two. Cryptanalysis of these ciphers focuses largely on analysis of the keystream generators and of relationships between the generator and the keystream it produces. Linear feedback shift registers are widely used components in building keystream generators, as the sequences they produce are well understood. Many types of attack have been proposed for breaking various LFSR based stream ciphers. A recent attack type is known as an algebraic attack. Algebraic attacks transform the problem of recovering the key into a problem of solving multivariate system of equations, which eventually recover the internal state bits or the key bits. This type of attack has been shown to be effective on a number of regularly clocked LFSR based stream ciphers. In this thesis, algebraic attacks are extended to a number of well known stream ciphers where at least one LFSR in the system is irregularly clocked. Applying algebriac attacks to these ciphers has only been discussed previously in the open literature for LILI-128. In this thesis, algebraic attacks are first applied to keystream generators using stop-and go clocking. Four ciphers belonging to this group are investigated: the Beth-Piper stop-and-go generator, the alternating step generator, the Gollmann cascade generator and the eSTREAM candidate: the Pomaranch cipher. It is shown that algebraic attacks are very effective on the first three of these ciphers. Although no effective algebraic attack was found for Pomaranch, the algebraic analysis lead to some interesting findings including weaknesses that may be exploited in future attacks. Algebraic attacks are then applied to keystream generators using (p; q) clocking. Two well known examples of such ciphers, the step1/step2 generator and the self decimated generator are investigated. Algebraic attacks are shown to be very powerful attack in recovering the internal state of these generators. A more complex clocking mechanism than either stop-and-go or the (p; q) clocking keystream generators is known as mutual clock control. In mutual clock control generators, the LFSRs control the clocking of each other. Four well known stream ciphers belonging to this group are investigated with respect to algebraic attacks: the Bilateral-stop-and-go generator, A5/1 stream cipher, Alpha 1 stream cipher, and the more recent eSTREAM proposal, the MICKEY stream ciphers. Some theoretical results with regards to the complexity of algebraic attacks on these ciphers are presented. The algebraic analysis of these ciphers showed that generally, it is hard to generate the system of equations required for an algebraic attack on these ciphers. As the algebraic attack could not be applied directly on these ciphers, a different approach was used, namely guessing some bits of the internal state, in order to reduce the degree of the equations. Finally, an algebraic attack on Alpha 1 that requires only 128 bits of keystream to recover the 128 internal state bits is presented. An essential process associated with stream cipher proposals is key initialization. Many recently proposed stream ciphers use an algorithm to initialize the large internal state with a smaller key and possibly publicly known initialization vectors. The effect of key initialization on the performance of algebraic attacks is also investigated in this thesis. The relationships between the two have not been investigated before in the open literature. The investigation is conducted on Trivium and Grain-128, two eSTREAM ciphers. It is shown that the key initialization process has an effect on the success of algebraic attacks, unlike other conventional attacks. In particular, the key initialization process allows an attacker to firstly generate a small number of equations of low degree and then perform an algebraic attack using multiple keystreams. The effect of the number of iterations performed during key initialization is investigated. It is shown that both the number of iterations and the maximum number of initialization vectors to be used with one key should be carefully chosen. Some experimental results on Trivium and Grain-128 are then presented. Finally, the security with respect to algebraic attacks of the well known LILI family of stream ciphers, including the unbroken LILI-II, is investigated. These are irregularly clock- controlled nonlinear filtered generators. While the structure is defined for the LILI family, a particular paramater choice defines a specific instance. Two well known such instances are LILI-128 and LILI-II. The security of these and other instances is investigated to identify which instances are vulnerable to algebraic attacks. The feasibility of recovering the key bits using algebraic attacks is then investigated for both LILI- 128 and LILI-II. Algebraic attacks which recover the internal state with less effort than exhaustive key search are possible for LILI-128 but not for LILI-II. Given the internal state at some point in time, the feasibility of recovering the key bits is also investigated, showing that the parameters used in the key initialization process, if poorly chosen, can lead to a key recovery using algebraic attacks.
Resumo:
Keyword Spotting is the task of detecting keywords of interest within continu- ous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unre- stricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have su®ered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability. This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the ¯eld of keyword spotting. The ¯rst major contribution is the development of a novel keyword veri¯cation method named Cohort Word Veri¯cation. This method combines high level lin- guistic information with cohort-based veri¯cation techniques to obtain dramatic improvements in veri¯cation performance, in particular for the problematic short duration target word class. The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique aug- ments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains signi¯cant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds. The third major contribution is the study of multiple veri¯er fusion for the task of keyword veri¯cation. The reported experiments demonstrate that substantial improvements in veri¯cation performance can be obtained through the fusion of multiple keyword veri¯ers. The research focuses on combinations of speech background model based veri¯ers and cohort word veri¯ers. The ¯nal major contribution is a comprehensive study of the e®ects of limited training data for keyword spotting. This study is performed with consideration as to how these e®ects impact the immediate development and deployment of speech technologies for non-English languages.
Resumo:
In a seminal data mining article, Leo Breiman [1] argued that to develop effective predictive classification and regression models, we need to move away from the sole dependency on statistical algorithms and embrace a wider toolkit of modeling algorithms that include data mining procedures. Nevertheless, many researchers still rely solely on statistical procedures when undertaking data modeling tasks; the sole reliance on these procedures has lead to the development of irrelevant theory and questionable research conclusions ([1], p.199). We will outline initiatives that the HPC & Research Support group is undertaking to engage researchers with data mining tools and techniques; including a new range of seminars, workshops, and one-on-one consultations covering data mining algorithms, the relationship between data mining and the research cycle, and limitations and problems with these new algorithms. Organisational limitations and restrictions to these initiatives are also discussed.