121 resultados para Automatic Analysis of Multivariate Categorical Data Sets


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Chromatographic fingerprints of 46 Eucommia Bark samples were obtained by liquid chromatography-diode array detector (LC-DAD). These samples were collected from eight provinces in China, with different geographical locations, and climates. Seven common LC peaks that could be used for fingerprinting this common popular traditional Chinese medicine were found, and six were identified as substituted resinols (4 compounds), geniposidic acid and chlorogenic acid by LC-MS. Principal components analysis (PCA) indicated that samples from the Sichuan, Hubei, Shanxi and Anhui—the SHSA provinces, clustered together. The other objects from the four provinces, Guizhou, Jiangxi, Gansu and Henan, were discriminated and widely scattered on the biplot in four province clusters. The SHSA provinces are geographically close together while the others are spread out. Thus, such results suggested that the composition of the Eucommia Bark samples was dependent on their geographic location and environment. In general, the basis for discrimination on the PCA biplot from the original 46 objects× 7 variables data matrix was the same as that for the SHSA subset (36 × 7 matrix). The seven marker compound loading vectors grouped into three sets: (1) three closely correlating substituted resinol compounds and chlorogenic acid; (2) the fourth resinol compound identified by the OCH3 substituent in the R4 position, and an unknown compound; and (3) the geniposidic acid, which was independent of the set 1 variables, and which negatively correlated with the set 2 ones above. These observations from the PCA biplot were supported by hierarchical cluster analysis, and indicated that Eucommia Bark preparations may be successfully compared with the use of the HPLC responses from the seven marker compounds and chemometric methods such as PCA and the complementary hierarchical cluster analysis (HCA).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This article explores two matrix methods to induce the ``shades of meaning" (SoM) of a word. A matrix representation of a word is computed from a corpus of traces based on the given word. Non-negative Matrix Factorisation (NMF) and Singular Value Decomposition (SVD) compute a set of vectors corresponding to a potential shade of meaning. The two methods were evaluated based on loss of conditional entropy with respect to two sets of manually tagged data. One set reflects concepts generally appearing in text, and the second set comprises words used for investigations into word sense disambiguation. Results show that for NMF consistently outperforms SVD for inducing both SoM of general concepts as well as word senses. The problem of inducing the shades of meaning of a word is more subtle than that of word sense induction and hence relevant to thematic analysis of opinion where nuances of opinion can arise.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Recent studies have shown that delusion-like experiences (DLEs) are common among general populations. This study investigates whether the prevalence of these experiences are linked to the embracing of New Age thought. Logistic regression analyses were performed using data derived from a large community sample of young adults (N = 3777). Belief in a spiritual or higher power other than God was found to be significantly associated with endorsement of 16 of 19 items from Peters et al. (1999b) Delusional Inventory following adjustment for a range of potential confounders, while belief in God was associated with endorsement of four items. A New Age conception of the divine appears to be strongly associated with a wide range of DLEs. Further research is needed to determine a causal link between New Age philosophy and DLEs (e.g. thought disturbance, suspiciousness, and delusions of grandeur).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Purpose: This study explored the spatial distribution of notified cryptosporidiosis cases and identified major socioeconomic factors associated with the transmission of cryptosporidiosis in Brisbane, Australia. Methods: We obtained the computerized data sets on the notified cryptosporidiosis cases and their key socioeconomic factors by statistical local area (SLA) in Brisbane for the period of 1996 to 2004 from the Queensland Department of Health and Australian Bureau of Statistics, respectively. We used spatial empirical Bayes rates smoothing to estimate the spatial distribution of cryptosporidiosis cases. A spatial classification and regression tree (CART) model was developed to explore the relationship between socioeconomic factors and the incidence rates of cryptosporidiosis. Results: Spatial empirical Bayes analysis reveals that the cryptosporidiosis infections were primarily concentrated in the northwest and southeast of Brisbane. A spatial CART model shows that the relative risk for cryptosporidiosis transmission was 2.4 when the value of the social economic index for areas (SEIFA) was over 1028 and the proportion of residents with low educational attainment in an SLA exceeded 8.8%. Conclusions: There was remarkable variation in spatial distribution of cryptosporidiosis infections in Brisbane. Spatial pattern of cryptosporidiosis seems to be associated with SEIFA and the proportion of residents with low education attainment.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Transport Certification Australia Limited, jointly with the National Transport Commission, has undertaken a project to investigate the feasibility of on-board mass monitoring (OBM) devices for regulatory purposes. OBM increases jurisdictional confidence in operational heavy vehicle compliance. This paper covers technical issues regarding potential use of dynamic data from OBM systems to indicate that tampering has occurred. Tamper-evidence and accuracy of current OBM systems needed to be determined before any regulatory schemes were put in place for its use. Tests performed to determine potential for, and ease of, tampering. An algorithm was developed to detect tamper events. Its results are detailed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by the increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia related complications. Stress electrocardiography/exercise testing is predictive of 10 year risk of CVD events and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to this data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). Because of the unbalanced class distribution in the data, majority class under-sampling and Kappa statistic together with misclassification rate and area under the ROC curve (AUC) are used for evaluation of models generated using different prediction algorithms. The performance based on models derived from feature reduced datasets reveal the filter method, Cfs subset evaluation, to be most consistently effective although Consistency derived subsets tended to slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This could be eliminated by consideration of the AUC or Kappa statistic as well by evaluation of subsets with under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62 with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection results in reduction in MR to 9.8 to 10.16 with time segmented summary data (dataset F) MR being 9.8 and raw time-series summary data (dataset A) being 9.92. However, for all time-series only based datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series alone datasets but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-cross validation method. For models based on Cfs selected time-series derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36 with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time segmented time-series variables and RF) being 9.09. The models based on counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (Dataset RF_ G) perform the least well with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28 while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are most comprehensible and clinically relevant. The predictive accuracy increase achieved by addition of risk factor variables to time-series variable based models is significant. The addition of time-series derived variables to models based on risk factor variables alone is associated with a trend to improved performance. Data mining of feature reduced, anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables when compared to use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The effect of the pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input. In the absence of risk factor input, the use of time-series variables after outlier removal and time series variables based on physiological variable values’ being outside the accepted normal range is associated with some improvement in model performance.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Type unions, pointer variables and function pointers are a long standing source of subtle security bugs in C program code. Their use can lead to hard-to-diagnose crashes or exploitable vulnerabilities that allow an attacker to attain privileged access over classified data. This paper describes an automatable framework for detecting such weaknesses in C programs statically, where possible, and for generating assertions that will detect them dynamically, in other cases. Exclusively based on analysis of the source code, it identifies required assertions using a type inference system supported by a custom made symbol table. In our preliminary findings, our type system was able to infer the correct type of unions in different scopes, without manual code annotations or rewriting. Whenever an evaluation is not possible or is difficult to resolve, appropriate runtime assertions are formed and inserted into the source code. The approach is demonstrated via a prototype C analysis tool.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Most information retrieval (IR) models treat the presence of a term within a document as an indication that the document is somehow "about" that term, they do not take into account when a term might be explicitly negated. Medical data, by its nature, contains a high frequency of negated terms - e.g. "review of systems showed no chest pain or shortness of breath". This papers presents a study of the effects of negation on information retrieval. We present a number of experiments to determine whether negation has a significant negative affect on IR performance and whether language models that take negation into account might improve performance. We use a collection of real medical records as our test corpus. Our findings are that negation has some affect on system performance, but this will likely be confined to domains such as medical data where negation is prevalent.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The Queensland Department of Main Roads uses Weigh-in-Motion (WiM) devices to covertly monitor (at highway speed) axle mass, axle configurations and speed of heavy vehicles on the road network. Such data is critical for the planning and design of the road network. Some of the data appears excessively variable. The current work considers the nature, magnitude and possible causes of WiM data variability. Over fifty possible causes of variation in WiM data have been identified in the literature. Data exploration has highlighted five basic types of variability specifically: ----- • cycling, both diurnal and annual;----- • consistent but unreasonable data;----- • data jumps;----- • variations between data from opposite sides of the one road; and ----- • non-systematic variations.----- This work is part of wider research into procedures to eliminate or mitigate the influence of WiM data variability.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Many initiatives to improve Business processes are emerging. The essential roles and contributions of Business Analyst (BA) and Business Process Management (BPM) professionals to such initiatives have been recognized in literature and practice. The roles and responsibilities of a BA or BPM practitioner typically require different skill-sets; however these differences are often vague. This vagueness creates much confusion in practice and academia. While both the BA and BPM communities have made attempts to describe their domains through capability defining empirical research and developments of Bodies of knowledge, there has not yet been any attempt to identify the commonality of skills required and points of uniqueness between the two professions. This study aims to address this gap and presents the findings of a detailed content mapping exercise (using NVivo as a qualitative data analysis tool) of the International Institution of Business Analysis (IIBA®) Guide to the Business Analysis Body of Knowledge (BABOK® Guide) against core BPM competency and capability frameworks.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Data flow analysis techniques can be used to help assess threats to data confidentiality and integrity in security critical program code. However, a fundamental weakness of static analysis techniques is that they overestimate the ways in which data may propagate at run time. Discounting large numbers of these false-positive data flow paths wastes an information security evaluator's time and effort. Here we show how to automatically eliminate some false-positive data flow paths by precisely modelling how classified data is blocked by certain expressions in embedded C code. We present a library of detailed data flow models of individual expression elements and an algorithm for introducing these components into conventional data flow graphs. The resulting models can be used to accurately trace byte-level or even bit-level data flow through expressions that are normally treated as atomic. This allows us to identify expressions that safely downgrade their classified inputs and thereby eliminate false-positive data flow paths from the security evaluation process. To validate the approach we have implemented and tested it in an existing data flow analysis toolkit.