900 results for Image processing, Microscopy, Histopathology, Classification, K-means
Abstract:
We propose a hybrid generative/discriminative framework for semantic parsing which combines the hidden vector state (HVS) model and hidden Markov support vector machines (HM-SVMs). The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. HM-SVMs combine the advantages of hidden Markov models and support vector machines. By employing a modified K-means clustering method, a small set of the most representative sentences can be automatically selected from an unannotated corpus. These sentences, together with their abstract annotations, are used to train an HVS model, which is subsequently applied to the whole corpus to generate semantic parsing results. The most confident semantic parsing results are selected to generate a fully annotated corpus, which is used to train the HM-SVMs. The proposed framework has been tested on the DARPA Communicator data. Experimental results show an improvement over the baseline HVS parser when the hybrid framework is used. Compared with HM-SVMs trained on the fully annotated corpus, the hybrid framework gave comparable performance with only a small set of lightly annotated sentences. © 2008. Licensed under the Creative Commons.
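The selection step in the framework above can be sketched as plain k-means over sentence feature vectors, keeping the sentence nearest each centroid as that cluster's representative. This is a minimal numpy sketch with invented feature vectors and a deterministic farthest-point initialisation, not the paper's modified K-means method.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    idx = [0]
    for _ in range(k - 1):  # seed each new centroid far from existing seeds
        d = np.min(np.linalg.norm(X[:, None] - X[idx][None], axis=2), axis=1)
        idx.append(int(d.argmax()))
    centroids = X[idx].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def representative_indices(X, k):
    """Index of the point closest to each centroid: one 'sentence' per cluster."""
    centroids, labels = kmeans(X, k)
    reps = []
    for j, c in enumerate(centroids):
        members = np.where(labels == j)[0]
        reps.append(int(members[np.argmin(np.linalg.norm(X[members] - c, axis=1))]))
    return sorted(reps)

# three tight groups of invented 2-D "sentence features"
rng = np.random.default_rng(1)
X = np.vstack([c + 0.1 * rng.standard_normal((10, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])
reps = representative_indices(X, 3)  # one representative per group
```

Only the selected `reps` would then need (light) annotation, which is the economy the framework relies on.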
Abstract:
Projection of a high-dimensional dataset onto a two-dimensional space is a useful tool for visualising structures and relationships in the dataset. However, a single two-dimensional visualisation may not display all the intrinsic structure. Therefore, hierarchical/multi-level visualisation methods have been used to extract a more detailed understanding of the data. Here we propose a multi-level Gaussian process latent variable model (MLGPLVM). MLGPLVM works by segmenting data (with, e.g., K-means, a Gaussian mixture model or interactive clustering) in the visualisation space and then fitting a visualisation model to each subset. To measure the quality of multi-level visualisation (with respect to parent and child models), metrics such as trustworthiness, continuity, mean relative rank errors, visualisation distance distortion and the negative log-likelihood per point are used. We evaluate the MLGPLVM approach on the ‘Oil Flow’ dataset and a dataset of protein electrostatic potentials for the human ‘Major Histocompatibility Complex (MHC) class I’. In both cases, visual observation and the quantitative quality measures show better visualisation at lower levels.
Abstract:
We present a test for identifying clusters in high dimensional data based on the k-means algorithm when the null hypothesis is spherical normal. We show that projection techniques used for evaluating validity of clusters may be misleading for such data. In particular, we demonstrate that increasingly well-separated clusters are identified as the dimensionality increases, when no such clusters exist. Furthermore, in a case of true bimodality, increasing the dimensionality makes identifying the correct clusters more difficult. In addition to the original conservative test, we propose a practical test with the same asymptotic behavior that performs well for a moderate number of points and moderate dimensionality. ACM Computing Classification System (1998): I.5.3.
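The dimensionality effect described above is easy to reproduce: run 2-means on pure spherical Gaussian noise and measure the gap between the two "clusters" along the centroid-difference direction. The numpy sketch below is our own toy setup, not the paper's test statistic; it shows the apparent separation growing with dimension even though no clusters exist.

```python
import numpy as np

def two_means_separation(n, d, seed=0, iters=100):
    """Fit 2-means to N(0, I_d) noise; return the standardised gap between
    the two clusters projected on the centroid-difference direction."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))        # pure noise: no real clusters
    centroids = X[:2].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    w = centroids[1] - centroids[0]
    proj = X @ (w / np.linalg.norm(w))     # project on the split direction
    a, b = proj[labels == 0], proj[labels == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(b.mean() - a.mean()) / pooled

sep_low = two_means_separation(n=60, d=2)     # low-dimensional noise
sep_high = two_means_separation(n=60, d=500)  # same n, many more dimensions
```

With the sample size fixed, the fitted split direction has ever more noise dimensions to exploit, so the projected "clusters" look better and better separated — exactly the trap the abstract warns about.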
Abstract:
Several indicators are in common use for measuring liquidity, each quantifying the phenomenon from a different point of view. This article analyses the various liquidity measures proposed in the literature using multivariate statistical methods: principal component analysis is applied to find the factors that best summarise the liquidity characteristics; we then examine to what extent the individual measures co-move with these factors and, based on the correlations, use a clustering procedure to find groups of measures with similar properties. We ask whether the variable groups formed from our sample coincide with the measures associated with the individual aspects of liquidity, and whether composite liquidity measures can be defined with which the phenomenon of liquidity can be measured in several dimensions. / === / Liquidity is measured from different aspects (e.g. tightness, depth, and resiliency) by different ratios. We studied the co-movements and the clustering of different liquidity measures on a sample of the Swiss stock market. We performed a PCA to obtain the main factors that explain the cross-sectional variability of liquidity measures, and we used the k-means clustering methodology to define groups of liquidity measures. Based on our explorative data analysis, we formed clusters of liquidity measures and compared the resulting groups with our expectations and intuition. Our modelling methodology provides a framework to analyse the correlation between the different aspects of liquidity as well as a means to define complex liquidity measures.
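The PCA step of such an analysis can be sketched in numpy: standardise the measures, eigendecompose their correlation matrix, and group each measure by the component it loads on most strongly. The "tightness" and "depth" factors and all data below are invented for illustration; they are not the article's sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
tightness = rng.standard_normal(n)  # hypothetical latent factor 1
depth = rng.standard_normal(n)      # hypothetical latent factor 2

# six synthetic "liquidity measures": three per factor, different noise levels
X = np.column_stack(
    [tightness + 0.2 * rng.standard_normal(n) for _ in range(3)]
    + [depth + 0.5 * rng.standard_normal(n) for _ in range(3)]
)

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each measure
R = np.corrcoef(Z, rowvar=False)          # correlation matrix of the measures
vals, vecs = np.linalg.eigh(R)            # eigh returns ascending eigenvalues
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# loadings on the first two components; group measures by strongest loading
loadings = vecs[:, :2] * np.sqrt(vals[:2])
groups = np.abs(loadings).argmax(axis=1)
explained = vals[:2].sum() / vals.sum()   # share of variance in two factors
```

The `groups` vector recovers which measures belong together, mirroring the article's comparison of data-driven variable groups with the a priori liquidity aspects.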
Abstract:
This dissertation reports the results of a study that examined differences between genders in a sample of adolescents from a residential substance abuse treatment facility. The sample included 72 males and 65 females, ages 12 through 17. The data were archival, having originally been collected for a study of elopement from treatment. The current study included 23 variables drawn from multiple dimensions, including socioeconomic, legal, school, family, substance abuse, psychological, social support, and treatment histories. Collectively, they provided information about problem behaviors and psychosocial problems that are correlates of adolescent substance abuse. The study hypothesized that these problem behaviors and psychosocial problems exist in different patterns and combinations between genders. Further, it expected that these patterns and combinations would constitute profiles important for treatment. K-means cluster analysis identified differential profiles between genders in all three areas: problem behaviors, psychosocial problems, and treatment profiles. In the dimension of problem behaviors, the predominantly female group was characterized as suicidal and destructive, while the predominantly male group was identified as aggressive and low achieving. In the dimension of psychosocial problems, the predominantly female group was characterized as abused depressives, while the male group was identified as asocial with low problem severity. A third group, neither predominantly female nor male, was characterized as social with high problem severity. When these dimensions were combined to form treatment profiles, the predominantly female group was characterized as abused, self-harmful, and social, and the male group was identified as aggressive, destructive, low achieving, and asocial.
Finally, logistic regression and discriminant analysis were used to determine whether a history of sexual and physical abuse affected problem behavior differentially between genders. Sexual abuse had a substantially greater influence in producing self-mutilating and suicidal behavior among females than among males. Additionally, a model including sexual abuse, physical abuse, low family support, and low support from friends showed a moderate capacity to predict unusual harmful behavior (fire-starting and cruelty to animals) among males. Implications for social work practice, social work research, and systems science are discussed.
Abstract:
The purpose of the study was to examine the relationship between teacher beliefs and actual classroom practice in early literacy instruction. Conjoint analysis was used to measure teachers' beliefs about four early literacy factors: phonological awareness, print awareness, graphophonic awareness, and structural awareness. A collective case study format was then used to measure the correspondence of teachers' beliefs with their actual classroom practice. Ninety Project READS participants were given twelve cards, in an orthogonal experimental design, describing students who either met or did not meet criteria on the four early literacy factors. Conjoint measurements of whether the student is an efficient reader were taken. These measurements provided relative importance scores for each respondent. Based on the relative importance scores, four teachers were chosen to participate in a collective case study. The conjoint results enabled the clustering of teachers into four distinct groups, each aligned with one of the four early literacy factors. K-means cluster analysis of the relative importance measurements showed commonalities among the ninety respondents' beliefs. The collective case study results were mixed. Implications for researchers and practitioners include the use of conjoint analysis in measuring teacher beliefs about the four early literacy factors. Further, understanding teacher preferences regarding these beliefs may assist in curriculum design and therefore increase educational effectiveness. Finally, comparisons between teachers' beliefs about the four early literacy factors and their actual instructional practices may facilitate teacher self-reflection, thus encouraging positive teacher change.
Abstract:
The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data is used appropriately and effectively, knowledge discovery can be achieved better than is possible from a single source. Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources, representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools SVM and KNN were used to successfully distinguish between several soil samples. The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to perform better than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from the K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm.
The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database, called PlasmoTFBM and focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface that allows biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.
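The constraint idea behind GCC can be illustrated with a COP-k-means-style assignment step, where each point takes the nearest centroid that does not violate a cannot-link constraint. This is a simplified stand-in sketch, not the dissertation's GCC algorithm.

```python
import numpy as np

def constrained_kmeans(X, k, cannot_link, iters=20, seed=0):
    """k-means where a point may not share a cluster with any partner
    listed in `cannot_link` (pairs of indices); greedy assignment order."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(iters):
        labels[:] = -1
        for i in range(len(X)):
            order = np.argsort(np.linalg.norm(centroids - X[i], axis=1))
            for j in order:  # try the nearest centroid first
                partners = [b if a == i else a
                            for a, b in cannot_link if i in (a, b)]
                if all(labels[p] != j for p in partners):
                    labels[i] = j
                    break
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# two tight groups; points 0 and 1 sit together but are forced apart
rng = np.random.default_rng(2)
X = np.vstack([c + 0.1 * rng.standard_normal((5, 2)) for c in [(0, 0), (8, 0)]])
labels = constrained_kmeans(X, 2, cannot_link=[(0, 1)])
```

The constraint acts exactly like the "incomplete data" the dissertation describes: side information that restricts otherwise purely distance-driven assignments.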
Abstract:
This dissertation introduces a new approach for assessing the effects of pediatric epilepsy on the language connectome. Two novel data-driven network construction approaches are presented. These methods rely on connecting different brain regions using either the extent or the intensity of language-related activations as identified by independent component analysis of fMRI data. An auditory description decision task (ADDT) paradigm was used to activate the language network for 29 patients and 30 controls recruited from three major pediatric hospitals. Empirical evaluations illustrated that pediatric epilepsy can cause, or is associated with, a reduction in network efficiency. Patients showed a propensity to inefficiently employ the whole brain network to perform the ADDT language task; by contrast, controls seemed to efficiently use smaller segregated network components to achieve the same task. To explain the causes of the decreased efficiency, graph theoretical analysis was carried out. The analysis revealed no substantial global network feature differences between the patient and control groups. It also showed that for both subject groups the language network exhibited small-world characteristics; however, the patients' extent-of-activation network showed a tendency towards more random networks. It was also shown that the intensity-of-activation network displayed ipsilateral hub reorganization at the local level. The left hemispheric hubs displayed greater centrality values for patients, whereas the right hemispheric hubs displayed greater centrality values for controls. This hub hemispheric disparity was not correlated with the atypical right language laterality found in six patients. Finally, it was shown that a multi-level unsupervised clustering scheme based on self-organizing maps (a type of artificial neural network) and k-means was able to fairly and blindly separate the subjects into their respective patient or control groups.
The clustering was initiated using the local nodal centrality measurements only. Compared to the extent-of-activation network, clustering on the intensity-of-activation network demonstrated better precision. This outcome supports the assertion that the local centrality differences presented by the intensity-of-activation network can be associated with focal epilepsy.
Abstract:
The aim of the present study was to trace the mortality profile of the elderly in Brazil using two age groups: 60 to 69 years (young-old) and 80 years or more (oldest-old). To do this, we sought to characterize the trends and distinctions of the different mortality profiles, as well as the quality of the data and associations with socioeconomic and sanitary conditions in the micro-regions of Brazil. Data were collected from the Mortality Information System (SIM) and the Brazilian Institute of Geography and Statistics (IBGE). Based on these data, mortality coefficients were calculated for the chapters of the International Classification of Diseases (ICD-10). A polynomial regression model was used to ascertain the trend of the main chapters. Non-hierarchical cluster analysis (K-means) was used to obtain the profiles for the different Brazilian micro-regions. Factor analysis of the contextual variables was used to obtain the socioeconomic and sanitary deprivation indices (IPSS). The trend of the CMId and of the ratio of its values in the two age groups confirmed a decrease in most of the indicators, particularly for ill-defined causes among the oldest-old. Among the young-old, the following profiles emerged: the Development Profile, the Modernity Profile, the Epidemiological Paradox Profile and the Ignorance Profile. Among the oldest-old, the latter three profiles were confirmed, in addition to the Low Mortality Rates Profile. When comparing the mean IPSS values in global terms, all of the groups differed in both age groups. The Ignorance Profile was compared with the other profiles using orthogonal contrasts. This profile differed from all of the others, both in isolation and in clusters. However, the mean IPSS was similar for the Low Mortality Rates Profile among the oldest-old.
Furthermore, associations were found between the data quality indicators, the CMId for ill-defined causes, the general mortality coefficient for each age group (CGMId) and the IPSS of the micro-regions. The worst rates were recorded in areas with the greatest socioeconomic and sanitary deprivation. The findings of the present study show that, despite the decrease in the mortality coefficients, there are notable differences between the profiles related to contextual conditions, including regional differences in data quality. These differences increase the vulnerability of the age groups studied and the health inequities that are already present.
Abstract:
This study subdivides the Weddell Sea, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting and quantifying biogeochemical processes and biodiversity between ocean regions, both geographically and for regions undergoing development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of the prevailing environmental conditions. The analysis uses 28 environmental variables for the sea surface, 25 variables for the seabed and 9 variables for the analysis between surface and bottom variables. The data were collected during the years 1983-2013; some data were interpolated. The statistical errors of several interpolation methods (e.g. IDW, Indicator, Ordinary and Co-Kriging) with varying settings were compared to identify the most suitable method. The multivariate mathematical procedures used are regionalized classification via k-means cluster analysis, canonical-correlation analysis and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the study area. Several methods for identifying the optimum number of clusters were tested. For the seabed, 8 and 12 clusters were identified as reasonable numbers for clustering the Weddell Sea; for the sea surface, 8 and 13; and for the top/bottom analysis, 8 and 3, respectively. Additionally, results for 20 clusters are presented for the three alternatives, offering the first small-scale environmental regionalization of the Weddell Sea. The results for 12 clusters in particular identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and those dominated by river discharge.
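A common way to screen candidate cluster numbers, as done above, is to run k-means for each k and inspect the within-cluster sum of squares (the "elbow" method). Below is a minimal numpy sketch on synthetic data with three obvious groups; the oceanographic variables themselves are not reproduced, and the abstract's own model-selection methods may differ.

```python
import numpy as np

def kmeans_sse(X, k, iters=50):
    """Lloyd's k-means (deterministic farthest-point init); returns the
    within-cluster sum of squares for a given k."""
    idx = [0]
    for _ in range(k - 1):  # seed each new centroid far from existing seeds
        d = np.min(np.linalg.norm(X[:, None] - X[idx][None], axis=2), axis=1)
        idx.append(int(d.argmax()))
    centroids = X[idx].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return float(((X - centroids[labels]) ** 2).sum())

# synthetic stand-in data: three tight groups
rng = np.random.default_rng(3)
X = np.vstack([c + 0.1 * rng.standard_normal((10, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])
sse = {k: kmeans_sse(X, k) for k in (1, 2, 3, 4)}
# the curve drops sharply up to the true number of groups, then flattens
```

The elbow at k = 3 marks the point where adding clusters stops buying much compactness, which is the intuition behind choosing, e.g., 8 or 12 clusters for the seabed.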
Abstract:
Skeletal muscle consists of muscle fiber types that have different physiological and biochemical characteristics. Broadly, muscle fibers can be classified into type I and type II, which differ, among other features, in contraction speed and sensitivity to fatigue. These fibers coexist in skeletal muscles, and their relative proportions are modulated according to muscle function and the stimuli to which the muscle is subjected. To identify the proportions of the different fiber types in muscle composition, many studies use biopsy as the standard procedure. Since surface electromyography (sEMG) allows information to be extracted about the recruitment of different motor units, this study is based on the assumption that sEMG can be used to identify different proportions of fiber types in a muscle. The goal of this study was to identify the characteristics of the EMG signal that can distinguish, most precisely, different proportions of fiber types. The combination of characteristics using appropriate mathematical models was also investigated. To achieve this objective, signals were simulated with different proportions of recruited motor units and different signal-to-noise ratios. Thirteen time- and frequency-domain characteristics were extracted from the emulated signals. The results for each extracted feature were submitted to the k-means clustering algorithm to separate the different proportions of motor units recruited in the emulated signals. Mathematical techniques (confusion matrix and capability analysis) were applied to select the characteristics able to identify different proportions of muscle fiber types. As a result, the mean frequency and median frequency were selected as able to distinguish, with greatest precision, the proportions of the different muscle fiber types.
Subsequently, the most capable features were analysed jointly through principal component analysis. Two principal components were obtained for the signals emulated without noise (CP1 and CP2) and two for the noisy signals (CP1′ and CP2′). The first principal components (CP1 and CP1′) were identified as able to distinguish different proportions of muscle fiber types. The selected characteristics (median frequency, mean frequency, CP1 and CP1′) were then used to analyse real sEMG signals, comparing sedentary people with physically active people who practise strength training (weight training). The results obtained with the different groups of volunteers show that the physically active people had higher values of mean frequency, median frequency and principal components than the sedentary people. Moreover, these values decreased with increasing power level for both groups, although the decline was more accentuated for the physically active group. Based on these results, it is assumed that the volunteers in the physically active group have higher proportions of type II fibers than the sedentary people. Finally, we can conclude that the selected characteristics were able to distinguish different proportions of muscle fiber types, both for the emulated signals and for the real signals. These characteristics can be used in several settings, for example to evaluate the progress of people with myopathies and neuromyopathies undergoing physiotherapy, or to track the development of athletes seeking to improve their muscle capacity for their sport. In both cases, extracting these characteristics from surface electromyography signals provides feedback to the physiotherapist or physical coach, who can monitor the increase in the proportion of a given fiber type, as desired in each case.
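The two selected features can be computed directly from the signal's power spectrum. A self-contained numpy sketch on synthetic tones (not real sEMG recordings): the mean frequency (MNF) is the power-weighted average frequency, and the median frequency (MDF) is the frequency that splits the spectral power in half.

```python
import numpy as np

def spectral_features(signal, fs):
    """Mean frequency (MNF) and median frequency (MDF) of `signal`
    sampled at `fs` Hz, from the one-sided FFT power spectrum."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    mnf = np.sum(freqs * power) / np.sum(power)   # power-weighted mean
    cumulative = np.cumsum(power)
    mdf = freqs[np.searchsorted(cumulative, cumulative[-1] / 2.0)]
    return float(mnf), float(mdf)

fs = 1000                        # sampling rate in Hz
t = np.arange(0, 1, 1.0 / fs)
tone = np.sin(2 * np.pi * 50 * t)                         # pure 50 Hz tone
mix = np.sin(2 * np.pi * 50 * t) + 2 * np.sin(2 * np.pi * 150 * t)

mnf_tone, mdf_tone = spectral_features(tone, fs)  # both sit at 50 Hz
mnf_mix, mdf_mix = spectral_features(mix, fs)     # MNF pulled towards 150 Hz
```

For the two-tone mix, the 150 Hz component carries four times the power of the 50 Hz one, so MNF lands at (50·1 + 150·4)/5 = 130 Hz while MDF jumps to the 150 Hz bin: the two statistics respond differently to the spectral shape, which is why both were retained.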
Abstract:
Modelling industrial systems offers organizations a strategic advantage in the study of their production processes. Through modelling it is possible to increase knowledge about these systems, which may in turn allow improvements in production management and planning. This knowledge may also increase the efficiency of production processes through the mitigation or elimination of the main losses detected in the process. The main objective of this work is the development and validation of a modelling, forecasting and analysis tool for industrial production systems, with a view to increasing knowledge about them. For the execution and development of this work, several tools, concepts, methodologies and theoretical foundations known from the literature were used and developed, such as OEE (Overall Equipment Effectiveness), RdP (Redes de Petri, i.e. Petri nets), time series, k-means and SPC (Statistical Process Control). The modelling, forecasting and analysis tool developed and described in this work proved capable of assisting in the detection and interpretation of the causes that influence the results of the production system and give rise to losses, demonstrating the expected advantages. These results were based on real data from a production system.
Abstract:
Understanding the impact of atmospheric black carbon (BC) containing particles on human health and radiative forcing requires knowledge of the mixing state of BC, including the characteristics of the materials with which it is internally mixed. In this study, we demonstrate for the first time the capabilities of the Aerodyne Soot-Particle Aerosol Mass Spectrometer equipped with a light scattering module (LS-SP-AMS) to examine the mixing state of refractory BC (rBC) and other aerosol components in an urban environment (downtown Toronto). K-means clustering analysis was used to classify single-particle mass spectra into chemically distinct groups. One resultant cluster is dominated by rBC mass spectral signals (C1+ to C5+), while the organic signals fall into a few major clusters, identified as hydrocarbon-like organic aerosol (HOA), oxygenated organic aerosol (OOA), and cooking-emission organic aerosol (COA). Nearly external mixing is observed, with small BC particles only thinly coated by HOA (~28% by mass on average), while over 90% of the HOA-rich particles did not contain detectable amounts of rBC. Most of the particles classified into other inorganic and organic clusters were not significantly associated with BC. The single-particle results also suggest that HOA and COA emitted from anthropogenic sources were likely major contributors to organic-rich particles with low to mid-range aerodynamic diameter (dva). The similar temporal profiles and mass spectral features of the organic clusters and the factors from a positive matrix factorization (PMF) analysis of the ensemble aerosol dataset validate the conventional interpretation of the PMF results.
Abstract:
Ground Delay Programs (GDPs) are sometimes cancelled before their initially planned end time, with the result that aircraft are delayed when it is no longer needed. Recovering this delay usually leads to extra fuel consumption, since the aircraft will typically depart after having absorbed their assigned delay on the ground and will therefore need to cruise at more fuel-consuming speeds. Past research has proposed a speed reduction strategy aimed at splitting the GDP-assigned delay between ground and airborne delay while using the same fuel as in nominal conditions. Being airborne earlier, an aircraft can speed up to its nominal cruise speed and recover part of the GDP delay without incurring extra fuel consumption if the GDP is cancelled earlier than planned. In this paper, all GDP initiatives that occurred at San Francisco International Airport during 2006 are studied and grouped by a K-means algorithm into three different clusters. The centroids of these three clusters were used to simulate three different GDPs at the airport, using a realistic set of inbound traffic and the Future Air Traffic Management Concepts Evaluation Tool (FACET). The amount of delay that can be recovered using this cruise speed reduction technique, as a function of the GDP cancellation time, was computed and compared with the delay recovered under the current concept of operations. Simulations were conducted in a calm wind situation and without considering a radius of exemption. Results indicate that when aircraft depart early and fly at the slower speed, they can recover additional delay in the event that the GDP cancels early, compared to current operations where all delay is absorbed prior to take-off. The amount of extra delay recovered varies, and is more significant, in relative terms, for those GDPs in which demand exceeds airport capacity by a relatively small amount.
Abstract:
Clustering algorithms, pattern mining techniques and associated quality metrics have emerged as reliable methods for modeling learners’ performance, comprehension and interaction in given educational scenarios. The specific character of the available data, such as missing values, extreme values or outliers, makes it challenging to extract significant user models from an educational perspective. In this paper we introduce a pattern detection mechanism within our data analytics tool based on k-means clustering and on the SSE, silhouette, Dunn index and Xie-Beni index quality metrics. Experiments performed on a dataset obtained from our online e-learning platform show that the extracted interaction patterns were representative for classifying learners. Furthermore, the performed monitoring activities created a strong basis for generating automatic feedback to learners regarding their course participation, based on their previous performance. In addition, our analysis introduces automatic triggers that highlight learners at risk of failing the course, enabling tutors to take timely action.
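Of the quality metrics listed, the Xie-Beni index is the least widely known; it rates a clustering by compactness over separation (lower is better). A minimal numpy sketch on toy data, not the platform's own implementation:

```python
import numpy as np

def xie_beni(X, labels, k):
    """Xie-Beni index: total squared within-cluster distance divided by
    n times the minimum squared distance between centroids."""
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    compactness = np.sum((X - centroids[labels]) ** 2)
    min_sep = min(np.sum((centroids[i] - centroids[j]) ** 2)
                  for i in range(k) for j in range(i + 1, k))
    return compactness / (len(X) * min_sep)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
good = np.array([0, 0, 1, 1])   # labels matching the two tight groups
bad = np.array([0, 1, 0, 1])    # labels mixing the groups

xb_good = xie_beni(X, good, 2)  # tiny: tight clusters, well separated
xb_bad = xie_beni(X, bad, 2)    # large: loose clusters, centroids nearly touch
```

Comparing such scores across candidate clusterings (together with SSE, silhouette and the Dunn index) is how the paper's tool picks patterns that are actually representative of learner groups.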