882 resultados para Machine Learning. Semissupervised learning. Multi-label classification. Reliability Parameter


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The real purpose of collecting big data is to identify causality in the hope that this will facilitate credible predictivity . But the search for causality can trap one into infinite regress, and thus one takes refuge in seeking associations between variables in data sets. Regrettably, the mere knowledge of associations does not enable predictivity. Associations need to be embedded within the framework of probability calculus to make coherent predictions. This is so because associations are a feature of probability models, and hence they do not exist outside the framework of a model. Measures of association, like correlation, regression, and mutual information merely refute a preconceived model. Estimated measures of associations do not lead to a probability model; a model is the product of pure thought. This paper discusses these and other fundamentals that are germane to seeking associations in particular, and machine learning in general. ACM Computing Classification System (1998): H.1.2, H.2.4., G.3.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data are used appropriately and effectively, knowledge discovery can be better achieved than what is possible from only a single source. ^ Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources; representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. ^ The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools, SVM and KNN, were used to successfully distinguish between several soil samples. ^ The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to have a better performance than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. ^ The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. ^ The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. ^ The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The rapid growth of the Internet and the advancements of the Web technologies have made it possible for users to have access to large amounts of on-line music data, including music acoustic signals, lyrics, style/mood labels, and user-assigned tags. The progress has made music listening more fun, but has raised an issue of how to organize this data, and more generally, how computer programs can assist users in their music experience. An important subject in computer-aided music listening is music retrieval, i.e., the issue of efficiently helping users in locating the music they are looking for. Traditionally, songs were organized in a hierarchical structure such as genre->artist->album->track, to facilitate the users’ navigation. However, the intentions of the users are often hard to be captured in such a simply organized structure. The users may want to listen to music of a particular mood, style or topic; and/or any songs similar to some given music samples. This motivated us to work on user-centric music retrieval system to improve users’ satisfaction with the system. The traditional music information retrieval research was mainly concerned with classification, clustering, identification, and similarity search of acoustic data of music by way of feature extraction algorithms and machine learning techniques. More recently the music information retrieval research has focused on utilizing other types of data, such as lyrics, user-access patterns, and user-defined tags, and on targeting non-genre categories for classification, such as mood labels and styles. This dissertation focused on investigating and developing effective data mining techniques for (1) organizing and annotating music data with styles, moods and user-assigned tags; (2) performing effective analysis of music data with features from diverse information sources; and (3) recommending music songs to the users utilizing both content features and user access patterns.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Several are the areas in which digital images are used in solving day-to-day problems. In medicine the use of computer systems have improved the diagnosis and medical interpretations. In dentistry it’s not different, increasingly procedures assisted by computers have support dentists in their tasks. Set in this context, an area of dentistry known as public oral health is responsible for diagnosis and oral health treatment of a population. To this end, oral visual inspections are held in order to obtain oral health status information of a given population. From this collection of information, also known as epidemiological survey, the dentist can plan and evaluate taken actions for the different problems identified. This procedure has limiting factors, such as a limited number of qualified professionals to perform these tasks, different diagnoses interpretations among other factors. Given this context came the ideia of using intelligent systems techniques in supporting carrying out these tasks. Thus, it was proposed in this paper the development of an intelligent system able to segment, count and classify teeth from occlusal intraoral digital photographic images. The proposed system makes combined use of machine learning techniques and digital image processing. We first carried out a color-based segmentation on regions of interest, teeth and non teeth, in the images through the use of Support Vector Machine. After identifying these regions were used techniques based on morphological operators such as erosion and transformed watershed for counting and detecting the boundaries of the teeth, respectively. With the border detection of teeth was possible to calculate the Fourier descriptors for their shape and the position descriptors. Then the teeth were classified according to their types through the use of the SVM from the method one-against-all used in multiclass problem. The multiclass classification problem has been approached in two different ways. In the first approach we have considered three class types: molar, premolar and non teeth, while the second approach were considered five class types: molar, premolar, canine, incisor and non teeth. The system presented a satisfactory performance in the segmenting, counting and classification of teeth present in the images.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Several are the areas in which digital images are used in solving day-to-day problems. In medicine the use of computer systems have improved the diagnosis and medical interpretations. In dentistry it’s not different, increasingly procedures assisted by computers have support dentists in their tasks. Set in this context, an area of dentistry known as public oral health is responsible for diagnosis and oral health treatment of a population. To this end, oral visual inspections are held in order to obtain oral health status information of a given population. From this collection of information, also known as epidemiological survey, the dentist can plan and evaluate taken actions for the different problems identified. This procedure has limiting factors, such as a limited number of qualified professionals to perform these tasks, different diagnoses interpretations among other factors. Given this context came the ideia of using intelligent systems techniques in supporting carrying out these tasks. Thus, it was proposed in this paper the development of an intelligent system able to segment, count and classify teeth from occlusal intraoral digital photographic images. The proposed system makes combined use of machine learning techniques and digital image processing. We first carried out a color-based segmentation on regions of interest, teeth and non teeth, in the images through the use of Support Vector Machine. After identifying these regions were used techniques based on morphological operators such as erosion and transformed watershed for counting and detecting the boundaries of the teeth, respectively. With the border detection of teeth was possible to calculate the Fourier descriptors for their shape and the position descriptors. Then the teeth were classified according to their types through the use of the SVM from the method one-against-all used in multiclass problem. The multiclass classification problem has been approached in two different ways. In the first approach we have considered three class types: molar, premolar and non teeth, while the second approach were considered five class types: molar, premolar, canine, incisor and non teeth. The system presented a satisfactory performance in the segmenting, counting and classification of teeth present in the images.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A number of studies in the areas of Biomedical Engineering and Health Sciences have employed machine learning tools to develop methods capable of identifying patterns in different sets of data. Despite its extinction in many countries of the developed world, Hansen’s disease is still a disease that affects a huge part of the population in countries such as India and Brazil. In this context, this research proposes to develop a method that makes it possible to understand in the future how Hansen’s disease affects facial muscles. By using surface electromyography, a system was adapted so as to capture the signals from the largest possible number of facial muscles. We have first looked upon the literature to learn about the way researchers around the globe have been working with diseases that affect the peripheral neural system and how electromyography has acted to contribute to the understanding of these diseases. From these data, a protocol was proposed to collect facial surface electromyographic (sEMG) signals so that these signals presented a high signal to noise ratio. After collecting the signals, we looked for a method that would enable the visualization of this information in a way to make it possible to guarantee that the method used presented satisfactory results. After identifying the method's efficiency, we tried to understand which information could be extracted from the electromyographic signal representing the collected data. Once studies demonstrating which information could contribute to a better understanding of this pathology were not to be found in literature, parameters of amplitude, frequency and entropy were extracted from the signal and a feature selection was made in order to look for the features that better distinguish a healthy individual from a pathological one. After, we tried to identify the classifier that best discriminates distinct individuals from different groups, and also the set of parameters of this classifier that would bring the best outcome. It was identified that the protocol proposed in this study and the adaptation with disposable electrodes available in market proved their effectiveness and capability of being used in different studies whose intention is to collect data from facial electromyography. The feature selection algorithm also showed that not all of the features extracted from the signal are significant for data classification, with some more relevant than others. The classifier Support Vector Machine (SVM) proved itself efficient when the adequate Kernel function was used with the muscle from which information was to be extracted. Each investigated muscle presented different results when the classifier used linear, radial and polynomial kernel functions. Even though we have focused on Hansen’s disease, the method applied here can be used to study facial electromyography in other pathologies.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A number of studies in the areas of Biomedical Engineering and Health Sciences have employed machine learning tools to develop methods capable of identifying patterns in different sets of data. Despite its extinction in many countries of the developed world, Hansen’s disease is still a disease that affects a huge part of the population in countries such as India and Brazil. In this context, this research proposes to develop a method that makes it possible to understand in the future how Hansen’s disease affects facial muscles. By using surface electromyography, a system was adapted so as to capture the signals from the largest possible number of facial muscles. We have first looked upon the literature to learn about the way researchers around the globe have been working with diseases that affect the peripheral neural system and how electromyography has acted to contribute to the understanding of these diseases. From these data, a protocol was proposed to collect facial surface electromyographic (sEMG) signals so that these signals presented a high signal to noise ratio. After collecting the signals, we looked for a method that would enable the visualization of this information in a way to make it possible to guarantee that the method used presented satisfactory results. After identifying the method's efficiency, we tried to understand which information could be extracted from the electromyographic signal representing the collected data. Once studies demonstrating which information could contribute to a better understanding of this pathology were not to be found in literature, parameters of amplitude, frequency and entropy were extracted from the signal and a feature selection was made in order to look for the features that better distinguish a healthy individual from a pathological one. After, we tried to identify the classifier that best discriminates distinct individuals from different groups, and also the set of parameters of this classifier that would bring the best outcome. It was identified that the protocol proposed in this study and the adaptation with disposable electrodes available in market proved their effectiveness and capability of being used in different studies whose intention is to collect data from facial electromyography. The feature selection algorithm also showed that not all of the features extracted from the signal are significant for data classification, with some more relevant than others. The classifier Support Vector Machine (SVM) proved itself efficient when the adequate Kernel function was used with the muscle from which information was to be extracted. Each investigated muscle presented different results when the classifier used linear, radial and polynomial kernel functions. Even though we have focused on Hansen’s disease, the method applied here can be used to study facial electromyography in other pathologies.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Postprint

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Postprint

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Postprint

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Empirical studies of education programs and systems, by nature, rely upon use of student outcomes that are measurable. Often, these come in the form of test scores. However, in light of growing evidence about the long-run importance of other student skills and behaviors, the time has come for a broader approach to evaluating education. This dissertation undertakes experimental, quasi-experimental, and descriptive analyses to examine social, behavioral, and health-related mechanisms of the educational process. My overarching research question is simply, which inside- and outside-the-classroom features of schools and educational interventions are most beneficial to students in the long term? Furthermore, how can we apply this evidence toward informing policy that could effectively reduce stark social, educational, and economic inequalities?

The first study of three assesses mechanisms by which the Fast Track project, a randomized intervention in the early 1990s for high-risk children in four communities (Durham, NC; Nashville, TN; rural PA; and Seattle, WA), reduced delinquency, arrests, and health and mental health service utilization in adolescence through young adulthood (ages 12-20). A decomposition of treatment effects indicates that about a third of Fast Track’s impact on later crime outcomes can be accounted for by improvements in social and self-regulation skills during childhood (ages 6-11), such as prosocial behavior, emotion regulation and problem solving. These skills proved less valuable for the prevention of mental and physical health problems.

The second study contributes new evidence on how non-instructional investments – such as increased spending on school social workers, guidance counselors, and health services – affect multiple aspects of student performance and well-being. Merging several administrative data sources spanning the 1996-2013 school years in North Carolina, I use an instrumental variables approach to estimate the extent to which local expenditure shifts affect students’ academic and behavioral outcomes. My findings indicate that exogenous increases in spending on non-instructional services not only reduce student absenteeism and disciplinary problems (important predictors of long-term outcomes) but also significantly raise student achievement, in similar magnitude to corresponding increases in instructional spending. Furthermore, subgroup analyses suggest that investments in student support personnel such as social workers, health services, and guidance counselors, in schools with concentrated low-income student populations could go a long way toward closing socioeconomic achievement gaps.

The third study examines individual pathways that lead to high school graduation or dropout. It employs a variety of machine learning techniques, including decision trees, random forests with bagging and boosting, and support vector machines, to predict student dropout using longitudinal administrative data from North Carolina. I consider a large set of predictor measures from grades three through eight including academic achievement, behavioral indicators, and background characteristics. My findings indicate that the most important predictors include eighth grade absences, math scores, and age-for-grade as well as early reading scores. Support vector classification (with a high cost parameter and low gamma parameter) predicts high school dropout with the highest overall validity in the testing dataset at 90.1 percent followed by decision trees with boosting and interaction terms at 89.5 percent.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Hypertrophic cardiomyopathy (HCM) is a cardiovascular disease where the heart muscle is partially thickened and blood flow is - potentially fatally - obstructed. It is one of the leading causes of sudden cardiac death in young people. Electrocardiography (ECG) and Echocardiography (Echo) are the standard tests for identifying HCM and other cardiac abnormalities. The American Heart Association has recommended using a pre-participation questionnaire for young athletes instead of ECG or Echo tests due to considerations of cost and time involved in interpreting the results of these tests by an expert cardiologist. Initially we set out to develop a classifier for automated prediction of young athletes’ heart conditions based on the answers to the questionnaire. Classification results and further in-depth analysis using computational and statistical methods indicated significant shortcomings of the questionnaire in predicting cardiac abnormalities. Automated methods for analyzing ECG signals can help reduce cost and save time in the pre-participation screening process by detecting HCM and other cardiac abnormalities. Therefore, the main goal of this dissertation work is to identify HCM through computational analysis of 12-lead ECG. ECG signals recorded on one or two leads have been analyzed in the past for classifying individual heartbeats into different types of arrhythmia as annotated primarily in the MIT-BIH database. In contrast, we classify complete sequences of 12-lead ECGs to assign patients into two groups: HCM vs. non-HCM. The challenges and issues we address include missing ECG waves in one or more leads and the dimensionality of a large feature-set. We address these by proposing imputation and feature-selection methods. We develop heartbeat-classifiers by employing Random Forests and Support Vector Machines, and propose a method to classify full 12-lead ECGs based on the proportion of heartbeats classified as HCM. The results from our experiments show that the classifiers developed using our methods perform well in identifying HCM. Thus the two contributions of this thesis are the utilization of computational and statistical methods for discovering shortcomings in a current screening procedure and the development of methods to identify HCM through computational analysis of 12-lead ECG signals.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

There has been an increasing interest in the development of new methods using Pareto optimality to deal with multi-objective criteria (for example, accuracy and time complexity). Once one has developed an approach to a problem of interest, the problem is then how to compare it with the state of art. In machine learning, algorithms are typically evaluated by comparing their performance on different data sets by means of statistical tests. Standard tests used for this purpose are able to consider jointly neither performance measures nor multiple competitors at once. The aim of this paper is to resolve these issues by developing statistical procedures that are able to account for multiple competing measures at the same time and to compare multiple algorithms altogether. In particular, we develop two tests: a frequentist procedure based on the generalized likelihood-ratio test and a Bayesian procedure based on a multinomial-Dirichlet conjugate model. We further extend them by discovering conditional independences among measures to reduce the number of parameters of such models, as usually the number of studied cases is very reduced in such comparisons. Data from a comparison among general purpose classifiers is used to show a practical application of our tests.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Data mining can be defined as the extraction of implicit, previously un-known, and potentially useful information from data. Numerous re-searchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for Intrusion Detection and the modified versions of this dataset KDDCup99 and NSL-KDD, but until now no one have examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The compared classification learning algorithms in this thesis are: C4.5, CART, k-NN and Naïve Bayes. The performance of these algorithms are compared with accuracy, error rate and average cost on modified versions of NSL-KDD train and test dataset where the instances are classified into normal and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally the most important features to detect cyber-attacks in all categories and in each category are evaluated with Weka’s Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with best performance on the dataset is the k-NN algorithm. The most important features to detect cyber-attacks are basic features such as the number of seconds of a network connection, the protocol used for the connection, the network service used, normal or error status of the connection and the number of data bytes sent. The most important features to detect DoS, Probing and R2L attacks are basic features and the least important features are content features. Unlike U2R attacks, where the content features are the most important features to detect attacks.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In diesem Beitrag wird eine neue Methode zur Analyse des manuellen Kommissionierprozesses vorgestellt, mit der u. a. die Kommissionierzeitanteile automatisch erfasst werden können. Diese Methode basiert auf einer sensorgestützten Bewegungsklassifikation, wie sie bspw. im Sport oder in der Medizin Anwendung findet. Dabei werden mobile Sensoren genutzt, die fortlaufend Messwerte wie z. B. die Beschleunigung oder die Drehgeschwindigkeit des Kommissionierers aufzeichnen. Auf Basis dieser Daten können Informationen über die ausgeführten Bewegungen und insbesondere über die durchlaufenen Bewegungszustände gewonnen werden. Dieser Ansatz wird im vorliegenden Beitrag auf die Kommissionierung übertragen. Dazu werden zunächst Klassen relevanter Bewegungen identifiziert und anschließend mit Verfahren aus dem maschinellen Lernen verarbeitet. Die Klassifikation erfolgt nach dem Prinzip des überwachten Lernens. Dabei werden durchschnittliche Erkennungsraten von bis zu 78,94 Prozent erzielt.