776 results for Machine learning methods


Relevance:

100.00%

Publisher:

Abstract:

The main subject of this master's thesis was predicting the diffusion of innovations. The prediction was done in a special case: a product has been available in some countries, and based on its diffusion in those countries a prediction is made for other countries. The prediction was based on finding similar countries with a Self-Organizing Map (SOM), using country parameters consisting of various economic and social key figures. The SOM was optimised for different products using two methods: (a) adding diffusion information of the products to the country parameters, and (b) weighting the country parameters based on their importance for the diffusion of different products. A novel method using Differential Evolution (DE) was developed to solve the latter, highly non-linear optimisation problem. The results were fairly good. The prediction method appears to rest on a solid theoretical foundation, and the results based on country data were good. However, optimisation for different products did not generally offer a clear benefit, although in some cases the improvement was clearly noticeable. The weights found for the country parameters with the developed SOM optimisation method were interesting, and most of them could be explained by properties of the products.
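A rough illustration of approach (b), assuming made-up country data: parameter weights are searched with SciPy's Differential Evolution so that a weighted nearest-neighbour prediction of diffusion (standing in here for the SOM-based similarity search) fits the countries where diffusion is already known.

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)

# Hypothetical data: rows = countries, columns = economic/social indicators.
X_known = rng.normal(size=(30, 5))        # countries where diffusion is observed
y_known = rng.random(30)                  # observed diffusion level per country
X_new = rng.normal(size=(4, 5))           # countries to predict

def predict(weights, X_train, y_train, X_query, k=3):
    """Predict diffusion as the mean over the k most similar countries,
    with similarity measured by a weighted Euclidean distance."""
    d = np.sqrt(((X_query[:, None, :] - X_train[None, :, :]) ** 2 * weights).sum(-1))
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def loss(weights):
    """Leave-one-out error on the known countries for a given weighting."""
    errs = []
    for i in range(len(X_known)):
        mask = np.arange(len(X_known)) != i
        pred = predict(weights, X_known[mask], y_known[mask], X_known[i:i + 1])
        errs.append((pred[0] - y_known[i]) ** 2)
    return float(np.mean(errs))

# Differential Evolution searches the (highly non-linear) weight space.
result = differential_evolution(loss, bounds=[(0.0, 1.0)] * X_known.shape[1],
                                seed=0, maxiter=50, tol=1e-6)
print("parameter weights:", np.round(result.x, 3))
print("predicted diffusion for new countries:",
      np.round(predict(result.x, X_known, y_known, X_new), 3))
```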

Relevance:

100.00%

Publisher:

Abstract:

Fluent health information flow is critical for clinical decision-making. However, a considerable part of this information is free-form text, and an inability to utilize it creates risks to patient safety and cost-effective hospital administration. Methods for automated processing of clinical text are emerging. The aim of this doctoral dissertation is to study machine learning and clinical text in order to support health information flow. First, by analyzing the content of authentic patient records, the aim is to specify clinical needs in order to guide the development of machine learning applications. The contributions are a model of the ideal information flow, a model of the problems and challenges in reality, and a road map for the technology development. Second, by developing applications for practical cases, the aim is to concretize ways to support health information flow. Altogether five machine learning applications for three practical cases are described: the first two applications are binary classification and regression for the practical case of topic labeling and relevance ranking; the third and fourth applications are supervised and unsupervised multi-class classification for the practical case of topic segmentation and labeling; these four applications are tested with Finnish intensive care patient records. The fifth application is multi-label classification for the practical task of diagnosis coding and is tested with English radiology reports. The performance of all these applications is promising. Third, the aim is to study how the quality of machine learning applications can be reliably evaluated. The associations between performance evaluation measures and methods are addressed, and a new hold-out method is introduced. This method contributes not only to processing time but also to evaluation diversity and quality. The main conclusion is that developing machine learning applications for text requires interdisciplinary, international collaboration. Practical cases are very different, and hence development must begin from genuine user needs and domain expertise. The technological expertise must cover linguistics, machine learning, and information systems. Finally, the methods must be evaluated both statistically and through authentic user feedback.
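As a hedged sketch of what the fifth application (multi-label diagnosis coding of radiology reports) might look like with off-the-shelf tools — the report snippets and codes below are invented, and the dissertation's actual pipeline is not reproduced here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical free-text reports and their diagnosis codes (multi-label).
reports = [
    "chest x-ray shows right lower lobe consolidation",
    "no acute cardiopulmonary abnormality",
    "cardiomegaly with mild pulmonary edema",
    "left pleural effusion, no pneumothorax",
]
codes = [["pneumonia"], ["normal"], ["cardiomegaly", "edema"], ["effusion"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(codes)

# One binary classifier per code over TF-IDF features of the report text.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(reports, Y)

pred = model.predict(["small left pleural effusion noted"])
print(mlb.inverse_transform(pred))
```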

Relevance:

100.00%

Publisher:

Abstract:

Learning Disability (LD) is a general term that describes specific kinds of learning problems. It is a neurological condition that affects a child's brain and impairs the ability to carry out one or more specific tasks. Children with learning disabilities are neither slow nor intellectually disabled. The disorder can make it difficult for a child to learn as quickly, or in the same way, as a child who is not affected by a learning disability. An affected child can have normal or above-average intelligence, but may have difficulty paying attention, with reading or letter recognition, or with mathematics. Children who have learning disabilities are not less intelligent; in fact, many are more intelligent than the average child. Learning disabilities vary from child to child: one child with LD may not have the same kind of learning problems as another. There is no cure for learning disabilities and they are life-long. However, children with LD can be high achievers and can be taught ways to work around the learning disability. In this research work, data mining using machine learning techniques is used to analyze the symptoms of LD, establish interrelationships between them, and evaluate the relative importance of these symptoms. To increase the diagnostic accuracy of learning disability prediction, a knowledge-based tool with high accuracy, built on statistical machine learning and data mining techniques and on knowledge obtained from clinical information, is proposed. The basic idea of the developed tool is to increase the accuracy of learning disability assessment and to reduce the time it requires. Different statistical machine learning techniques in data mining are used in the study. Identifying the important parameters of LD prediction, uncovering the hidden relationships between the symptoms of LD, and estimating the relative significance of each symptom are also among the objectives of this research work. The developed tool has many advantages compared to the traditional use of checklists in determining learning disabilities. To improve the performance of various classifiers, preprocessing methods were developed for the LD prediction system. A new system based on fuzzy and rough set models is also developed for LD prediction, and the importance of preprocessing is studied there as well. A Graphical User Interface (GUI) is designed to provide an integrated knowledge-based tool for predicting LD as well as its degree. The tool stores the details of the children in a student database and retrieves their LD reports as required. The present study demonstrates the effectiveness of the tool developed with various machine learning techniques; it identifies the important parameters of LD and accurately predicts learning disability in school-age children. This thesis makes several contributions in technical, general, and social areas. The results are beneficial to parents, teachers, and institutions, who are able to identify a child's problem at an early stage and seek appropriate treatment or counseling in time to avoid academic and social losses.
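A minimal sketch of the kind of analysis described, assuming an invented symptom checklist: a random forest is cross-validated as an LD predictor, and its feature importances stand in for the relative significance of the symptoms. The column names and labels are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Hypothetical checklist data: each row is a child, each column a binary symptom.
symptoms = ["reading_difficulty", "letter_reversal", "poor_attention",
            "math_difficulty", "slow_writing"]
X = rng.integers(0, 2, size=(200, len(symptoms)))
# Made-up rule standing in for clinical labels, just to have something to fit.
y = ((X[:, 0] & X[:, 1]) | X[:, 3]).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("cross-validated accuracy:",
      cross_val_score(clf, X, y, cv=5).mean().round(3))

# Relative importance of each symptom for the prediction.
clf.fit(X, y)
for name, imp in sorted(zip(symptoms, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```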

Relevance:

100.00%

Publisher:

Abstract:

The increasing interconnection of information and communication systems leads to a further increase in complexity and thus also to a further increase in security vulnerabilities. Classical protection mechanisms such as firewall systems and anti-malware solutions have long ceased to provide sufficient protection against intrusions into IT infrastructures. Intrusion Detection Systems (IDS) have established themselves as a very effective instrument for protection against cyber attacks. Such systems collect and analyze information from network components and hosts in order to automatically detect unusual behavior and security violations. While signature-based approaches can only detect already known attack patterns, anomaly-based IDS are also able to recognize new, previously unknown attacks (zero-day attacks) at an early stage. The core problem of intrusion detection systems, however, lies in the optimal processing of the enormous volume of network data and in the development of an adaptive detection model that works in real time. To address these challenges, this dissertation provides a framework consisting of two main parts. The first part, called OptiFilter, uses a dynamic queuing concept to process the continuously arriving network data, continuously assembles network connections, and exports structured input data for the IDS. The second part is an adaptive classifier comprising a classifier model based on an Enhanced Growing Hierarchical Self-Organizing Map (EGHSOM), a model of normal network behavior (NNB), and an update model. In OptiFilter, tcpdump and SNMP traps are used to continuously aggregate network packets and host events. These aggregated network packets and host events are further analyzed and converted into connection vectors. To improve the detection rate of the adaptive classifier, the artificial neural network GHSOM is studied in depth and substantially extended. Different approaches are proposed and discussed in this dissertation: a classification-confidence margin threshold is defined to uncover unknown malicious connections, the stability of the growing topology is increased by novel approaches for initializing the weight vectors and by strengthening the winner neurons, and a self-adaptive procedure is introduced to keep the model continuously up to date. Furthermore, the main task of the NNB model is the further examination of the unknown connections detected by the EGHSOM and the verification of whether they are normal. However, the network traffic data changes constantly due to the concept-drift phenomenon, which in real time leads to the generation of non-stationary network data. This phenomenon is handled by the update model. The EGHSOM model can effectively detect new anomalies, and the NNB model adapts optimally to changes in the network data. In the experimental evaluations the framework showed promising results. In the first experiment the framework was evaluated in offline mode: OptiFilter was evaluated with offline, synthetic, and realistic data, and the adaptive classifier was evaluated with 10-fold cross-validation to estimate its accuracy. In the second experiment the framework was installed on a 1 to 10 GB network link and evaluated online in real time. OptiFilter successfully converted the enormous amount of network data into structured connection vectors, and the adaptive classifier classified them precisely. A comparative study between the developed framework and other well-known IDS approaches shows that the proposed IDS framework outperforms all of them. This can be attributed to the following key points: processing of the collected network data, achieving the best performance (such as overall accuracy), detecting unknown connections, and developing a real-time intrusion detection model.
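The full EGHSOM with its growing hierarchy, NNB model and update model is far more elaborate; the sketch below only illustrates the basic idea of flagging unknown connections with a flat self-organizing map and a margin threshold on the distance to the best matching unit. All traffic data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(8, 8), epochs=20, lr0=0.5, sigma0=3.0):
    """Train a flat self-organizing map on the given connection vectors."""
    h, w = grid
    weights = rng.normal(size=(h, w, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), -1)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Best matching unit (winner neuron).
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Decaying learning rate and neighbourhood radius.
            frac = step / n_steps
            lr = lr0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 0.5
            dist2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            influence = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            weights += lr * influence * (x - weights)
            step += 1
    return weights

def anomaly_scores(weights, data):
    """Distance of each vector to its best matching unit (quantization error)."""
    d = np.linalg.norm(weights[None] - data[:, None, None, :], axis=-1)
    return d.reshape(len(data), -1).min(axis=1)

# Hypothetical "normal traffic" connection vectors and a few outliers.
normal = rng.normal(0, 1, size=(500, 6))
attacks = rng.normal(4, 1, size=(5, 6))

som = train_som(normal)
# Margin threshold chosen from the distribution of scores on normal traffic.
threshold = np.quantile(anomaly_scores(som, normal), 0.99)
flags = anomaly_scores(som, attacks) > threshold
print("flagged as unknown/malicious:", flags)
```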

Relevance:

100.00%

Publisher:

Abstract:

In a global economy, manufacturers compete mainly on the cost efficiency of production, as the prices of raw materials are similar worldwide. Heavy industry has two big issues to deal with: on the one hand, there is a lot of data that needs to be analyzed effectively; on the other, making big improvements through investments in corporate structure or new machinery is neither economically nor physically viable. Machine learning offers a promising way for manufacturers to address both problems, as they are in an excellent position to employ learning techniques on their massive store of historical production data. However, choosing a modelling strategy in this setting is far from trivial, and that is the objective of this article. The article investigates characteristics of the most popular classifiers used in industry today: Support Vector Machines, Multilayer Perceptrons, Decision Trees, Random Forests, and the meta-algorithms Bagging and Boosting are the main methods investigated in this work. Lessons from real-world implementations of these learners are provided, together with guidance on when different learners are expected to perform well. The importance of feature selection, and of relevant selection methods, in an industrial setting is further investigated. Performance metrics are also discussed for the sake of completeness.
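A minimal sketch of the kind of comparison discussed, using scikit-learn implementations of the listed classifiers on a synthetic stand-in for historical production data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for historical production data (quality OK / not OK).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(max_iter=2000, random_state=0)),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Boosting": AdaBoostClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:13s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```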

Relevance:

100.00%

Publisher:

Abstract:

Surveillance Levels (SLs) are categories for medical patients (used in Brazil) that represent different types of medical recommendations. SLs are defined according to risk factors and the medical and developmental history of patients, and each SL is associated with specific educational and clinical measures. The objective of the present paper is to verify computer-aided, automatic assignment of SLs; to this end, it proposes a computer-aided approach for the automatic recommendation of SLs based on the classification of information from patient electronic records. For this purpose, a software architecture composed of three layers was developed. The architecture includes a classification layer that comprises a linguistic module and machine learning classification modules, and it allows different classification methods to be used, including the use of preprocessed, normalized language data drawn from the linguistic module. We report the verification and validation of the software architecture in a Brazilian pediatric healthcare institution. The results indicate that the selection of attributes can have a great effect on the performance of the system. Nonetheless, our automatic recommendation of surveillance levels can still benefit from improvements in processing procedures when the linguistic module is applied prior to classification. Results from our efforts can be applied to different types of medical systems, and systems supported by the framework presented in this paper may be used by healthcare and governmental institutions to improve healthcare services by establishing preventive measures and alerting authorities to the possibility of an epidemic.
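A hedged sketch of a two-layer idea similar in spirit to the architecture described: a small linguistic normalization step feeding attribute selection and a classifier. The record snippets, levels, and parameter choices are invented.

```python
import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def normalize(text):
    """Toy stand-in for the linguistic module: lowercase and strip accents."""
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in text if not unicodedata.combining(c))

# Hypothetical record snippets and their surveillance levels.
records = ["recém-nascido prematuro com baixo peso",
           "criança saudável, consulta de rotina",
           "histórico de internação por bronquiolite",
           "vacinação em dia, sem intercorrências"]
levels = ["SL3", "SL1", "SL2", "SL1"]

pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=normalize),
    SelectKBest(chi2, k=8),           # attribute selection before classification
    LinearSVC(),
)
pipeline.fit(records, levels)
print(pipeline.predict(["criança com histórico de internação"]))
```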

Relevance:

100.00%

Publisher:

Abstract:

Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in the form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents into previously defined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a sizeable training set and notable computational effort. Methods for cross-domain text categorization have been proposed that allow a set of labeled documents from one domain to be leveraged to classify those of another. Most methods use advanced statistical techniques, usually involving the tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy on most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between their representative words, which are identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. Results show that classification accuracy still requires improvement, but models generated from one domain are shown to be effectively reusable in a different one.
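A minimal sketch of the nearest-centroid idea, assuming tiny invented corpora: category profiles are built from the labeled (source) domain and then iteratively re-estimated on the unlabeled (target) domain. The actual method in the thesis is more refined; this only conveys the mechanism.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical labeled documents from a source domain and unlabeled ones
# from a target domain, sharing the same two topic categories.
source_docs = ["stock markets fell sharply today",
               "the central bank raised interest rates",
               "the team won the championship final",
               "the striker scored twice in the match"]
source_labels = np.array([0, 0, 1, 1])        # 0 = finance, 1 = sport
target_docs = ["bank stocks fell after the interest rate decision",
               "the team scored early in the final"]

vec = TfidfVectorizer(stop_words="english")
Xs = vec.fit_transform(source_docs).toarray()
Xt = vec.transform(target_docs).toarray()

# Initial category profiles (centroids) from the known domain.
centroids = np.vstack([Xs[source_labels == c].mean(axis=0) for c in (0, 1)])

# Iteratively adapt the profiles to the unknown domain:
# classify the target documents, then recompute the centroids from them.
for _ in range(5):
    labels = cosine_similarity(Xt, centroids).argmax(axis=1)
    for c in (0, 1):
        if np.any(labels == c):
            centroids[c] = Xt[labels == c].mean(axis=0)

print("target labels:", labels)
```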

Relevance:

100.00%

Publisher:

Abstract:

This paper reports an investigation of machine learning (ML) classification techniques to assist in the problem of flash flood nowcasting. We have been building a Wireless Sensor Network (WSN) to collect measurements from a river located in an urban area. ML classification methods were investigated with the aim of enabling flash flood nowcasting, which in turn allows the WSN to issue alerts to the local population. We evaluated several types of ML, taking into account the different nowcasting stages (i.e. the number of future time steps to forecast). We also evaluated different data representations to be used as input to the ML techniques. The results show that different data representations can lead to significantly better results at different stages of nowcasting.
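A rough sketch of the kind of evaluation described, with a synthetic river-level series: classifiers are scored for several nowcasting stages (forecast horizons) and two input representations of the sensor window.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic river-level series standing in for WSN measurements.
level = np.cumsum(rng.normal(0, 1, 2000)) + 50
flood = (level > np.quantile(level, 0.9)).astype(int)   # made-up flood flag

def make_dataset(series, labels, window=12, horizon=3, diff=False):
    """Sliding-window inputs; target is the flood flag `horizon` steps ahead."""
    X, y = [], []
    for t in range(window, len(series) - horizon):
        w = series[t - window:t]
        X.append(np.diff(w) if diff else w)
        y.append(labels[t + horizon])
    return np.array(X), np.array(y)

for horizon in (1, 3, 6):                 # nowcasting stages (steps ahead)
    for diff in (False, True):            # two data representations
        X, y = make_dataset(level, flood, horizon=horizon, diff=diff)
        acc = cross_val_score(RandomForestClassifier(random_state=0),
                              X, y, cv=5).mean()
        rep = "differenced" if diff else "raw levels"
        print(f"horizon={horizon:>2}  {rep:11s}  accuracy={acc:.3f}")
```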

Relevance:

100.00%

Publisher:

Abstract:

This paper presents a shallow dialogue analysis model aimed at human-human dialogues in the context of staff or business meetings. Four components of the model are defined, and several machine learning techniques are used to extract features from dialogue transcripts: maximum entropy classifiers for dialogue acts, latent semantic analysis for topic segmentation, and decision tree classifiers for discourse markers. A rule-based approach is proposed for resolving cross-modal references to meeting documents. The methods are trained and evaluated using a common data set and annotation format. The integration of the components into an automated shallow dialogue parser opens the way to multimodal meeting processing and retrieval applications.
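As an illustration of one component, a maximum-entropy (multinomial logistic regression) dialogue-act classifier over toy utterances; the data and labels are invented and do not come from the paper's corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy utterances with dialogue-act labels (statement, question, agreement).
utterances = ["i think we should postpone the release",
              "what does the budget look like",
              "yes that sounds fine to me",
              "could you show the second slide",
              "the prototype is almost ready"]
acts = ["statement", "question", "agreement", "question", "statement"]

# Multinomial logistic regression is the usual maximum-entropy formulation.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(utterances, acts)
print(clf.predict(["do we agree on the schedule"]))
```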

Relevance:

100.00%

Publisher:

Abstract:

This paper discusses a novel hybrid approach to text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results of the previous classifier by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbors and a simple rule language for expressing lists of positive, negative, and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus, for comparison with other approaches, and categorization using IPTC metadata, the EUROVOC thesaurus, and others. Results show that this approach achieves a precision comparable to that of top-ranked methods, with the added value that it does not require a demanding human expert workload to train.
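A toy, two-category sketch of the hybrid idea: a k-NN base model over TF-IDF features, followed by hand-written positive/negative term rules that recover false negatives and filter false positives. The rule language and corpora here are invented, not the paper's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Base model: k-NN over TF-IDF features of a small labeled corpus.
train_docs = ["the match ended with a late goal",
              "parliament approved the new tax bill",
              "the league title race is wide open",
              "ministers debated the budget reform"]
train_cats = ["sport", "politics", "sport", "politics"]

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(train_docs, train_cats)

# Hand-written rules for a noisy category: positive terms force the label,
# negative terms veto it.
rules = {"sport": {"positive": {"goal", "league", "match"},
                   "negative": {"election", "minister"}}}

def classify(doc):
    """Apply the k-NN prediction, then let the rules correct it (two-category toy)."""
    label = knn.predict([doc])[0]
    words = set(doc.lower().split())
    r = rules["sport"]
    if words & r["positive"] and not words & r["negative"]:
        return "sport"                  # positive terms recover false negatives
    if label == "sport" and words & r["negative"]:
        return "politics"               # negative terms filter false positives
    return label

print(classify("the minister scored a political goal in the debate"))
print(classify("an injury-time goal settled the match"))
```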

Relevance:

100.00%

Publisher:

Abstract:

BACKGROUND: Clinical Trials (CTs) are essential for bridging the gap between experimental research on new drugs and their clinical application. Just as CTs for traditional drugs and biologics have helped accelerate the translation of biomedical findings into medical practice, CTs for nanodrugs and nanodevices could advance novel nanomaterials as agents for diagnosis and therapy. Although there is publicly available information about nanomedicine-related CTs, this information is archived online without criteria that discriminate between studies involving nanomaterials or nanotechnology-based processes (nano) and CTs that do not involve nanotechnology (non-nano). Determining whether nanodrugs and nanodevices were involved in a study from CT summaries alone is a challenging task. At the time of writing, CTs archived in the well-known online registry ClinicalTrials.gov cannot easily be told apart as nano or non-nano CTs, even by domain experts, due to the lack of both a common definition of nanotechnology and standards for reporting nanomedical experiments and results. METHODS: We propose a supervised learning approach for classifying CT summaries from ClinicalTrials.gov as nano or non-nano. Our method involves several stages: i) extraction and manual annotation of CTs as nano vs. non-nano, ii) pre-processing and automatic classification, and iii) performance evaluation using several state-of-the-art classifiers under different transformations of the original dataset. RESULTS AND CONCLUSIONS: The performance of the best automated classifier closely matches that of experts (AUC over 0.95), suggesting that it is feasible to automatically detect the presence of nanotechnology products in CT summaries with a high degree of accuracy. This can significantly speed up the process of finding whether reports on ClinicalTrials.gov might be relevant to a particular nanoparticle or nanodevice, which is essential for discovering any precedents for nanotoxicity events or advantages for targeted drug therapy.
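A hedged sketch of stages ii) and iii) with invented CT summary fragments: TF-IDF features, a standard classifier, and AUC as the evaluation measure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical CT summary fragments labeled nano (1) / non-nano (0).
summaries = ["liposomal doxorubicin nanoparticle formulation for solid tumors",
             "iron oxide nanoparticles as mri contrast agent",
             "behavioral therapy for chronic insomnia in adults",
             "comparison of two insulin dosing schedules",
             "gold nanoshell mediated photothermal ablation",
             "exercise program for patients after hip replacement"]
labels = [1, 1, 0, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    summaries, labels, test_size=2, stratify=labels, random_state=0)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```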

Relevance:

100.00%

Publisher:

Abstract:

We present a general approach to forming structure-activity relationships (SARs). This approach is based on representing chemical structure by atoms and their bond connectivities in combination with the inductive logic programming (ILP) algorithm PROGOL. Existing SAR methods describe chemical structure by using attributes which are general properties of an object. It is not possible to map chemical structure directly to attribute-based descriptions, as such descriptions have no internal organization. A more natural and general way to describe chemical structure is to use a relational description, where the internal construction of the description maps that of the object described. Our atom and bond connectivities representation is a relational description. ILP algorithms can form SARs with relational descriptions. We have tested the relational approach by investigating the SARs of 230 aromatic and heteroaromatic nitro compounds. These compounds had been split previously into two subsets, 188 compounds that were amenable to regression and 42 that were not. For the 188 compounds, a SAR was found that was as accurate as the best statistical or neural network-generated SARs. The PROGOL SAR has the advantages that it did not need the use of any indicator variables handcrafted by an expert, and the generated rules were easily comprehensible. For the 42 compounds, PROGOL formed a SAR that was significantly (P < 0.025) more accurate than linear regression, quadratic regression, and back-propagation. This SAR is based on an automatically generated structural alert for mutagenicity.
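ILP systems such as PROGOL work on Prolog-style relational descriptions rather than Python, but the flavour of the atom/bond representation and of a structural alert can be conveyed as follows; the molecule and the alert pattern are hypothetical.

```python
# Relational description of a made-up molecule: atom and bond facts,
# in the spirit of the atom/bond connectivity representation used with ILP.
atoms = {                      # atom_id -> (element, atom_type)
    1: ("c", "aromatic"), 2: ("c", "aromatic"), 3: ("n", "nitro"),
    4: ("o", "nitro_oxygen"), 5: ("o", "nitro_oxygen"),
}
bonds = {(1, 2, "aromatic"), (1, 3, "single"), (3, 4, "double"), (3, 5, "single")}

def bonded(a, b, bonds):
    """Symmetric bond relation derived from the bond facts."""
    return any({a, b} == {x, y} for x, y, _ in bonds)

def nitro_on_aromatic_carbon(atoms, bonds):
    """Hypothetical structural alert: a nitrogen carrying two nitro oxygens
    and bonded to an aromatic carbon."""
    for n, (elem, _typ) in atoms.items():
        if elem != "n":
            continue
        oxygens = [o for o, (e, _t) in atoms.items()
                   if e == "o" and bonded(n, o, bonds)]
        carbons = [c for c, (e, t) in atoms.items()
                   if e == "c" and t == "aromatic" and bonded(n, c, bonds)]
        if len(oxygens) >= 2 and carbons:
            return True
    return False

print("alert fires:", nitro_on_aromatic_carbon(atoms, bonds))
```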

Relevance:

100.00%

Publisher:

Abstract:

We present a machine learning-based system for automatically computing interpretable, quantitative measures of animal behavior. Through our interactive system, users encode their intuition about behavior by annotating a small set of video frames. These manual labels are converted into classifiers that can automatically annotate behaviors in screen-scale data sets. Our general-purpose system can create a variety of accurate individual and social behavior classifiers for different organisms, including mice and adult and larval Drosophila.
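A minimal sketch of the annotate-then-generalize workflow, assuming invented per-frame trajectory features: a classifier trained on a handful of manually labeled frames is used to annotate the rest automatically.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical per-frame trajectory features (speed, turning, distances, ...).
features = rng.normal(size=(5000, 8))
# A handful of frames the user has annotated interactively (1 = behavior present).
annotated = rng.choice(5000, size=60, replace=False)
labels = (features[annotated, 0] > 0.5).astype(int)   # made-up ground truth

clf = GradientBoostingClassifier(random_state=0)
clf.fit(features[annotated], labels)

# The trained classifier annotates every remaining frame automatically.
auto_labels = clf.predict(features)
print("frames labeled as the behavior:", int(auto_labels.sum()))
```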

Relevance:

100.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-06

Relevance:

100.00%

Publisher:

Abstract:

Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. Although visual methods developed in information visualization have been helpful, improved understanding of a complex, large, high-dimensional dataset requires an effective projection of the dataset onto a lower-dimensional (2D or 3D) manifold. This paper introduces a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain with visual techniques developed in the information visualization domain. The framework follows Shneiderman's mantra to provide an effective user interface, whose advantage is that the user is directly involved in the data mining process. We integrate principled projection methods, such as Generative Topographic Mapping (GTM) and Hierarchical GTM (HGTM), with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, billboarding, and user interaction facilities, to provide an integrated visual data mining framework. Results on a real-life high-dimensional dataset from the chemoinformatics domain are reported and discussed. Projection results of GTM are analytically compared with those from other traditional projection methods, and it is shown that the HGTM algorithm provides additional value for large datasets. The computational complexity of these algorithms is discussed to demonstrate their suitability for the visual data mining framework.
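GTM and HGTM are not part of scikit-learn, so the sketch below uses t-SNE purely as a stand-in projection to illustrate the combination of a machine-learning projection (overview) with an information-visualization view (parallel coordinates as detail); the dataset is a generic one, not the chemoinformatics data from the paper.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# High-dimensional dataset standing in for chemoinformatics descriptors.
data = load_wine()
X = StandardScaler().fit_transform(data.data)

# 2-D projection for the visual interface (t-SNE here as a stand-in for GTM).
emb = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
ax1.scatter(emb[:, 0], emb[:, 1], c=data.target, cmap="viridis", s=15)
ax1.set_title("2-D projection (overview)")

# Linked detail view: parallel coordinates of a few original dimensions.
df = pd.DataFrame(X[:, :6], columns=data.feature_names[:6])
df["class"] = data.target.astype(str)
parallel_coordinates(df, "class", ax=ax2, alpha=0.3)
ax2.set_title("Parallel coordinates (detail)")
plt.tight_layout()
plt.show()
```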