35 results for Classification algorithm
Abstract:
Recent advances in machine learning increasingly enable the automatic construction of various types of computer-assisted methods that have been difficult or laborious to program by human experts. Tasks for which such tools are needed arise in many areas, here especially in bioinformatics and natural language processing. Machine learning methods may not work satisfactorily unless they are appropriately tailored to the task in question; however, their learning performance can often be improved by exploiting deeper insight into the application domain or the learning problem at hand. This thesis develops kernel-based learning algorithms that incorporate this kind of prior knowledge of the task in question in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines on specific tasks are presented. In kernel-based learning, prior knowledge is often incorporated by designing appropriate kernel functions; another well-known way is to develop cost functions that fit the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take into account the positional information and the mutual similarities of words, and we show that using this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suited to information retrieval and to more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of kernel-based learning algorithms, such as text categorization and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions.
We also design a fast cross-validation algorithm for regularized least-squares type learning algorithms. Further, an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks is proposed. In summary, we demonstrate that the incorporation of prior knowledge is both possible and beneficial, and that novel kernels and cost functions can be used efficiently in these algorithms.
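The abstract does not spell out the cross-validation shortcut, but for regularized least-squares (ridge) models a fast exact leave-one-out is a standard result: the held-out residual can be read off a single fit via the hat matrix, e_i = (y_i − ŷ_i)/(1 − H_ii). A minimal sketch under that standard assumption (function names are illustrative, not the thesis's):

```python
import numpy as np

def rls_loo_residuals(X, y, lam):
    """Exact leave-one-out residuals for regularized least squares,
    computed from a single fit via the hat-matrix identity
    e_i = (y_i - yhat_i) / (1 - H_ii)."""
    d = X.shape[1]
    # Hat matrix H = X (X'X + lam*I)^{-1} X'
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    yhat = H @ y
    return (y - yhat) / (1.0 - np.diag(H))

def rls_loo_residuals_naive(X, y, lam):
    """Reference implementation: retrain n times, one point held out each time."""
    n, d = X.shape
    out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        w = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(d),
                            X[mask].T @ y[mask])
        out[i] = y[i] - X[i] @ w
    return out
```

The shortcut replaces n retrainings with one matrix factorization, which is the flavor of speed-up the abstract refers to.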
Abstract:
Coherent anti-Stokes Raman scattering (CARS) is a powerful method of laser spectroscopy that has achieved significant successes. However, the non-linear nature of CARS complicates the analysis of the measured spectra. The objective of this thesis is to develop a new phase retrieval algorithm for CARS. It utilizes the maximum entropy method and a new wavelet approach for spectroscopic background correction of the phase function. The method was developed to be easily automated and used on a large number of spectra of different substances. The algorithm was successfully tested on experimental data.
Abstract:
Learning of preference relations has recently received significant attention in the machine learning community. It is closely related to classification and regression analysis and can be reduced to these tasks; however, preference learning involves predicting an ordering of the data points rather than a single numerical value, as in regression, or a class label, as in classification. Therefore, studying preference relations within a separate framework not only facilitates a better theoretical understanding of the problem, but also motivates the development of efficient algorithms for the task. Preference learning has many applications in domains such as information retrieval, bioinformatics, and natural language processing. For example, algorithms that learn to rank are frequently used in search engines for ordering the documents retrieved by a query. Preference learning methods have also been applied to collaborative filtering problems for predicting individual customer choices from the vast amount of user-generated feedback. In this thesis we propose several algorithms for learning preference relations. These algorithms stem from the well-founded and robust class of regularized least-squares methods and have many attractive computational properties. In order to improve the performance of our methods, we introduce several non-linear kernel functions. Thus, the contribution of this thesis is twofold: kernel functions for structured data that take advantage of various non-vectorial data representations, and preference learning algorithms suitable for different tasks, namely efficient learning of preference relations, learning with large amounts of training data, and semi-supervised preference learning. The proposed kernel-based algorithms and kernels are applied to the parse ranking task in natural language processing, document ranking in information retrieval, and remote homology detection in bioinformatics.
Training kernel-based ranking algorithms can be infeasible when the training set is large. This problem is addressed by proposing a preference learning algorithm whose computational complexity scales linearly with the number of training data points. We also introduce a sparse approximation of the algorithm that can be efficiently trained with large amounts of data. For situations where a small amount of labeled data but a large amount of unlabeled data is available, we propose a co-regularized preference learning algorithm. To conclude, the methods presented in this thesis address not only the efficient training of the algorithms but also fast regularization parameter selection, multiple-output prediction, and cross-validation. Furthermore, the proposed algorithms lead to notably better performance in many of the preference learning tasks considered.
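The pairwise reduction mentioned above can be illustrated with a deliberately simplified linear sketch: given preference pairs "i is preferred over j", fit a scoring vector by regularized least squares on score differences. This toy version is only an illustration of the idea; the thesis's actual algorithms are kernel-based and far more efficient, and all names here are hypothetical:

```python
import numpy as np

def fit_pairwise_rls(X, pairs, lam=1.0):
    """Fit a linear scoring function from preference pairs.
    pairs: list of (i, j) meaning example i is preferred over example j.
    Minimizes sum_((i,j)) (w.(x_i - x_j) - 1)^2 + lam*||w||^2,
    i.e. a regularized least-squares loss pushing preferred items
    to score at least one unit higher."""
    D = np.array([X[i] - X[j] for i, j in pairs])  # pairwise difference vectors
    d = X.shape[1]
    # Closed-form regularized least-squares solution on the differences
    return np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ np.ones(len(pairs)))
```

Ranking new items then amounts to sorting them by `X @ w`. Forming all difference vectors explicitly is quadratic in the worst case, which is exactly the cost the thesis's linearly scaling algorithm avoids.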
Abstract:
Software testing is one of the essential parts of the software engineering process. The objective of the study was to describe software testing tools and their use, and the thesis contains examples of software testing tool usage. The study was conducted as a literature study, with a focus on current software testing practices and quality assurance standards. A tool classifier was employed, and the testing tools presented in the study were classified according to it. We found that it is difficult to categorize currently available tools by a single testing activity, as many of them contain functionality that exceeds the scope of a single testing type.
Abstract:
The purpose of this study is to view credit risk from the financier's point of view in a theoretical framework. Results and aspects of previous studies on measuring credit risk with accounting-based scoring models are also examined. The theoretical framework and previous studies are then used to support the empirical analysis, which aims to develop a credit risk measure for a bank's internal use, or a risk management tool for a company to indicate its credit risk to the financier. The study covers a sample of Finnish companies from 12 different industries and four different company categories, and employs their accounting information from 2004 to 2008. The empirical analysis consists of a six-stage methodology process which uses measures of profitability, liquidity, capital structure, and cash flow to determine the financier's credit risk, define five significant risk classes, and produce a risk classification model. The study is confidential until 15.10.2012.
Abstract:
In the Russian Wholesale Market, electricity and capacity are traded separately. Capacity is a special good whose sale obliges suppliers to keep their generating equipment ready to produce the quantity of electricity indicated by the System Operator. Capacity trading was established to maintain reliable and uninterrupted delivery of electricity in the wholesale market. The price of capacity reflects ongoing investments in the construction, modernization, and maintenance of power plants. The sale of capacity thus creates favorable conditions for attracting investment into the energy sector, because it guarantees investors that their investments will be recouped.
Abstract:
Diabetes is a rapidly increasing worldwide problem which is characterised by defective metabolism of glucose that causes long-term dysfunction and failure of various organs. The most common complication of diabetes is diabetic retinopathy (DR), which is one of the primary causes of blindness and visual impairment in adults. The rapid increase of diabetes pushes the limits of current DR screening capabilities, for which digital imaging of the eye fundus (retinal imaging) and automatic or semi-automatic image analysis algorithms provide a potential solution. In this work, the use of colour in the detection of diabetic retinopathy is statistically studied using a supervised algorithm based on one-class classification and Gaussian mixture model estimation. The presented algorithm distinguishes a certain diabetic lesion type from all other possible objects in eye fundus images by estimating only the probability density function of that lesion type. For training and ground truth estimation, the algorithm combines manual annotations from several experts, for which the best practices were experimentally selected. By assessing the algorithm's performance in experiments with colour space selection, illuminance and colour correction, and background class information, the use of colour in the detection of diabetic retinopathy was quantitatively evaluated. Another contribution of this work is a benchmarking framework for eye fundus image analysis algorithms, needed for the development of automatic DR detection algorithms. The benchmarking framework provides guidelines on how to construct a benchmarking database that comprises true patient images, ground truth, and an evaluation protocol. The evaluation is based on standard receiver operating characteristic analysis and follows medical decision-making practice, providing protocols for image- and pixel-based evaluations.
During the work, two public medical image databases with ground truth were published: DIARETDB0 and DIARETDB1. The framework, the DR databases, and the final algorithm are made publicly available on the web to set baseline results for automatic detection of diabetic retinopathy. Although it deviates from the general context of the thesis, a simple and effective optic disc localisation method is also presented; optic disc localisation is discussed because normal eye fundus structures are fundamental in the characterisation of DR.
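The one-class idea described above, modelling only the density of the lesion class and thresholding the likelihood, can be sketched with a single-Gaussian simplification (the thesis uses a Gaussian mixture; this one-component numpy version, with illustrative names, only shows the decision rule):

```python
import numpy as np

def fit_gaussian(X):
    """Fit one multivariate Gaussian to positive-class samples
    (a one-component simplification of a GMM lesion-colour model).
    A small ridge keeps the covariance invertible."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mu, cov

def log_density(X, mu, cov):
    """Gaussian log-density of each row of X."""
    d = mu.size
    diff = X - mu
    maha = np.einsum('ij,ij->i', diff, np.linalg.solve(cov, diff.T).T)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def one_class_predict(X, mu, cov, threshold):
    """One-class decision: flag samples whose log-likelihood under the
    positive-class model exceeds the threshold; everything else is
    'all other possible objects', never modelled explicitly."""
    return log_density(X, mu, cov) >= threshold
```

The key property, matching the abstract, is that only the target class is modelled: the background never needs its own density estimate.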
Abstract:
Female sexual dysfunctions, including desire, arousal, orgasm, and pain problems, have been shown to be highly prevalent among women around the world. The etiology of these dysfunctions is unclear, but associations with health, age, psychological problems, and relationship factors have been identified. Genetic effects explain individual variation in orgasm function to some extent, but until now quantitative behavior genetic analyses have not been applied to other sexual functions. In addition, behavior genetics can be applied to exploring the cause of any observed comorbidity between the dysfunctions. Discovering more about the etiology of the dysfunctions may further improve the classification systems, which are currently under intense debate. The aims of the present thesis were to evaluate the psychometric properties of a Finnish-language version of a commonly used questionnaire for measuring female sexual function, the Female Sexual Function Index (FSFI); to investigate prevalence, comorbidity, and classification; and to explore the balance of genetic and environmental factors in the etiology, as well as the associations of a number of biopsychosocial factors with female sexual functions. Female sexual functions were studied through survey methods in a population-based sample of Finnish twins and their female siblings. There were two waves of data collection: the first targeted 5,000 female twins aged 33–43 years and the second 7,680 female twins aged 18–33 and their over-18-year-old female siblings (n = 3,983). There was no overlap between the data collections. The combined overall response rate for both data collections was 53% (n = 8,868), with a better response rate in the second (57%) than in the first (45%). In order to measure female sexual function, the FSFI was used.
It includes 19 items which measure female sexual function during the previous four weeks in six subdomains: desire, subjective arousal, lubrication, orgasm, sexual satisfaction, and pain. In line with earlier research in clinical populations, a six-factor solution of the Finnish-language version of the FSFI received support. The internal consistencies of the scales were good to excellent. Some questions were raised about how to avoid overestimating the prevalence of extreme dysfunctions, because women are allocated a score of zero if they have had no sexual activity during the preceding four weeks. The prevalence of female sexual dysfunctions per se ranged from 11% for lubrication dysfunction to 55% for desire dysfunction. The prevalence rates for sexual dysfunction with concomitant sexual distress, in other words, sexual disorders, were notably lower, ranging from 7% for lubrication disorder to 23% for desire disorder. The comorbidity between the dysfunctions was substantial, most notably between arousal and lubrication dysfunction, even though these two dysfunctions showed distinct patterns of associations with the other dysfunctions. Genetic influences on individual variation in the six subdomains of the FSFI were modest but significant, ranging from 3–11% for additive genetic effects and 5–18% for nonadditive genetic effects. The rest of the variation in sexual functions was explained by nonshared environmental influences. A correlated factor model including additive and nonadditive genetic effects and nonshared environmental effects had the best fit. All in all, every correlation between the genetic factors was significant except that between lubrication and pain. All correlations between the nonshared environment factors were significant, showing that there is substantial overlap in genetic and nonshared environmental influences between the dysfunctions.
In general, psychological problems, poor satisfaction with the relationship, sexual distress, and poor partner compatibility were associated with more sexual dysfunctions. Age was confounded with relationship length, but over and above relationship length it had a negative effect on desire and sexual satisfaction and a positive effect on orgasm and pain functions. Alcohol consumption in general was associated with better desire, arousal, lubrication, and orgasm function. Women pregnant with their first child had fewer pain problems than nulliparous nonpregnant women, whereas multiparous pregnant women had more orgasm problems than multiparous nonpregnant women. Having children was associated with fewer orgasm and pain problems. The conclusions were that desire, subjective arousal, lubrication, orgasm, sexual satisfaction, and pain are separate entities that have distinct associations with a number of different biopsychosocial factors. However, there is also considerable comorbidity between the dysfunctions, which is explained by overlap in additive genetic, nonadditive genetic, and nonshared environmental influences. Sexual dysfunctions are highly prevalent and are not always associated with sexual distress; this relationship may be moderated by a good relationship and compatibility with the partner. Regarding classification, the results support separate diagnoses for subjective arousal and genital arousal, as well as the inclusion of pain under sexual dysfunctions.
Abstract:
In this work a fuzzy linear system is used to solve the Leontief input-output model with fuzzy entries. For solving this model, we assume that the consumption matrix for the different sectors of the economy and the demand are known. These assumptions depend heavily on information obtained from the industries, and hence uncertainties are involved in this information. The aim of this work is to model these uncertainties and to represent them by fuzzy entries such as fuzzy numbers and LR-type fuzzy numbers (triangular and trapezoidal). A fuzzy linear system is developed using the fuzzy data and solved using the Gauss-Seidel algorithm. Numerical examples show the efficiency of this algorithm. The famous example from Prof. Leontief, in which he solved the production levels for the U.S. economy in 1958, is also further analyzed.
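The iterative core of the approach can be shown with a crisp (non-fuzzy) sketch of Gauss-Seidel applied to a Leontief system (I − C)x = d; the two-sector consumption matrix and demand below are hypothetical, and the thesis applies the same iteration to the fuzzified system:

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b by Gauss-Seidel iteration: sweep through the rows,
    updating each x[i] in place using the most recent values."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x

# Hypothetical two-sector Leontief example: solve (I - C) x = d
C = np.array([[0.2, 0.3],
              [0.1, 0.4]])            # consumption (technical) coefficients
d = np.array([100.0, 50.0])           # final demand
x = gauss_seidel(np.eye(2) - C, d)    # production levels per sector
```

Since the column sums of C are below one, I − C is strictly diagonally dominant and the iteration converges; the fuzzy version applies the same sweep to the parametrized system.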
Abstract:
The doctoral thesis examines the solving ability of a number of solvers for optimization problems, and reveals a number of difficulties in making a fair solver comparison. In addition, some improvements made to one of the solvers, called GAMS/AlphaECP, are presented. Optimization, in this context, means finding the best possible solution to a problem. The class of problems examined can be characterized as hard to solve and occurs in several industrial areas. The goal has been to investigate whether there is a solver that is universally faster and finds solutions of higher quality than any of the other solvers. The commercial optimization system GAMS (General Algebraic Modeling System) and its extensive problem libraries have been used to compare solvers. The improvements presented were made to the GAMS/AlphaECP solver, which is based on the Extended Cutting Plane (ECP) method. The ECP method has been developed mainly by Professor Tapio Westerlund at Anläggnings- och systemteknik at Åbo Akademi University.
Abstract:
Machine learning provides tools for the automated construction of predictive models in data-intensive areas of engineering and science. The family of regularized kernel methods has in recent years become one of the mainstream approaches to machine learning, due to a number of advantages the methods share: the approach provides theoretically well-founded solutions to the problems of under- and overfitting, allows learning from structured data, and has been empirically demonstrated to yield high predictive performance on a wide range of application domains. Historically, the problems of classification and regression have received the majority of attention in the field. In this thesis we focus on another type of learning problem: learning to rank. In learning to rank, the aim is to learn, from a set of past observations, a ranking function that can order new objects according to how well they match some underlying criterion of goodness. As an important special case of the setting, we recover the bipartite ranking problem, corresponding to maximizing the area under the ROC curve (AUC) in binary classification. Ranking applications appear in a large variety of settings; examples encountered in this thesis include document retrieval in web search, recommender systems, information extraction, and automated parsing of natural language. We consider the pairwise approach to learning to rank, where ranking models are learned by minimizing the expected probability of ranking any two randomly drawn test examples incorrectly. The development of computationally efficient kernel methods based on this approach has in the past proven challenging. Moreover, it is not clear which techniques for estimating the predictive performance of learned models are the most reliable in the ranking setting, or how these techniques can be implemented efficiently. The contributions of this thesis are as follows.
First, we develop RankRLS, a computationally efficient kernel method for learning to rank that is based on minimizing a regularized pairwise least-squares loss. In addition to training methods, we introduce a variety of algorithms for tasks such as model selection, multi-output learning, and cross-validation, based on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm, one of the best-established methods for learning to rank. Third, we study the combination of the empirical kernel map and reduced set approximation, which allows the large-scale training of kernel machines using linear solvers, and propose computationally efficient solutions to cross-validation when using this approach. Next, we explore the problem of reliable cross-validation when using AUC as a performance criterion, through an extensive simulation study, and demonstrate that the proposed leave-pair-out cross-validation approach leads to more reliable performance estimation than commonly used alternatives. Finally, we present a case study on applying machine learning to information extraction from biomedical literature, which combines several of the approaches considered in the thesis. The thesis is divided into two parts: Part I provides the background for the research work and summarizes the most central results, while Part II consists of the five original research articles that are the main contribution of this thesis.
Abstract:
In this study, feature selection in classification problems is highlighted. The role of feature selection methods is to select important features by discarding redundant and irrelevant features in the data set; we investigated this by using fuzzy entropy measures. We developed a fuzzy entropy based feature selection method using Yu's similarity and tested it with a similarity classifier, also based on Yu's similarity, on a real-world dermatological data set. By performing feature selection based on fuzzy entropy measures before classification, the empirical results were very promising: the highest classification accuracy of 98.83% was achieved when applying our similarity measure to the data set. The achieved results were then compared with results previously obtained using different similarity classifiers, and they show better accuracy than those achieved before. The methods used helped to reduce the dimensionality of the data set and to speed up the computation time of the learning algorithm, thereby simplifying the classification task.
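The mechanism of entropy-based feature ranking can be sketched as follows: features whose similarity to a class ideal vector is close to 0 or 1 across samples (low fuzzy entropy) separate the classes crisply and are kept, while high-entropy features are discarded. The similarity and entropy measures below (class-mean similarity, De Luca-Termini entropy) are simplified stand-ins, not the exact Yu-similarity construction used in the thesis:

```python
import numpy as np

def fuzzy_entropy(mu, eps=1e-12):
    """De Luca-Termini fuzzy entropy of a membership vector in [0, 1];
    zero for crisp memberships, maximal at mu = 0.5."""
    mu = np.clip(mu, eps, 1 - eps)
    return -np.sum(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))

def rank_features_by_entropy(X, y):
    """Score each feature by the fuzzy entropy of its similarity to each
    class mean ('ideal vector'); low total entropy means the feature
    separates the classes crisply, so it is ranked first."""
    X = (X - X.min(0)) / (X.max(0) - X.min(0))   # scale features to [0, 1]
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        ideal = X[y == c].mean(axis=0)           # class ideal vector
        sim = 1.0 - np.abs(X - ideal)            # similarity per sample/feature
        scores += np.array([fuzzy_entropy(sim[:, k]) for k in range(X.shape[1])])
    return np.argsort(scores)                    # most informative first
```

Dropping the features at the end of this ranking is what reduces dimensionality and speeds up the subsequent classifier, as the abstract reports.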
Abstract:
The purpose of the thesis is to classify suppliers and to enhance strategic purchasing in the case company. Supplier classification is conducted to fulfill the requirements of the company quality manual and international quality standards. To gain more benefit, the strategic purchasing tools of Kraljic's purchasing portfolio and the analytic hierarchy process are utilized as the basis for supplier classification. The purchasing portfolio is used to give a quick and easy visual insight into product group management from the viewpoint of purchasing. Based on the purchasing portfolio, alternative purchasing and supplier strategies can be formed that enhance the strategic orientation of purchasing; thus, the purchasing portfolio steers the company toward proactive and strategic purchasing. As a result, a survey method for implementing the purchasing portfolio in the company is developed that exploits the analytic hierarchy process. Experts from the company appoint the categorization criteria and, in addition, participate in the survey to categorize product groups on the portfolio. Alternative purchasing strategies are formed, and suppliers are classified depending on the importance and characteristics of the product groups they supply.
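The analytic hierarchy process step referred to above derives criteria weights as the principal eigenvector of a pairwise comparison matrix, which is the standard AHP computation. The criteria and judgments in this sketch are hypothetical, not the case company's:

```python
import numpy as np

def ahp_weights(M):
    """Priority weights from an AHP pairwise comparison matrix via its
    principal eigenvector, normalized to sum to one."""
    vals, vecs = np.linalg.eig(M)
    k = np.argmax(vals.real)            # Perron (dominant) eigenvalue
    w = np.abs(vecs[:, k].real)
    return w / w.sum()

# Hypothetical judgments on Saaty's 1-9 scale for three purchasing
# criteria (e.g. supply risk vs. profit impact vs. purchase volume):
M = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
weights = ahp_weights(M)
```

Product groups are then scored against each criterion and placed on the portfolio using these weights; in a full AHP one would also check the consistency ratio of M before trusting the weights.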