17 resultados para Data classification
em Doria (National Library of Finland DSpace Services) - National Library of Finland, Finland
Resumo:
Työssä käydään läpi tukivektorikoneiden teoreettista pohjaa sekä tutkitaan eri parametrien vaikutusta spektridatan luokitteluun.
Resumo:
The objective of this thesis is to develop and generalize further the differential evolution based data classification method. For many years, evolutionary algorithms have been successfully applied to many classification tasks. Evolution algorithms are population based, stochastic search algorithms that mimic natural selection and genetics. Differential evolution is an evolutionary algorithm that has gained popularity because of its simplicity and good observed performance. In this thesis a differential evolution classifier with pool of distances is proposed, demonstrated and initially evaluated. The differential evolution classifier is a nearest prototype vector based classifier that applies a global optimization algorithm, differential evolution, to determine the optimal values for all free parameters of the classifier model during the training phase of the classifier. The differential evolution classifier applies the individually optimized distance measure for each new data set to be classified is generalized to cover a pool of distances. Instead of optimizing a single distance measure for the given data set, the selection of the optimal distance measure from a predefined pool of alternative measures is attempted systematically and automatically. Furthermore, instead of only selecting the optimal distance measure from a set of alternatives, an attempt is made to optimize the values of the possible control parameters related with the selected distance measure. Specifically, a pool of alternative distance measures is first created and then the differential evolution algorithm is applied to select the optimal distance measure that yields the highest classification accuracy with the current data. After determining the optimal distance measures for the given data set together with their optimal parameters, all determined distance measures are aggregated to form a single total distance measure. The total distance measure is applied to the final classification decisions. The actual classification process is still based on the nearest prototype vector principle; a sample belongs to the class represented by the nearest prototype vector when measured with the optimized total distance measure. During the training process the differential evolution algorithm determines the optimal class vectors, selects optimal distance metrics, and determines the optimal values for the free parameters of each selected distance measure. The results obtained with the above method confirm that the choice of distance measure is one of the most crucial factors for obtaining higher classification accuracy. The results also demonstrate that it is possible to build a classifier that is able to select the optimal distance measure for the given data set automatically and systematically. After finding optimal distance measures together with optimal parameters from the particular distance measure results are then aggregated to form a total distance, which will be used to form the deviation between the class vectors and samples and thus classify the samples. This thesis also discusses two types of aggregation operators, namely, ordered weighted averaging (OWA) based multi-distances and generalized ordered weighted averaging (GOWA). These aggregation operators were applied in this work to the aggregation of the normalized distance values. The results demonstrate that a proper combination of aggregation operator and weight generation scheme play an important role in obtaining good classification accuracy. The main outcomes of the work are the six new generalized versions of previous method called differential evolution classifier. All these DE classifier demonstrated good results in the classification tasks.
Resumo:
Monissasovelluksissa on hyvin tärkeää vähentää valolähteen vaikutusta kohteen oikean värin havainnoimiseksi. Tämä on tarpeen mm. virtuaalisissa museoissa, telelääketieteessä, verkkokaupassa ja verkkorahassa. Tässä tutkielmassa on kehitetty tekniikkaa kirkkaiden heijastusten poistoon spektrikuvista. Työ sisältää katsauksen yleisen värillisen kuvan ymmärtämiseen, mihin perustuen analysoitiin erilaisia kirkkaiden heijastusten poistO'tekniikoita. Työssä kehitettiin uusi kirkkaiden heijastusten poistO'menetelmä, joka perustuu dikromaattiseen heijastus-malliin, joka kuvaa spektrisen datan objektin omaan väriin ja valaisevan valon väriin perustuen. Ehdotettu kirkkaiden heijastusten poistO'menetelmä hyödyntää erilaisia olemassaolevia menetelmiä, kuten pääkomponenttimenetelmää ja tiedon luokittelu-menetelmää. Yritys kehittää nopeasti toimiva algoritmi, joka myös suoriutuu tehtävästä hyvin, on onnistunut. Kokeet toteutettiin ehdotetun menetelmän mukaisesti ja toimivalla algoritmilla saatiin halutut lopputulokset. Edelleentyö sisältää ehdotuksia esitetyn algoritmin parantamiseksi.
Resumo:
This thesis studies the properties and usability of operators called t-norms, t-conorms, uninorms, as well as many valued implications and equivalences. Into these operators, weights and a generalized mean are embedded for aggregation, and they are used for comparison tasks and for this reason they are referred to as comparison measures. The thesis illustrates how these operators can be weighted with a differential evolution and aggregated with a generalized mean, and the kinds of measures of comparison that can be achieved from this procedure. New operators suitable for comparison measures are suggested. These operators are combination measures based on the use of t-norms and t-conorms, the generalized 3_-uninorm and pseudo equivalence measures based on S-type implications. The empirical part of this thesis demonstrates how these new comparison measures work in the field of classification, for example, in the classification of medical data. The second application area is from the field of sports medicine and it represents an expert system for defining an athlete's aerobic and anaerobic thresholds. The core of this thesis offers definitions for comparison measures and illustrates that there is no actual difference in the results achieved in comparison tasks, by the use of comparison measures based on distance, versus comparison measures based on many valued logical structures. The approach has been highly practical in this thesis and all usage of the measures has been validated mainly by practical testing. In general, many different types of operators suitable for comparison tasks have been presented in fuzzy logic literature and there has been little or no experimental work with these operators.
Resumo:
The main objective of this study was todo a statistical analysis of ecological type from optical satellite data, using Tipping's sparse Bayesian algorithm. This thesis uses "the Relevence Vector Machine" algorithm in ecological classification betweenforestland and wetland. Further this bi-classification technique was used to do classification of many other different species of trees and produces hierarchical classification of entire subclasses given as a target class. Also, we carried out an attempt to use airborne image of same forest area. Combining it with image analysis, using different image processing operation, we tried to extract good features and later used them to perform classification of forestland and wetland.
Resumo:
Luokittelujärjestelmää suunniteltaessa tarkoituksena on rakentaa systeemi, joka pystyy ratkaisemaan mahdollisimman tarkasti tutkittavan ongelma-alueen. Hahmontunnistuksessa tunnistusjärjestelmän ydin on luokitin. Luokittelun sovellusaluekenttä on varsin laaja. Luokitinta tarvitaan mm. hahmontunnistusjärjestelmissä, joista kuvankäsittely toimii hyvänä esimerkkinä. Myös lääketieteen parissa tarkkaa luokittelua tarvitaan paljon. Esimerkiksi potilaan oireiden diagnosointiin tarvitaan luokitin, joka pystyy mittaustuloksista päättelemään mahdollisimman tarkasti, onko potilaalla kyseinen oire vai ei. Väitöskirjassa on tehty similaarisuusmittoihin perustuva luokitin ja sen toimintaa on tarkasteltu mm. lääketieteen paristatulevilla data-aineistoilla, joissa luokittelutehtävänä on tunnistaa potilaan oireen laatu. Väitöskirjassa esitetyn luokittimen etuna on sen yksinkertainen rakenne, josta johtuen se on helppo tehdä sekä ymmärtää. Toinen etu on luokittimentarkkuus. Luokitin saadaan luokittelemaan useita eri ongelmia hyvin tarkasti. Tämä on tärkeää varsinkin lääketieteen parissa, missä jo pieni tarkkuuden parannus luokittelutuloksessa on erittäin tärkeää. Väitöskirjassa ontutkittu useita eri mittoja, joilla voidaan mitata samankaltaisuutta. Mitoille löytyy myös useita parametreja, joille voidaan etsiä juuri kyseiseen luokitteluongelmaan sopivat arvot. Tämä parametrien optimointi ongelma-alueeseen sopivaksi voidaan suorittaa mm. evoluutionääri- algoritmeja käyttäen. Kyseisessä työssä tähän on käytetty geneettistä algoritmia ja differentiaali-evoluutioalgoritmia. Luokittimen etuna on sen joustavuus. Ongelma-alueelle on helppo vaihtaa similaarisuusmitta, jos kyseinen mitta ei ole sopiva tutkittavaan ongelma-alueeseen. Myös eri mittojen parametrien optimointi voi parantaa tuloksia huomattavasti. Kun käytetään eri esikäsittelymenetelmiä ennen luokittelua, tuloksia pystytään parantamaan.
Resumo:
In this thesis author approaches the problem of automated text classification, which is one of basic tasks for building Intelligent Internet Search Agent. The work discusses various approaches to solving sub-problems of automated text classification, such as feature extraction and machine learning on text sources. Author also describes her own multiword approach to feature extraction and pres-ents the results of testing this approach using linear discriminant analysis based classifier, and classifier combining unsupervised learning for etalon extraction with supervised learning using common backpropagation algorithm for multilevel perceptron.
Resumo:
The main objective of the study is to form a framework that provides tools to recognise and classify items whose demand is not smooth but varies highly on size and/or frequency. The framework will then be combined with two other classification methods in order to form a three-dimensional classification model. Forecasting and inventory control of these abnormal demand items is difficult. Therefore another object of this study is to find out which statistical forecasting method is most suitable for forecasting of abnormal demand items. The accuracy of different methods is measured by comparing the forecast to the actual demand. Moreover, the study also aims at finding proper alternatives to the inventory control of abnormal demand items. The study is quantitative and the methodology is a case study. The research methods consist of theory, numerical data, current state analysis and testing of the framework in case company. The results of the study show that the framework makes it possible to recognise and classify the abnormal demand items. It is also noticed that the inventory performance of abnormal demand items differs significantly from the performance of smoothly demanded items. This makes the recognition of abnormal demand items very important.
Resumo:
Recent advances in machine learning methods enable increasingly the automatic construction of various types of computer assisted methods that have been difficult or laborious to program by human experts. The tasks for which this kind of tools are needed arise in many areas, here especially in the fields of bioinformatics and natural language processing. The machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question. However, their learning performance can often be improved by taking advantage of deeper insight of the application domain or the learning problem at hand. This thesis considers developing kernel-based learning algorithms incorporating this kind of prior knowledge of the task in question in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In the context of kernel-based learning methods, the incorporation of prior knowledge is often done by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit to the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take account of the positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suitable for the task of information retrieval and for more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of the kernel-based learning algorithms such as text categorization, and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions. We also design a fast cross-validation algorithm for regularized least-squares type of learning algorithm. Further, an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks is proposed. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and novel advanced kernels and cost functions can be used in algorithms efficiently.
Resumo:
In this work we study the classification of forest types using mathematics based image analysis on satellite data. We are interested in improving classification of forest segments when a combination of information from two or more different satellites is used. The experimental part is based on real satellite data originating from Canada. This thesis gives summary of the mathematics basics of the image analysis and supervised learning , methods that are used in the classification algorithm. Three data sets and four feature sets were investigated in this thesis. The considered feature sets were 1) histograms (quantiles) 2) variance 3) skewness and 4) kurtosis. Good overall performances were achieved when a combination of ASTERBAND and RADARSAT2 data sets was used.
Resumo:
Female sexual dysfunctions, including desire, arousal, orgasm and pain problems, have been shown to be highly prevalent among women around the world. The etiology of these dysfunctions is unclear but associations with health, age, psychological problems, and relationship factors have been identified. Genetic effects explain individual variation in orgasm function to some extent but until now quantitative behavior genetic analyses have not been applied to other sexual functions. In addition, behavior genetics can be applied to exploring the cause of any observed comorbidity between the dysfunctions. Discovering more about the etiology of the dysfunctions may further improve the classification systems which are currently under intense debate. The aims of the present thesis were to evaluate the psychometric properties of a Finnish-language version of a commonly used questionnaire for measuring female sexual function, the Female Sexual Function Index (FSFI), in order to investigate prevalence, comorbidity, and classification, and to explore the balance of genetic and environmental factors in the etiology as well as the associations of a number of biopsychosocial factors with female sexual functions. Female sexual functions were studied through survey methods in a population based sample of Finnish twins and their female siblings. There were two waves of data collection. The first data collection targeted 5,000 female twins aged 33–43 years and the second 7,680 female twins aged 18–33 and their over 18–year-old female siblings (n = 3,983). There was no overlap between the data collections. The combined overall response rate for both data collections was 53% (n = 8,868), with a better response rate in the second (57%) compared to the first (45%). In order to measure female sexual function, the FSFI was used. It includes 19 items which measure female sexual function during the previous four weeks in six subdomains; desire, subjective arousal, lubrication, orgasm, sexual satisfaction, and pain. In line with earlier research in clinical populations, a six factor solution of the Finnish-language version of the FSFI received supported. The internal consistencies of the scales were good to excellent. Some questions about how to avoid overestimating the prevalence of extreme dysfunctions due to women being allocated the score of zero if they had had no sexual activity during the preceding four weeks were raised. The prevalence of female sexual dysfunctions per se ranged from 11% for lubrication dysfunction to 55% for desire dysfunction. The prevalence rates for sexual dysfunction with concomitant sexual distress, in other words, sexual disorders were notably lower ranging from 7% for lubrication disorder to 23% for desire disorder. The comorbidity between the dysfunctions was substantial most notably between arousal and lubrication dysfunction even if these two dysfunctions showed distinct patterns of associations with the other dysfunctions. Genetic influences on individual variation in the six subdomains of FSFI were modest but significant ranging from 3–11% for additive genetic effects and 5–18% for nonadditive genetic effects. The rest of the variation in sexual functions was explained by nonshared environmental influences. A correlated factor model, including additive and nonadditive genetic effects and nonshared environmental effects had the best fit. All in all, every correlation between the genetic factors was significant except between lubrication and pain. All correlations between the nonshared environment factors were significant showing that there is a substantial overlap in genetic and nonshared environmental influences between the dysfunctions. In general, psychological problems, poor satisfaction with the relationship, sexual distress, and poor partner compatibility were associated with more sexual dysfunctions. Age was confounded with relationship length but had over and above relationship length a negative effect on desire and sexual satisfaction and a positive effect on orgasm and pain functions. Alcohol consumption in general was associated with better desire, arousal, lubrication, and orgasm function. Women pregnant with their first child had fewer pain problems than nulliparous nonpregnant women. Multiparous pregnant women had more orgasm problems compared to multiparous nonpregnant women. Having children was associated with less orgasm and pain problems. The conclusions were that desire, subjective arousal, lubrication, orgasm, sexual satisfaction, and pain are separate entities that have distinct associations with a number of different biopsychosocial factors. However, there is also considerable comorbidity between the dysfunctions which are explained by overlap in additive genetic, nonadditive genetic and nonshared environmental influences. Sexual dysfunctions are highly prevalent and are not always associated with sexual distress and this relationship might be moderated by a good relationship and compatibility with partner. Regarding classification, the results supports separate diagnoses for subjective arousal and genital arousal as well as the inclusion of pain under sexual dysfunctions.
Resumo:
The objective of this master’s thesis was twofold: first to examine the concept of customer value and its drivers and second to identify information use practices. The first part of the study represents explorative research that was carried out by examining a case company’s customer satisfaction data that was used to identify sales and technical customer service related value drivers on a detailed attribute level. This was followed by an examination of whether these attributes had been commented on in a positive or a negative light and what were the reasons why the case company had received higher or lower ratings than its competitor. As a result a classification of different sales and technical customer service related attributes was created. The results indicated that the case company has performed well, but that the results varied on the company’s business segment level. The case company’s staff, service and the benefits from a long-lasting relationship came up in a positive light whereas attitude, flexibility and reaction time came up in a negative light. The reasons for a higher or lower score in comparison to competitor varied. The results indicated that a customer’s satisfaction with the company’s performance did not always mean that the company was outperforming the competition. The second part of the study focused on customer satisfaction information use from the viewpoints of information access, dissemination and reaction. The study was conducted by running an internal survey among the case company’s staff. The results showed that information use practices varied across the company and some units or teams had taken a more proactive approach to the information use than others.
Resumo:
Longitudinal surveys are increasingly used to collect event history data on person-specific processes such as transitions between labour market states. Surveybased event history data pose a number of challenges for statistical analysis. These challenges include survey errors due to sampling, non-response, attrition and measurement. This study deals with non-response, attrition and measurement errors in event history data and the bias caused by them in event history analysis. The study also discusses some choices faced by a researcher using longitudinal survey data for event history analysis and demonstrates their effects. These choices include, whether a design-based or a model-based approach is taken, which subset of data to use and, if a design-based approach is taken, which weights to use. The study takes advantage of the possibility to use combined longitudinal survey register data. The Finnish subset of European Community Household Panel (FI ECHP) survey for waves 1–5 were linked at person-level with longitudinal register data. Unemployment spells were used as study variables of interest. Lastly, a simulation study was conducted in order to assess the statistical properties of the Inverse Probability of Censoring Weighting (IPCW) method in a survey data context. The study shows how combined longitudinal survey register data can be used to analyse and compare the non-response and attrition processes, test the missingness mechanism type and estimate the size of bias due to non-response and attrition. In our empirical analysis, initial non-response turned out to be a more important source of bias than attrition. Reported unemployment spells were subject to seam effects, omissions, and, to a lesser extent, overreporting. The use of proxy interviews tended to cause spell omissions. An often-ignored phenomenon classification error in reported spell outcomes, was also found in the data. Neither the Missing At Random (MAR) assumption about non-response and attrition mechanisms, nor the classical assumptions about measurement errors, turned out to be valid. Both measurement errors in spell durations and spell outcomes were found to cause bias in estimates from event history models. Low measurement accuracy affected the estimates of baseline hazard most. The design-based estimates based on data from respondents to all waves of interest and weighted by the last wave weights displayed the largest bias. Using all the available data, including the spells by attriters until the time of attrition, helped to reduce attrition bias. Lastly, the simulation study showed that the IPCW correction to design weights reduces bias due to dependent censoring in design-based Kaplan-Meier and Cox proportional hazard model estimators. The study discusses implications of the results for survey organisations collecting event history data, researchers using surveys for event history analysis, and researchers who develop methods to correct for non-sampling biases in event history data.
Resumo:
This thesis studies the development of service offering model that creates added-value for customers in the field of logistics services. The study focusses on offering classification and structures of model. The purpose of model is to provide value-added solutions for customers and enable superior service experience. The aim of thesis is to define what customers expect from logistics solution provider and what value customers appreciate so greatly that they could invest in value-added services. Value propositions, costs structures of offerings and appropriate pricing methods are studied. First, literature review of creating solution business model and customer value is conducted. Customer value is found out with customer interviews and qualitative empiric data is used. To exploit expertise knowledge of logistics, innovation workshop tool is utilized. Customers and experts are involved in the design process of model. As a result of thesis, three-level value-added service offering model is created based on empiric and theoretical data. Offerings with value propositions are proposed and the level of model reflects the deepness of customer-provider relationship and the amount of added value. Performance efficiency improvements and cost savings create the most added value for customers. Value-based pricing methods, such as performance-based models are suggested to apply. Results indicate the interest of benefitting networks and partnership in field of logistics services. Networks development is proposed to be investigated further.
Resumo:
Since the times preceding the Second World War the subject of aircraft tracking has been a core interest to both military and non-military aviation. During subsequent years both technology and configuration of the radars allowed the users to deploy it in numerous fields, such as over-the-horizon radar, ballistic missile early warning systems or forward scatter fences. The latter one was arranged in a bistatic configuration. The bistatic radar has continuously re-emerged over the last eighty years for its intriguing capabilities and challenging configuration and formulation. The bistatic radar arrangement is used as the basis of all the analyzes presented in this work. The aircraft tracking method of VHF Doppler-only information, developed in the first part of this study, is solely based on Doppler frequency readings in relation to time instances of their appearance. The corresponding inverse problem is solved by utilising a multistatic radar scenario with two receivers and one transmitter and using their frequency readings as a base for aircraft trajectory estimation. The quality of the resulting trajectory is then compared with ground-truth information based on ADS-B data. The second part of the study deals with the developement of a method for instantaneous Doppler curve extraction from within a VHF time-frequency representation of the transmitted signal, with a three receivers and one transmitter configuration, based on a priori knowledge of the probability density function of the first order derivative of the Doppler shift, and on a system of blocks for identifying, classifying and predicting the Doppler signal. The extraction capabilities of this set-up are tested with a recorded TV signal and simulated synthetic spectrograms. Further analyzes are devoted to more comprehensive testing of the capabilities of the extraction method. Besides testing the method, the classification of aircraft is performed on the extracted Bistatic Radar Cross Section profiles and the correlation between them for different types of aircraft. In order to properly estimate the profiles, the ADS-B aircraft location information is adjusted based on extracted Doppler frequency and then used for Bistatic Radar Cross Section estimation. The classification is based on seven types of aircraft grouped by their size into three classes.