820 resultados para Data classification
                                
Resumo:
The objective of this master’s thesis was twofold: first to examine the concept of customer value and its drivers and second to identify information use practices. The first part of the study represents explorative research that was carried out by examining a case company’s customer satisfaction data that was used to identify sales and technical customer service related value drivers on a detailed attribute level. This was followed by an examination of whether these attributes had been commented on in a positive or a negative light and what were the reasons why the case company had received higher or lower ratings than its competitor. As a result a classification of different sales and technical customer service related attributes was created. The results indicated that the case company has performed well, but that the results varied on the company’s business segment level. The case company’s staff, service and the benefits from a long-lasting relationship came up in a positive light whereas attitude, flexibility and reaction time came up in a negative light. The reasons for a higher or lower score in comparison to competitor varied. The results indicated that a customer’s satisfaction with the company’s performance did not always mean that the company was outperforming the competition. The second part of the study focused on customer satisfaction information use from the viewpoints of information access, dissemination and reaction. The study was conducted by running an internal survey among the case company’s staff. The results showed that information use practices varied across the company and some units or teams had taken a more proactive approach to the information use than others.
                                
Resumo:
Longitudinal surveys are increasingly used to collect event history data on person-specific processes such as transitions between labour market states. Surveybased event history data pose a number of challenges for statistical analysis. These challenges include survey errors due to sampling, non-response, attrition and measurement. This study deals with non-response, attrition and measurement errors in event history data and the bias caused by them in event history analysis. The study also discusses some choices faced by a researcher using longitudinal survey data for event history analysis and demonstrates their effects. These choices include, whether a design-based or a model-based approach is taken, which subset of data to use and, if a design-based approach is taken, which weights to use. The study takes advantage of the possibility to use combined longitudinal survey register data. The Finnish subset of European Community Household Panel (FI ECHP) survey for waves 1–5 were linked at person-level with longitudinal register data. Unemployment spells were used as study variables of interest. Lastly, a simulation study was conducted in order to assess the statistical properties of the Inverse Probability of Censoring Weighting (IPCW) method in a survey data context. The study shows how combined longitudinal survey register data can be used to analyse and compare the non-response and attrition processes, test the missingness mechanism type and estimate the size of bias due to non-response and attrition. In our empirical analysis, initial non-response turned out to be a more important source of bias than attrition. Reported unemployment spells were subject to seam effects, omissions, and, to a lesser extent, overreporting. The use of proxy interviews tended to cause spell omissions. An often-ignored phenomenon classification error in reported spell outcomes, was also found in the data. Neither the Missing At Random (MAR) assumption about non-response and attrition mechanisms, nor the classical assumptions about measurement errors, turned out to be valid. Both measurement errors in spell durations and spell outcomes were found to cause bias in estimates from event history models. Low measurement accuracy affected the estimates of baseline hazard most. The design-based estimates based on data from respondents to all waves of interest and weighted by the last wave weights displayed the largest bias. Using all the available data, including the spells by attriters until the time of attrition, helped to reduce attrition bias. Lastly, the simulation study showed that the IPCW correction to design weights reduces bias due to dependent censoring in design-based Kaplan-Meier and Cox proportional hazard model estimators. The study discusses implications of the results for survey organisations collecting event history data, researchers using surveys for event history analysis, and researchers who develop methods to correct for non-sampling biases in event history data.
                                
Resumo:
This thesis studies the development of service offering model that creates added-value for customers in the field of logistics services. The study focusses on offering classification and structures of model. The purpose of model is to provide value-added solutions for customers and enable superior service experience. The aim of thesis is to define what customers expect from logistics solution provider and what value customers appreciate so greatly that they could invest in value-added services. Value propositions, costs structures of offerings and appropriate pricing methods are studied. First, literature review of creating solution business model and customer value is conducted. Customer value is found out with customer interviews and qualitative empiric data is used. To exploit expertise knowledge of logistics, innovation workshop tool is utilized. Customers and experts are involved in the design process of model. As a result of thesis, three-level value-added service offering model is created based on empiric and theoretical data. Offerings with value propositions are proposed and the level of model reflects the deepness of customer-provider relationship and the amount of added value. Performance efficiency improvements and cost savings create the most added value for customers. Value-based pricing methods, such as performance-based models are suggested to apply. Results indicate the interest of benefitting networks and partnership in field of logistics services. Networks development is proposed to be investigated further.
                                
Resumo:
The predominant type of liver alteration in asymptomatic or oligosymptomatic chronic male alcoholics (N = 169) admitted to a psychiatric hospital for detoxification was classified by two independent methods: liver palpation and multiple quadratic discriminant analysis (QDA), the latter applied to two parameters reported by the patient (duration of alcoholism and daily amount ingested) and to the data obtained from eight biochemical blood determinations (total bilirubin, alkaline phosphatase, glycemia, potassium, aspartate aminotransferase, albumin, globulin, and sodium). All 11 soft and sensitive, and 13 firm and sensitive livers formed fully concordant groups as determined by QDA. Among the 22 soft and not sensitive livers, 95% were concordant by QDA grouping. Concordance rates were low (55%) in the 73 firm and not sensitive livers, and intermediate (76%) in the 50 not palpable livers. Prediction of the liver palpation characteristics by QDA was 95% correct for the firm and not sensitive livers and moderate for the other groups. On a preliminary basis, the variables considered to be most informative by QDA were the two anamnestic data and bilirubin levels, followed by alkaline phosphatase, glycemia and potassium, and then by aspartate aminotransferase and albumin. We conclude that, when biopsies would be too costly or potentially injurious to the patients to varying extents, clinical data could be considered valid to guide patient care, at least in the three groups (soft, not sensitive; soft, sensitive; firm, sensitive livers) in which the two noninvasive procedures were highly concordant in the present study.
                                
Resumo:
Since the times preceding the Second World War the subject of aircraft tracking has been a core interest to both military and non-military aviation. During subsequent years both technology and configuration of the radars allowed the users to deploy it in numerous fields, such as over-the-horizon radar, ballistic missile early warning systems or forward scatter fences. The latter one was arranged in a bistatic configuration. The bistatic radar has continuously re-emerged over the last eighty years for its intriguing capabilities and challenging configuration and formulation. The bistatic radar arrangement is used as the basis of all the analyzes presented in this work. The aircraft tracking method of VHF Doppler-only information, developed in the first part of this study, is solely based on Doppler frequency readings in relation to time instances of their appearance. The corresponding inverse problem is solved by utilising a multistatic radar scenario with two receivers and one transmitter and using their frequency readings as a base for aircraft trajectory estimation. The quality of the resulting trajectory is then compared with ground-truth information based on ADS-B data. The second part of the study deals with the developement of a method for instantaneous Doppler curve extraction from within a VHF time-frequency representation of the transmitted signal, with a three receivers and one transmitter configuration, based on a priori knowledge of the probability density function of the first order derivative of the Doppler shift, and on a system of blocks for identifying, classifying and predicting the Doppler signal. The extraction capabilities of this set-up are tested with a recorded TV signal and simulated synthetic spectrograms. Further analyzes are devoted to more comprehensive testing of the capabilities of the extraction method. Besides testing the method, the classification of aircraft is performed on the extracted Bistatic Radar Cross Section profiles and the correlation between them for different types of aircraft. In order to properly estimate the profiles, the ADS-B aircraft location information is adjusted based on extracted Doppler frequency and then used for Bistatic Radar Cross Section estimation. The classification is based on seven types of aircraft grouped by their size into three classes.
                                
Resumo:
Malnutrition constitutes a major public health concern worldwide and serves as an indicator of hospitalized patients’ prognosis. Although various methods with which to conduct nutritional assessments exist, large hospitals seldom employ them to diagnose malnutrition. The aim of this study was to understand the prevalence of child malnutrition at the University Hospital of the Ribeirão Preto Medical School, University of São, Brazil. A cross-sectional descriptive study was conducted to compare the nutritional status of 292 hospitalized children with that of a healthy control group (n=234). Information regarding patients’ weight, height, and bioelectrical impedance (i.e., bioelectrical impedance vector analysis) was obtained, and the phase angle was calculated. Using the World Health Organization (WHO) criteria, 35.27% of the patients presented with malnutrition; specifically, 16.10% had undernutrition and 19.17% were overweight. Classification according to the bioelectrical impedance results of nutritional status was more sensitive than the WHO criteria: of the 55.45% of patients with malnutrition, 51.25% exhibited undernutrition and 4.20% were overweight. After applying the WHO criteria in the unpaired control group (n=234), we observed that 100.00% of the subjects were eutrophic; however, 23.34% of the controls were malnourished according to impedance analysis. The phase angle was significantly lower in the hospitalized group than in the control group (P<0.05). Therefore, this study suggests that a protocol to obtain patients’ weight and height must be followed, and bioimpedance data must be examined upon hospital admission of all children.
                                
Resumo:
In the global phenomenon, the aging population becomes a critical issue. Data and information concerning elderly citizens are increasing and are not well organized. In addition, these unstructured data and information cause the problems for decision makers. Since we live in a digital world, Information Technology is considered to be a tool in order to solve problems. Data, information, and knowledge are crucial components to facilitate success in IT service system. Therefore, it is necessary to study how to organize or to govern data from various sources related elderly citizens. The research is conducted due to the fact that there is no internationally accepted holistic framework for governance of data. The research limits the scope to study on the healthcare domain; however, the results can be applied to the other areas. The research starts with an ongoing research of Dahlberg and Nokkala (2015) as a theory. It explains the classification of existing data sources and their characteristics with the focus on managerial perspectives. Then the studies of existing frameworks at international and national level organizations have been performed to show the current frameworks, which have been used and are useful in compiling data on elderly citizens. The international organizations in this research are selected based on their reputations and the reliability to obtain information. The selected countries at national level provide different point of views between two countries. Australia is a forerunner in IT governance while Thailand is the country which the author has familiar knowledge of the current situation. Considered the discussions of frameworks at international and national organizations level illustrate the main characteristics of each framework. At international organization level gives precedence to the interoperability of exchanging data and information between different parties. Whereas at national level shows the importance of the acknowledgement of using frameworks throughout the country in order to make the frameworks to be effective. After the studies of both international and national organization levels, the thesis shows the summarized tables to answer the fitness to the proposed framework by Dahlberg and Nokkala whether the framework help to consolidate data from various sources with different formats, hierarchies, structures, velocities, and other attributes of data storages. In addition, suggestions and recommendations will be proposed for the future research.
                                
Resumo:
The main focus of this thesis is to evaluate and compare Hyperbalilearning algorithm (HBL) to other learning algorithms. In this work HBL is compared to feed forward artificial neural networks using back propagation learning, K-nearest neighbor and 103 algorithms. In order to evaluate the similarity of these algorithms, we carried out three experiments using nine benchmark data sets from UCI machine learning repository. The first experiment compares HBL to other algorithms when sample size of dataset is changing. The second experiment compares HBL to other algorithms when dimensionality of data changes. The last experiment compares HBL to other algorithms according to the level of agreement to data target values. Our observations in general showed, considering classification accuracy as a measure, HBL is performing as good as most ANn variants. Additionally, we also deduced that HBL.:s classification accuracy outperforms 103's and K-nearest neighbour's for the selected data sets.
                                
Resumo:
Genetic Programming (GP) is a widely used methodology for solving various computational problems. GP's problem solving ability is usually hindered by its long execution times. In this thesis, GP is applied toward real-time computer vision. In particular, object classification and tracking using a parallel GP system is discussed. First, a study of suitable GP languages for object classification is presented. Two main GP approaches for visual pattern classification, namely the block-classifiers and the pixel-classifiers, were studied. Results showed that the pixel-classifiers generally performed better. Using these results, a suitable language was selected for the real-time implementation. Synthetic video data was used in the experiments. The goal of the experiments was to evolve a unique classifier for each texture pattern that existed in the video. The experiments revealed that the system was capable of correctly tracking the textures in the video. The performance of the system was on-par with real-time requirements.
                                
Resumo:
The curse of dimensionality is a major problem in the fields of machine learning, data mining and knowledge discovery. Exhaustive search for the most optimal subset of relevant features from a high dimensional dataset is NP hard. Sub–optimal population based stochastic algorithms such as GP and GA are good choices for searching through large search spaces, and are usually more feasible than exhaustive and deterministic search algorithms. On the other hand, population based stochastic algorithms often suffer from premature convergence on mediocre sub–optimal solutions. The Age Layered Population Structure (ALPS) is a novel metaheuristic for overcoming the problem of premature convergence in evolutionary algorithms, and for improving search in the fitness landscape. The ALPS paradigm uses an age–measure to control breeding and competition between individuals in the population. This thesis uses a modification of the ALPS GP strategy called Feature Selection ALPS (FSALPS) for feature subset selection and classification of varied supervised learning tasks. FSALPS uses a novel frequency count system to rank features in the GP population based on evolved feature frequencies. The ranked features are translated into probabilities, which are used to control evolutionary processes such as terminal–symbol selection for the construction of GP trees/sub-trees. The FSALPS metaheuristic continuously refines the feature subset selection process whiles simultaneously evolving efficient classifiers through a non–converging evolutionary process that favors selection of features with high discrimination of class labels. We investigated and compared the performance of canonical GP, ALPS and FSALPS on high–dimensional benchmark classification datasets, including a hyperspectral image. Using Tukey’s HSD ANOVA test at a 95% confidence interval, ALPS and FSALPS dominated canonical GP in evolving smaller but efficient trees with less bloat expressions. FSALPS significantly outperformed canonical GP and ALPS and some reported feature selection strategies in related literature on dimensionality reduction.
                                
Resumo:
The curse of dimensionality is a major problem in the fields of machine learning, data mining and knowledge discovery. Exhaustive search for the most optimal subset of relevant features from a high dimensional dataset is NP hard. Sub–optimal population based stochastic algorithms such as GP and GA are good choices for searching through large search spaces, and are usually more feasible than exhaustive and determinis- tic search algorithms. On the other hand, population based stochastic algorithms often suffer from premature convergence on mediocre sub–optimal solutions. The Age Layered Population Structure (ALPS) is a novel meta–heuristic for overcoming the problem of premature convergence in evolutionary algorithms, and for improving search in the fitness landscape. The ALPS paradigm uses an age–measure to control breeding and competition between individuals in the population. This thesis uses a modification of the ALPS GP strategy called Feature Selection ALPS (FSALPS) for feature subset selection and classification of varied supervised learning tasks. FSALPS uses a novel frequency count system to rank features in the GP population based on evolved feature frequencies. The ranked features are translated into probabilities, which are used to control evolutionary processes such as terminal–symbol selection for the construction of GP trees/sub-trees. The FSALPS meta–heuristic continuously refines the feature subset selection process whiles simultaneously evolving efficient classifiers through a non–converging evolutionary process that favors selection of features with high discrimination of class labels. We investigated and compared the performance of canonical GP, ALPS and FSALPS on high–dimensional benchmark classification datasets, including a hyperspectral image. Using Tukey’s HSD ANOVA test at a 95% confidence interval, ALPS and FSALPS dominated canonical GP in evolving smaller but efficient trees with less bloat expressions. FSALPS significantly outperformed canonical GP and ALPS and some reported feature selection strategies in related literature on dimensionality reduction.
                                
Resumo:
Affiliation: Centre Robert-Cedergren de l'Université de Montréal en bio-informatique et génomique & Département de biochimie, Université de Montréal
                                
Resumo:
Les employés d’un organisme utilisent souvent un schéma de classification personnel pour organiser les documents électroniques qui sont sous leur contrôle direct, ce qui suggère la difficulté pour d’autres employés de repérer ces documents et la perte possible de documentation pour l’organisme. Aucune étude empirique n’a été menée à ce jour afin de vérifier dans quelle mesure les schémas de classification personnels permettent, ou même facilitent, le repérage des documents électroniques par des tiers, dans le cadre d’un travail collaboratif par exemple, ou lorsqu’il s’agit de reconstituer un dossier. Le premier objectif de notre recherche était de décrire les caractéristiques de schémas de classification personnels utilisés pour organiser et classer des documents administratifs électroniques. Le deuxième objectif consistait à vérifier, dans un environnement contrôlé, les différences sur le plan de l’efficacité du repérage de documents électroniques qui sont fonction du schéma de classification utilisé. Nous voulions vérifier s’il était possible de repérer un document avec la même efficacité, quel que soit le schéma de classification utilisé pour ce faire. Une collecte de données en deux étapes fut réalisée pour atteindre ces objectifs. Nous avons d’abord identifié les caractéristiques structurelles, logiques et sémantiques de 21 schémas de classification utilisés par des employés de l’Université de Montréal pour organiser et classer les documents électroniques qui sont sous leur contrôle direct. Par la suite, nous avons comparé, à partir d'une expérimentation contrôlée, la capacité d’un groupe de 70 répondants à repérer des documents électroniques à l’aide de cinq schémas de classification ayant des caractéristiques structurelles, logiques et sémantiques variées. Trois variables ont été utilisées pour mesurer l’efficacité du repérage : la proportion de documents repérés, le temps moyen requis (en secondes) pour repérer les documents et la proportion de documents repérés dès le premier essai. Les résultats révèlent plusieurs caractéristiques structurelles, logiques et sémantiques communes à une majorité de schémas de classification personnels : macro-structure étendue, structure peu profonde, complexe et déséquilibrée, regroupement par thème, ordre alphabétique des classes, etc. Les résultats des tests d’analyse de la variance révèlent des différences significatives sur le plan de l’efficacité du repérage de documents électroniques qui sont fonction des caractéristiques structurelles, logiques et sémantiques du schéma de classification utilisé. Un schéma de classification caractérisé par une macro-structure peu étendue et une logique basée partiellement sur une division par classes d’activités augmente la probabilité de repérer plus rapidement les documents. Au plan sémantique, une dénomination explicite des classes (par exemple, par utilisation de définitions ou en évitant acronymes et abréviations) augmente la probabilité de succès au repérage. Enfin, un schéma de classification caractérisé par une macro-structure peu étendue, une logique basée partiellement sur une division par classes d’activités et une sémantique qui utilise peu d’abréviations augmente la probabilité de repérer les documents dès le premier essai.
                                
Resumo:
Les modèles à sur-représentation de zéros discrets et continus ont une large gamme d'applications et leurs propriétés sont bien connues. Bien qu'il existe des travaux portant sur les modèles discrets à sous-représentation de zéro et modifiés à zéro, la formulation usuelle des modèles continus à sur-représentation -- un mélange entre une densité continue et une masse de Dirac -- empêche de les généraliser afin de couvrir le cas de la sous-représentation de zéros. Une formulation alternative des modèles continus à sur-représentation de zéros, pouvant aisément être généralisée au cas de la sous-représentation, est présentée ici. L'estimation est d'abord abordée sous le paradigme classique, et plusieurs méthodes d'obtention des estimateurs du maximum de vraisemblance sont proposées. Le problème de l'estimation ponctuelle est également considéré du point de vue bayésien. Des tests d'hypothèses classiques et bayésiens visant à déterminer si des données sont à sur- ou sous-représentation de zéros sont présentées. Les méthodes d'estimation et de tests sont aussi évaluées au moyen d'études de simulation et appliquées à des données de précipitation agrégées. Les diverses méthodes s'accordent sur la sous-représentation de zéros des données, démontrant la pertinence du modèle proposé. Nous considérons ensuite la classification d'échantillons de données à sous-représentation de zéros. De telles données étant fortement non normales, il est possible de croire que les méthodes courantes de détermination du nombre de grappes s'avèrent peu performantes. Nous affirmons que la classification bayésienne, basée sur la distribution marginale des observations, tiendrait compte des particularités du modèle, ce qui se traduirait par une meilleure performance. Plusieurs méthodes de classification sont comparées au moyen d'une étude de simulation, et la méthode proposée est appliquée à des données de précipitation agrégées provenant de 28 stations de mesure en Colombie-Britannique.
                                
Resumo:
Dans le domaine des neurosciences computationnelles, l'hypothèse a été émise que le système visuel, depuis la rétine et jusqu'au cortex visuel primaire au moins, ajuste continuellement un modèle probabiliste avec des variables latentes, à son flux de perceptions. Ni le modèle exact, ni la méthode exacte utilisée pour l'ajustement ne sont connus, mais les algorithmes existants qui permettent l'ajustement de tels modèles ont besoin de faire une estimation conditionnelle des variables latentes. Cela nous peut nous aider à comprendre pourquoi le système visuel pourrait ajuster un tel modèle; si le modèle est approprié, ces estimé conditionnels peuvent aussi former une excellente représentation, qui permettent d'analyser le contenu sémantique des images perçues. Le travail présenté ici utilise la performance en classification d'images (discrimination entre des types d'objets communs) comme base pour comparer des modèles du système visuel, et des algorithmes pour ajuster ces modèles (vus comme des densités de probabilité) à des images. Cette thèse (a) montre que des modèles basés sur les cellules complexes de l'aire visuelle V1 généralisent mieux à partir d'exemples d'entraînement étiquetés que les réseaux de neurones conventionnels, dont les unités cachées sont plus semblables aux cellules simples de V1; (b) présente une nouvelle interprétation des modèles du système visuels basés sur des cellules complexes, comme distributions de probabilités, ainsi que de nouveaux algorithmes pour les ajuster à des données; et (c) montre que ces modèles forment des représentations qui sont meilleures pour la classification d'images, après avoir été entraînés comme des modèles de probabilités. Deux innovations techniques additionnelles, qui ont rendu ce travail possible, sont également décrites : un algorithme de recherche aléatoire pour sélectionner des hyper-paramètres, et un compilateur pour des expressions mathématiques matricielles, qui peut optimiser ces expressions pour processeur central (CPU) et graphique (GPU).
 
                    