This thesis presents the results of a survey and analysis of data quality in the database Vägdatabanken at Vägverket (the Swedish Road Administration). We studied data quality as it appears to the end users, in this case those who procure operation and maintenance of state roads. The goal was to see how well Vägverket's data requirements match the users' needs. To reach this goal, a model with seven phases was created. Central to it was a mapping of the organisation's activities through interviews, which became the basis for further analyses. A model of our own was necessary because the approach of starting from the users is not the established way of examining data quality. The results show a range of costs attributable to deficient data quality, many of them hidden. Credibility is important in connection with data quality. The users do not trust the data, which creates direct costs, extra work and more. This in turn is due to deficiencies in data collection. One problem is that a range of different working methods and attitudes towards the data in Vägdatabanken exist. It is therefore recommended that Vägverket decide which working method is economically and functionally best. The aim was to describe the concept of data quality. The results of practical work and theoretical studies yield the definition "Data quality is the ability of data to be understandable and credible and to meet the user's needs sufficiently well".


Combining an object-oriented programming language with a relational database causes problems for developers because the two have different focuses: object-oriented programming languages focus on modelling real-world objects, whereas relational databases focus on data. These problems are collectively called the object-relational mismatch. There are several frameworks for handling them; one is Entity Framework. The purpose of this project was to evaluate how well developers think Entity Framework solves the problems around the object-relational mismatch, what it is like for developers to learn Entity Framework, and how available learning material is. During our study we learned to use Entity Framework while also surveying the availability of learning material. We also rebuilt an application so that it uses Entity Framework, and compared the rebuilt application with the old one to see what difference Entity Framework made. We concluded that Entity Framework handles the object-relational mismatch well, which among other things shortens the development process because less code needs to be written. Developers with prior knowledge of .NET programming find Entity Framework easy to learn. That it is perceived as easy to learn is probably connected to the good availability of learning material.


This paper looks at the theoretical conditions underpinning unique localization of synchronized multiple emitters using Time-of-Arrival measurements subjected to the data-association problem. The necessary fundamental requirements to solve the so-called ghost node problem associated with sensor arrays are examined. We derive a measurement bound for ideal situations and the underlying concepts are illustrated via simulations.


In the context of collaborative filtering (CF), the well-known data sparsity issue causes two like-minded users to have little measurable similarity, and consequently renders the k nearest neighbour rule inapplicable. In this paper, we address the data sparsity problem in neighbourhood-based CF methods by proposing an Adaptive-Maximum imputation method (AdaM). The basic idea is to identify an imputation area that maximizes the imputation benefit for recommendation purposes while minimizing the imputation error introduced. To achieve the maximum imputation benefit, the imputation area is determined from both the user and the item perspectives; to minimize the imputation error, at least one real rating is preserved for each item in the identified imputation area. A theoretical analysis is provided to prove that the proposed imputation method outperforms the conventional neighbourhood-based CF methods through more accurate neighbour identification. Experimental results on benchmark datasets show that the proposed method significantly outperforms other related state-of-the-art imputation-based methods in terms of accuracy.
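The sparsity issue described above can be illustrated with a minimal sketch. It is not the AdaM method; it only assumes a cosine similarity restricted to co-rated items, and the user names and ratings are hypothetical. Two users with similar tastes but no items in common end up with zero similarity, which is the situation imputation is meant to repair.

```python
from math import sqrt

def cosine_on_corated(ratings_a, ratings_b):
    """Cosine similarity computed only over items both users rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        # No overlap: the similarity is undefined and is treated as 0 here.
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

# Hypothetical ratings: alice and bob never rated the same movie.
alice = {"m1": 5, "m2": 4}
bob = {"m3": 5, "m4": 4}
carol = {"m1": 5, "m2": 4, "m3": 5}

sim_ab = cosine_on_corated(alice, bob)    # no co-rated items
sim_ac = cosine_on_corated(alice, carol)  # two co-rated items
```

Here `sim_ab` is zero purely because the rating matrix is sparse, so bob can never be selected as a neighbour of alice regardless of how similar their tastes are.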


In this paper, we observe that user preference styles tend to change regularly following certain patterns. Therefore, we propose a Preference Pattern model to capture user preference styles and their temporal dynamics, and apply this model to improve the accuracy of Top-N recommendation. Precisely, a preference pattern is defined as a set of user preference styles sorted in time order. The basic idea is to model user preference styles and their temporal dynamics by constructing a representative subspace with an Expectation-Maximization (EM)-like algorithm, which works iteratively by refining the global and the personal preference styles simultaneously. Then, the degree to which the recommendations match the active user's preference styles can be estimated by measuring their reconstruction error from their projection on the representative subspace. The experimental results indicate that the proposed model is robust to the data sparsity problem and can significantly outperform state-of-the-art algorithms on Top-N recommendation in terms of accuracy. © 2012 IEEE.


Modelling the temporal dynamics of personal preferences is still under-developed despite the rapid development of personalization. In this paper, we observe that user preference styles tend to change regularly following certain patterns in the context of movie recommendation systems. Therefore, we propose a Preference Pattern model to capture user preference styles and their temporal dynamics, and apply this model to improve the accuracy of Top-N movie recommendations. Precisely, a preference pattern is defined as a set of user preference styles sorted in time order. The basic idea is to model user preference styles and their temporal dynamics by constructing a representative subspace with an Expectation-Maximization (EM)-like algorithm, which works iteratively by refining the global and the personal preference styles simultaneously. Then, the degree to which the recommendations match the active user's preference styles can be estimated by measuring their reconstruction error from their projection on the representative subspace. The experimental results indicate that the proposed model is robust to the data sparsity problem and can significantly outperform state-of-the-art algorithms on Top-N movie recommendations in terms of accuracy.
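The reconstruction-error idea can be sketched as follows. This is only an illustration under stated assumptions, not the paper's model: the subspace is assumed to be given by an orthonormal basis, the error is taken as the Euclidean norm of the residual, and the vectors are hypothetical. How the representative subspace is learned (the EM-like step) is not shown.

```python
from math import sqrt

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def reconstruction_error(vector, basis):
    """Distance between `vector` and its projection onto span(basis).

    `basis` is assumed to be orthonormal (an assumption of this sketch).
    """
    projection = [0.0] * len(vector)
    for b in basis:
        coeff = dot(vector, b)
        projection = [p + coeff * x for p, x in zip(projection, b)]
    residual = [v - p for v, p in zip(vector, projection)]
    return sqrt(dot(residual, residual))

# Hypothetical 2-D subspace of a 3-D feature space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

in_subspace = reconstruction_error([3.0, 4.0, 0.0], basis)
off_subspace = reconstruction_error([3.0, 4.0, 2.0], basis)
```

A candidate that lies in the subspace has zero error, while one with a component outside it has a positive error, so a smaller value can be read as a better match to the modelled preference styles.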


This thesis analyses problems related to the applicability of Process Mining tools and techniques in business environments. The first contribution is a presentation of the state of the art of Process Mining and a characterization of companies in terms of their "process awareness". The work continues by identifying circumstances in which problems can emerge: data preparation, actual mining, and results interpretation. Other problems are the configuration of parameters by non-expert users and computational complexity. We concentrate on two possible scenarios: "batch" and "on-line" Process Mining. Concerning batch Process Mining, we first investigate the data preparation problem and propose a solution for the identification of the "case-ids" whenever this field is not explicitly indicated. After that, we concentrate on problems at mining time and propose the generalization of a well-known control-flow discovery algorithm in order to exploit non-instantaneous events. The usage of interval-based recording leads to an important improvement of performance. Later on, we report our work on parameter configuration for non-expert users. We present two approaches to select the "best" parameter configuration: one is completely autonomous; the other requires human interaction to navigate a hierarchy of candidate models. Concerning data interpretation and results evaluation, we propose two metrics: a model-to-model metric and a model-to-log metric. Finally, we present an automatic approach for the extension of a control-flow model with social information, in order to simplify the analysis of these perspectives. The second part of this thesis deals with control-flow discovery algorithms in on-line settings. We propose a formal definition of the problem and two baseline approaches. Two actual mining algorithms are proposed: the first is the adaptation, to the control-flow discovery problem, of a frequency counting algorithm; the second constitutes a framework of models which can be used for different kinds of streams (stationary versus evolving).
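To make the idea of adapting a frequency counting algorithm to control-flow discovery concrete, here is a minimal sketch. The abstract does not name the algorithm, so the choice of Lossy Counting is an assumption, as is the choice to count directly-follows pairs of activities per case; the class name, function name and event stream are hypothetical.

```python
from math import ceil

class LossyCounter:
    """Approximate frequency counts over a stream (Lossy Counting)."""

    def __init__(self, epsilon):
        self.width = ceil(1 / epsilon)  # bucket width
        self.n = 0                      # items seen so far
        self.entries = {}               # item -> [count, max_error]

    def add(self, item):
        self.n += 1
        bucket = ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        # At each bucket boundary, drop items that are too infrequent.
        if self.n % self.width == 0:
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

def directly_follows(stream, epsilon):
    """Count (previous activity, activity) pairs within each case."""
    counter = LossyCounter(epsilon)
    last_activity = {}  # case id -> most recent activity (unbounded here)
    for case_id, activity in stream:
        if case_id in last_activity:
            counter.add((last_activity[case_id], activity))
        last_activity[case_id] = activity
    return {pair: count for pair, (count, _) in counter.entries.items()}

# Hypothetical event stream of (case id, activity) observations.
stream = [("c1", "A"), ("c2", "A"), ("c1", "B"),
          ("c2", "B"), ("c1", "C"), ("c2", "D")]
pairs = directly_follows(stream, epsilon=0.1)
```

The retained pair counts could then feed a control-flow discovery step that works from directly-follows frequencies. Note that this sketch keeps the last activity of every case in memory, which a real streaming setting would also need to bound.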


The focus of this thesis is to contribute to the development of new, exact solution approaches to different combinatorial optimization problems. In particular, we derive dedicated algorithms for a special class of Traveling Tournament Problems (TTPs), the Dial-A-Ride Problem (DARP), and the Vehicle Routing Problem with Time Windows and Temporal Synchronized Pickup and Delivery (VRPTWTSPD). Furthermore, we extend the concept of using dual-optimal inequalities for stabilized Column Generation (CG) and detail its application to improved CG algorithms for the cutting stock problem, the bin packing problem, the vertex coloring problem, and the bin packing problem with conflicts. In all approaches, we make use of some knowledge about the structure of the problem at hand to individualize and enhance existing algorithms. Specifically, we utilize knowledge about the input data (TTP), problem-specific constraints (DARP and VRPTWTSPD), and the dual solution space (stabilized CG). Extensive computational results proving the usefulness of the proposed methods are reported.


With substance abuse treatment expanding in prisons and jails, understanding how behavior change interacts with a restricted setting becomes more essential. The Transtheoretical Model (TTM) has been used to understand intentional behavior change in unrestricted settings; however, evidence indicates that restrictive settings can affect the measurement and structure of the TTM constructs. The present study examined data from problem drinkers at baseline and end-of-treatment from three studies: (1) Project CARE (n = 187) recruited inmates from a large county jail; (2) Project Check-In (n = 116) recruited inmates from a state prison; (3) Project MATCH, a large multi-site alcohol study, had two recruitment arms, aftercare (n = 724 pre-treatment and 650 post-treatment) and outpatient (n = 912 pre-treatment and 844 post-treatment). The analyses were conducted using cross-sectional data to test for non-invariance of measures of the TTM constructs (readiness, confidence, temptation, and processes of change) across restricted and unrestricted settings, using Structural Equation Modeling (SEM). Two restricted groups (jail and aftercare) and one unrestricted group (outpatient) entering treatment, and one restricted group (prison) and two unrestricted groups (aftercare and outpatient) at end-of-treatment, were contrasted. In addition, TTM end-of-treatment profiles were tested as predictors of 12-month drinking outcomes (Profile Analysis). Although SEM did not indicate structural differences in the overall TTM construct model across setting types, there were factor structure differences on the confidence and temptation constructs at pre-treatment and in the factor structure of the behavioral processes at end-of-treatment. For pre-treatment temptation and confidence, differences were found in the social situations factor loadings and in the variance for the confidence and temptation latent factors.
For the end-of-treatment behavioral processes, differences across the restricted and unrestricted settings were identified in the counter-conditioning and stimulus control factor loadings. The TTM end-of-treatment profiles were not predictive of drinking outcomes in the prison sample. Both pre- and post-treatment differences in structure across setting types involved constructs operationalized with behaviors that are limited for those in restricted settings. These studies suggest the TTM is a viable model for explicating addictive behavior change in restricted settings but call for modification of subscale items that refer to specific behaviors and for caution in interpreting the mean differences across setting types for problem drinkers.


Machine learning techniques are used for extracting valuable knowledge from data. Nowadays, these techniques are becoming even more important due to the evolution in data acquisition and storage, which is leading to data with different characteristics that must be exploited. Therefore, advances in data collection must be accompanied by advances in machine learning techniques to solve new challenges that might arise, in both academic and real applications. There are several machine learning techniques, depending on both data characteristics and purpose. Unsupervised classification, or clustering, is one of the best-known techniques when data lack supervision (unlabeled data) and the aim is to discover data groups (clusters) according to their similarity. On the other hand, supervised classification needs data with supervision (labeled data) and its aim is to make predictions about labels of new data. The presence of data labels is a very important characteristic that guides not only the learning task but also other related tasks such as validation. When only some of the available data are labeled whereas the others remain unlabeled (partially labeled data), neither clustering nor supervised classification can be used. This scenario, which is becoming common nowadays because of labeling process ignorance or cost, is tackled with semi-supervised learning techniques. This thesis focuses on the branch of semi-supervised learning closest to clustering, i.e., discovering clusters using available labels as support to guide and improve the clustering process. Another important data characteristic, different from the presence of data labels, is the relevance or not of data features. Data are characterized by features, but it is possible that not all of them are relevant, or equally relevant, for the learning process.
A recent clustering tendency, related to data relevance and called subspace clustering, claims that different clusters might be described by different feature subsets. This differs from traditional solutions to the data relevance problem, where a single feature subset (usually the complete set of original features) is found and used to perform the clustering process. The proximity of this work to clustering leads to the first goal of this thesis. As commented above, clustering validation is a difficult task due to the absence of data labels. Although there are many indices that can be used to assess the quality of clustering solutions, these validations depend on clustering algorithms and data characteristics. Hence, in the first goal three known clustering algorithms are used to cluster data with outliers and noise, to critically study how some of the best-known validation indices behave. The main goal of this work is, however, to combine semi-supervised clustering with subspace clustering to obtain clustering solutions that can be correctly validated by using either known indices or expert opinions. Two different algorithms are proposed from different points of view to discover clusters characterized by different subspaces. In the first algorithm, available data labels are first used to search for subspaces, before searching for clusters. This algorithm assigns each instance to only one cluster (hard clustering) and is based on mapping known labels to subspaces using supervised classification techniques. Subspaces are then used to find clusters using traditional clustering techniques. The second algorithm uses available data labels to search for subspaces and clusters at the same time in an iterative process. This algorithm assigns each instance to each cluster based on a membership probability (soft clustering) and is based on integrating known labels and the search for subspaces into a model-based clustering approach.
The different proposals are tested using different real and synthetic databases, and comparisons to other methods are also included when appropriate. Finally, as an example of a real and current application, different machine learning techniques, including one of the proposals of this work (the most sophisticated one), are applied to a task within one of the most challenging biological problems nowadays, human brain modeling. Specifically, expert neuroscientists do not agree on a neuron classification for the brain cortex, which makes impossible not only any modeling attempt but also day-to-day work, since there is no common way to name neurons. Therefore, machine learning techniques may help to reach an accepted solution to this problem, which can be an important milestone for future research in neuroscience.


In developing neural network techniques for real world applications it is still very rare to see estimates of confidence placed on the neural network predictions. This is a major deficiency, especially in safety-critical systems. In this paper we explore three distinct methods of producing point-wise confidence intervals using neural networks. We compare and contrast Bayesian, Gaussian Process and Predictive error bars evaluated on real data. The problem domain is concerned with the calibration of a real automotive engine management system for both air-fuel ratio determination and on-line ignition timing. This problem requires real-time control and is a good candidate for exploring the use of confidence predictions due to its safety-critical nature.


Interestingness in association rules has been a major topic of research in the past decade. The reason is that the strength of association rules, i.e. their ability to discover ALL patterns given some thresholds on support and confidence, is also their weakness. Indeed, a typical association rules analysis on real data often results in hundreds or thousands of patterns, creating a data mining problem of the second order. In other words, it is not straightforward to determine which of those rules are interesting for the end user. This paper provides an overview of some existing measures of interestingness and comments on their properties. In general, interestingness measures can be divided into objective and subjective measures. Objective measures tend to express interestingness by means of statistical or mathematical criteria, whereas subjective measures of interestingness aim at capturing more practical criteria that should be taken into account, such as unexpectedness or actionability of rules. This paper focuses only on objective measures of interestingness.
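As a small illustration of objective measures, the sketch below computes support and confidence, the two thresholds mentioned above, together with lift, a commonly used objective measure that is chosen here as an example and is not necessarily one the paper discusses. The transactions and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule A -> C."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(transactions, {"bread", "milk"})       # 2 of 4 transactions
c = confidence(transactions, {"bread"}, {"milk"})  # 0.5 / 0.75
l = lift(transactions, {"bread"}, {"milk"})        # c / 0.75
```

In this toy data the rule bread -> milk passes moderate support and confidence thresholds, yet its lift is below 1, meaning milk is slightly less likely in baskets with bread than overall. This is the kind of distinction an interestingness measure is meant to expose.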


Gravity surveying is challenging in Antarctica because of its hostile environment and inaccessibility. Nevertheless, many ground-based, airborne and shipborne gravity campaigns have been completed by the geophysical and geodetic communities since the 1980s. We present the first modern Antarctic-wide gravity data compilation derived from 13 million data points covering an area of 10 million km², which corresponds to 73% coverage of the continent. The remove-compute-restore technique was applied for gridding, which facilitated levelling of the different gravity datasets with respect to an Earth Gravity Model derived from satellite data alone. The resulting free-air and Bouguer gravity anomaly grids of 10 km resolution are publicly available. These grids will enable new high-resolution combined Earth Gravity Models to be derived and represent a major step forward towards solving the geodetic polar data gap problem. They provide a new tool to investigate continental-scale lithospheric structure and geological evolution of Antarctica.