811 resultados para Data-driven analysis
Resumo:
The objectives of this study were to identify and measure the average outcomes of the Open Door Mission's nine-month community-based substance abuse treatment program, identify predictors of successful outcomes, and make recommendations to the Open Door Mission for improving its treatment program.^ The Mission's program is exclusive to adult men who have limited financial resources: most of which were homeless or dependent on parents or other family members for basic living needs. Many, but not all, of these men are either chemically dependent or have a history of substance abuse.^ This study tracked a cohort of the Mission's graduates throughout this one-year study and identified various indicators of success at short-term intervals, which may be predictive of longer-term outcomes. We tracked various levels of 12-step program involvement, as well as other social and spiritual activities, such as church affiliation and recovery support.^ Twenty-four of the 66 subjects, or 36% met the Mission's requirements for success. Specific to this success criteria; Fifty-four, or 82% reported affiliation with a home church; Twenty-six, or 39% reported full-time employment; Sixty-one, or 92% did not report or were not identified as having any post-treatment arrests or incarceration, and; Forty, or 61% reported continuous abstinence from both drugs and alcohol.^ Five research-based hypotheses were developed and tested. The primary analysis tool was the web-based non-parametric dependency modeling tool, B-Course, which revealed some strong associations with certain variables, and helped the researchers generate and test several data-driven hypotheses. Full-time employment is the greatest predictor of abstinence: 95% of those who reported full time employment also reported continuous post-treatment abstinence, while 50% of those working part-time were abstinent and 29% of those with no employment were abstinent. Working with a 12-step sponsor, attending aftercare, and service with others were identified as predictors of abstinence.^ This study demonstrates that associations with abstinence and the ODM success criteria are not simply based on one social or behavioral factor. Rather, these relationships are interdependent, and show that abstinence is achieved and maintained through a combination of several 12-step recovery activities. This study used a simple assessment methodology, which demonstrated strong associations across variables and outcomes, which have practical applicability to the Open Door Mission for improving its treatment program. By leveraging the predictive capability of the various success determination methodologies discussed and developed throughout this study, we can identify accurate outcomes with both validity and reliability. This assessment instrument can also be used as an intervention that, if operationalized to the Mission’s clients during the primary treatment program, may measurably improve the effectiveness and outcomes of the Open Door Mission.^
Resumo:
Complex diseases such as cancer result from multiple genetic changes and environmental exposures. Due to the rapid development of genotyping and sequencing technologies, we are now able to more accurately assess causal effects of many genetic and environmental factors. Genome-wide association studies have been able to localize many causal genetic variants predisposing to certain diseases. However, these studies only explain a small portion of variations in the heritability of diseases. More advanced statistical models are urgently needed to identify and characterize some additional genetic and environmental factors and their interactions, which will enable us to better understand the causes of complex diseases. In the past decade, thanks to the increasing computational capabilities and novel statistical developments, Bayesian methods have been widely applied in the genetics/genomics researches and demonstrating superiority over some regular approaches in certain research areas. Gene-environment and gene-gene interaction studies are among the areas where Bayesian methods may fully exert its functionalities and advantages. This dissertation focuses on developing new Bayesian statistical methods for data analysis with complex gene-environment and gene-gene interactions, as well as extending some existing methods for gene-environment interactions to other related areas. It includes three sections: (1) Deriving the Bayesian variable selection framework for the hierarchical gene-environment and gene-gene interactions; (2) Developing the Bayesian Natural and Orthogonal Interaction (NOIA) models for gene-environment interactions; and (3) extending the applications of two Bayesian statistical methods which were developed for gene-environment interaction studies, to other related types of studies such as adaptive borrowing historical data. We propose a Bayesian hierarchical mixture model framework that allows us to investigate the genetic and environmental effects, gene by gene interactions (epistasis) and gene by environment interactions in the same model. It is well known that, in many practical situations, there exists a natural hierarchical structure between the main effects and interactions in the linear model. Here we propose a model that incorporates this hierarchical structure into the Bayesian mixture model, such that the irrelevant interaction effects can be removed more efficiently, resulting in more robust, parsimonious and powerful models. We evaluate both of the 'strong hierarchical' and 'weak hierarchical' models, which specify that both or one of the main effects between interacting factors must be present for the interactions to be included in the model. The extensive simulation results show that the proposed strong and weak hierarchical mixture models control the proportion of false positive discoveries and yield a powerful approach to identify the predisposing main effects and interactions in the studies with complex gene-environment and gene-gene interactions. We also compare these two models with the 'independent' model that does not impose this hierarchical constraint and observe their superior performances in most of the considered situations. The proposed models are implemented in the real data analysis of gene and environment interactions in the cases of lung cancer and cutaneous melanoma case-control studies. The Bayesian statistical models enjoy the properties of being allowed to incorporate useful prior information in the modeling process. Moreover, the Bayesian mixture model outperforms the multivariate logistic model in terms of the performances on the parameter estimation and variable selection in most cases. Our proposed models hold the hierarchical constraints, that further improve the Bayesian mixture model by reducing the proportion of false positive findings among the identified interactions and successfully identifying the reported associations. This is practically appealing for the study of investigating the causal factors from a moderate number of candidate genetic and environmental factors along with a relatively large number of interactions. The natural and orthogonal interaction (NOIA) models of genetic effects have previously been developed to provide an analysis framework, by which the estimates of effects for a quantitative trait are statistically orthogonal regardless of the existence of Hardy-Weinberg Equilibrium (HWE) within loci. Ma et al. (2012) recently developed a NOIA model for the gene-environment interaction studies and have shown the advantages of using the model for detecting the true main effects and interactions, compared with the usual functional model. In this project, we propose a novel Bayesian statistical model that combines the Bayesian hierarchical mixture model with the NOIA statistical model and the usual functional model. The proposed Bayesian NOIA model demonstrates more power at detecting the non-null effects with higher marginal posterior probabilities. Also, we review two Bayesian statistical models (Bayesian empirical shrinkage-type estimator and Bayesian model averaging), which were developed for the gene-environment interaction studies. Inspired by these Bayesian models, we develop two novel statistical methods that are able to handle the related problems such as borrowing data from historical studies. The proposed methods are analogous to the methods for the gene-environment interactions on behalf of the success on balancing the statistical efficiency and bias in a unified model. By extensive simulation studies, we compare the operating characteristics of the proposed models with the existing models including the hierarchical meta-analysis model. The results show that the proposed approaches adaptively borrow the historical data in a data-driven way. These novel models may have a broad range of statistical applications in both of genetic/genomic and clinical studies.
Resumo:
The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists to explore interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementary, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.
Resumo:
Abstract. The uptake of Linked Data (LD) has promoted the proliferation of datasets and their associated ontologies for describing different domains. Ac-cording to LD principles, developers should reuse as many available terms as possible to describe their data. Importing ontologies or referring to their terms’ URIs are the two main ways to reuse knowledge from available ontologies. In this paper, we have analyzed 18589 terms appearing within 196 ontologies in-cluded in the Linked Open Vocabularies (LOV) registry with the aim of under-standing the current state of ontology reuse in the LD context. In order to char-acterize the landscape of ontology reuse in this context, we have extracted sta-tistics about currently reused elements, calculated ratios for reuse, and drawn graphs about imports and references between ontologies. Keywords: ontology, vocabulary, reuse, linked data, ontology import
Resumo:
Abstract interpretation-based data-flow analysis of logic programs is, at this point, relatively well understood from the point of view of general frameworks and abstract domains. On the other hand, comparatively little attention has been given to the problems which arise when analysis of a full, practical dialect of the Prolog language is attempted, and only few solutions to these problems have been proposed to date. Existing proposals generally restrict in one way or another the classes of programs which can be analyzed. This paper attempts to fill this gap by considering a full dialect of Prolog, essentially the recent ISO standard, pointing out the problems that may arise in the analysis of such a dialect, and proposing a combination of known and novel solutions that together allow the correct analysis of arbitrary programs which use the full power of the language.
Resumo:
The uptake of Linked Data (LD) has promoted the proliferation of datasets and their associated ontologies for describing different domains. Par-ticular LD development characteristics such as agility and web-based architec-ture necessitate the revision, adaption, and lightening of existing methodologies for ontology development. This thesis proposes a lightweight method for ontol-ogy development in an LD context which will be based in data-driven agile de-velopments, existing resources to be reused, and the evaluation of the obtained products considering both classical ontological engineering principles and LD characteristics.
Resumo:
Global data-flow analysis of (constraint) logic programs, which is generally based on abstract interpretation [7], is reaching a comparatively high level of maturity. A natural question is whether it is time for its routine incorporation in standard compilers, something which, beyond a few experimental systems, has not happened to date. Such incorporation arguably makes good sense only if: • the range of applications of global analysis is large enough to justify the additional complication in the compiler, and • global analysis technology can deal with all the features of "practical" languages (e.g., the ISO-Prolog built-ins) and "scales up" for large programs. We present a tutorial overview of a number of concepts and techniques directly related to the issues above, with special emphasis on the first one. In particular, we concéntrate on novel uses of global analysis during program development and debugging, rather than on the more traditional application área of program optimization. The idea of using abstract interpretation for validation and diagnosis has been studied in the context of imperative programming [2] and also of logic programming. The latter work includes issues such as using approximations to reduce the burden posed on programmers by declarative debuggers [6, 3] and automatically generating and checking assertions [4, 5] (which includes the more traditional type checking of strongly typed languages, such as Gódel or Mercury [1, 8, 9]) We also review some solutions for scalability including modular analysis, incremental analysis, and widening. Finally, we discuss solutions for dealing with meta-predicates, side-effects, delay declarations, constraints, dynamic predicates, and other such features which may appear in practical languages. In the discussion we will draw both from the literature and from our experience and that of others in the development and use of the CIAO system analyzer. In order to emphasize the practical aspects of the solutions discussed, the presentation of several concepts will be illustrated by examples run on the CIAO system, which makes extensive use of global analysis and assertions.
Resumo:
System identification deals with the problem of building mathematical models of dynamical systems based on observed data from the system" [1]. In the context of civil engineering, the system refers to a large scale structure such as a building, bridge, or an offshore structure, and identification mostly involves the determination of modal parameters (the natural frequencies, damping ratios, and mode shapes). This paper presents some modal identification results obtained using a state-of-the-art time domain system identification method (data-driven stochastic subspace algorithms [2]) applied to the output-only data measured in a steel arch bridge. First, a three dimensional finite element model was developed for the numerical analysis of the structure using ANSYS. Modal analysis was carried out and modal parameters were extracted in the frequency range of interest, 0-10 Hz. The results obtained from the finite element modal analysis were used to determine the location of the sensors. After that, ambient vibration tests were conducted during April 23-24, 2009. The response of the structure was measured using eight accelerometers. Two stations of three sensors were formed (triaxial stations). These sensors were held stationary for reference during the test. The two remaining sensors were placed at the different measurement points along the bridge deck, in which only vertical and transversal measurements were conducted (biaxial stations). Point estimate and interval estimate have been carried out in the state space model using these ambient vibration measurements. In the case of parametric models (like state space), the dynamic behaviour of a system is described using mathematical models. Then, mathematical relationships can be established between modal parameters and estimated point parameters (thus, it is common to use experimental modal analysis as a synonym for system identification). Stable modal parameters are found using a stabilization diagram. Furthermore, this paper proposes a method for assessing the precision of estimates of the parameters of state-space models (confidence interval). This approach employs the nonparametric bootstrap procedure [3] and is applied to subspace parameter estimation algorithm. Using bootstrap results, a plot similar to a stabilization diagram is developed. These graphics differentiate system modes from spurious noise modes for a given order system. Additionally, using the modal assurance criterion, the experimental modes obtained have been compared with those evaluated from a finite element analysis. A quite good agreement between numerical and experimental results is observed.
Resumo:
During the first decade of the new millennium, fueled by the economic development in Spain, urban bus services were extended. Since the years 2008 and 2009, the root of the economic crisis, the improvement of these services is at risk due to economic problems. In this paper, the technical efficiency of the main urban bus companies in Spain during the 2004–2009 period are studied using SBM (slack-based measures) models and by establishing the slacks in the services' production inputs. The influence of a series of exogenous variables on the operation of the different services is also analyzed. It is concluded that only the 24% of the case studies are efficient, and some urban form variables can explain part of the inefficiency. The methodology used allows studying the inefficiency in a disaggregated way that other DEA (data envelopment analysis) models do not.
Resumo:
Nuestro cerebro contiene cerca de 1014 sinapsis neuronales. Esta enorme cantidad de conexiones proporciona un entorno ideal donde distintos grupos de neuronas se sincronizan transitoriamente para provocar la aparición de funciones cognitivas, como la percepción, el aprendizaje o el pensamiento. Comprender la organización de esta compleja red cerebral en base a datos neurofisiológicos, representa uno de los desafíos más importantes y emocionantes en el campo de la neurociencia. Se han propuesto recientemente varias medidas para evaluar cómo se comunican las diferentes partes del cerebro a diversas escalas (células individuales, columnas corticales, o áreas cerebrales). Podemos clasificarlos, según su simetría, en dos grupos: por una parte, la medidas simétricas, como la correlación, la coherencia o la sincronización de fase, que evalúan la conectividad funcional (FC); mientras que las medidas asimétricas, como la causalidad de Granger o transferencia de entropía, son capaces de detectar la dirección de la interacción, lo que denominamos conectividad efectiva (EC). En la neurociencia moderna ha aumentado el interés por el estudio de las redes funcionales cerebrales, en gran medida debido a la aparición de estos nuevos algoritmos que permiten analizar la interdependencia entre señales temporales, además de la emergente teoría de redes complejas y la introducción de técnicas novedosas, como la magnetoencefalografía (MEG), para registrar datos neurofisiológicos con gran resolución. Sin embargo, nos hallamos ante un campo novedoso que presenta aun varias cuestiones metodológicas sin resolver, algunas de las cuales trataran de abordarse en esta tesis. En primer lugar, el creciente número de aproximaciones para determinar la existencia de FC/EC entre dos o más señales temporales, junto con la complejidad matemática de las herramientas de análisis, hacen deseable organizarlas todas en un paquete software intuitivo y fácil de usar. Aquí presento HERMES (http://hermes.ctb.upm.es), una toolbox en MatlabR, diseñada precisamente con este fin. Creo que esta herramienta será de gran ayuda para todos aquellos investigadores que trabajen en el campo emergente del análisis de conectividad cerebral y supondrá un gran valor para la comunidad científica. La segunda cuestión practica que se aborda es el estudio de la sensibilidad a las fuentes cerebrales profundas a través de dos tipos de sensores MEG: gradiómetros planares y magnetómetros, esta aproximación además se combina con un enfoque metodológico, utilizando dos índices de sincronización de fase: phase locking value (PLV) y phase lag index (PLI), este ultimo menos sensible a efecto la conducción volumen. Por lo tanto, se compara su comportamiento al estudiar las redes cerebrales, obteniendo que magnetómetros y PLV presentan, respectivamente, redes más densamente conectadas que gradiómetros planares y PLI, por los valores artificiales que crea el problema de la conducción de volumen. Sin embargo, cuando se trata de caracterizar redes epilépticas, el PLV ofrece mejores resultados, debido a la gran dispersión de las redes obtenidas con PLI. El análisis de redes complejas ha proporcionado nuevos conceptos que mejoran caracterización de la interacción de sistemas dinámicos. Se considera que una red está compuesta por nodos, que simbolizan sistemas, cuyas interacciones se representan por enlaces, y su comportamiento y topología puede caracterizarse por un elevado número de medidas. Existe evidencia teórica y empírica de que muchas de ellas están fuertemente correlacionadas entre sí. Por lo tanto, se ha conseguido seleccionar un pequeño grupo que caracteriza eficazmente estas redes, y condensa la información redundante. Para el análisis de redes funcionales, la selección de un umbral adecuado para decidir si un determinado valor de conectividad de la matriz de FC es significativo y debe ser incluido para un análisis posterior, se convierte en un paso crucial. En esta tesis, se han obtenido resultados más precisos al utilizar un test de subrogadas, basado en los datos, para evaluar individualmente cada uno de los enlaces, que al establecer a priori un umbral fijo para la densidad de conexiones. Finalmente, todas estas cuestiones se han aplicado al estudio de la epilepsia, caso práctico en el que se analizan las redes funcionales MEG, en estado de reposo, de dos grupos de pacientes epilépticos (generalizada idiopática y focal frontal) en comparación con sujetos control sanos. La epilepsia es uno de los trastornos neurológicos más comunes, con más de 55 millones de afectados en el mundo. Esta enfermedad se caracteriza por la predisposición a generar ataques epilépticos de actividad neuronal anormal y excesiva o bien síncrona, y por tanto, es el escenario perfecto para este tipo de análisis al tiempo que presenta un gran interés tanto desde el punto de vista clínico como de investigación. Los resultados manifiestan alteraciones especificas en la conectividad y un cambio en la topología de las redes en cerebros epilépticos, desplazando la importancia del ‘foco’ a la ‘red’, enfoque que va adquiriendo relevancia en las investigaciones recientes sobre epilepsia. ABSTRACT There are about 1014 neuronal synapses in the human brain. This huge number of connections provides the substrate for neuronal ensembles to become transiently synchronized, producing the emergence of cognitive functions such as perception, learning or thinking. Understanding the complex brain network organization on the basis of neuroimaging data represents one of the most important and exciting challenges for systems neuroscience. Several measures have been recently proposed to evaluate at various scales (single cells, cortical columns, or brain areas) how the different parts of the brain communicate. We can classify them, according to their symmetry, into two groups: symmetric measures, such as correlation, coherence or phase synchronization indexes, evaluate functional connectivity (FC); and on the other hand, the asymmetric ones, such as Granger causality or transfer entropy, are able to detect effective connectivity (EC) revealing the direction of the interaction. In modern neurosciences, the interest in functional brain networks has increased strongly with the onset of new algorithms to study interdependence between time series, the advent of modern complex network theory and the introduction of powerful techniques to record neurophysiological data, such as magnetoencephalography (MEG). However, when analyzing neurophysiological data with this approach several questions arise. In this thesis, I intend to tackle some of the practical open problems in the field. First of all, the increase in the number of time series analysis algorithms to study brain FC/EC, along with their mathematical complexity, creates the necessity of arranging them into a single, unified toolbox that allow neuroscientists, neurophysiologists and researchers from related fields to easily access and make use of them. I developed such a toolbox for this aim, it is named HERMES (http://hermes.ctb.upm.es), and encompasses several of the most common indexes for the assessment of FC and EC running for MatlabR environment. I believe that this toolbox will be very helpful to all the researchers working in the emerging field of brain connectivity analysis and will entail a great value for the scientific community. The second important practical issue tackled in this thesis is the evaluation of the sensitivity to deep brain sources of two different MEG sensors: planar gradiometers and magnetometers, in combination with the related methodological approach, using two phase synchronization indexes: phase locking value (PLV) y phase lag index (PLI), the latter one being less sensitive to volume conduction effect. Thus, I compared their performance when studying brain networks, obtaining that magnetometer sensors and PLV presented higher artificial values as compared with planar gradiometers and PLI respectively. However, when it came to characterize epileptic networks it was the PLV which gives better results, as PLI FC networks where very sparse. Complex network analysis has provided new concepts which improved characterization of interacting dynamical systems. With this background, networks could be considered composed of nodes, symbolizing systems, whose interactions with each other are represented by edges. A growing number of network measures is been applied in network analysis. However, there is theoretical and empirical evidence that many of these indexes are strongly correlated with each other. Therefore, in this thesis I reduced them to a small set, which could more efficiently characterize networks. Within this framework, selecting an appropriate threshold to decide whether a certain connectivity value of the FC matrix is significant and should be included in the network analysis becomes a crucial step, in this thesis, I used the surrogate data tests to make an individual data-driven evaluation of each of the edges significance and confirmed more accurate results than when just setting to a fixed value the density of connections. All these methodologies were applied to the study of epilepsy, analysing resting state MEG functional networks, in two groups of epileptic patients (generalized and focal epilepsy) that were compared to matching control subjects. Epilepsy is one of the most common neurological disorders, with more than 55 million people affected worldwide, characterized by its predisposition to generate epileptic seizures of abnormal excessive or synchronous neuronal activity, and thus, this scenario and analysis, present a great interest from both the clinical and the research perspective. Results revealed specific disruptions in connectivity and network topology and evidenced that networks’ topology is changed in epileptic brains, supporting the shift from ‘focus’ to ‘networks’ which is gaining importance in modern epilepsy research.
Resumo:
We propose a new Bayesian framework for automatically determining the position (location and orientation) of an uncalibrated camera using the observations of moving objects and a schematic map of the passable areas of the environment. Our approach takes advantage of static and dynamic information on the scene structures through prior probability distributions for object dynamics. The proposed approach restricts plausible positions where the sensor can be located while taking into account the inherent ambiguity of the given setting. The proposed framework samples from the posterior probability distribution for the camera position via data driven MCMC, guided by an initial geometric analysis that restricts the search space. A Kullback-Leibler divergence analysis is then used that yields the final camera position estimate, while explicitly isolating ambiguous settings. The proposed approach is evaluated in synthetic and real environments, showing its satisfactory performance in both ambiguous and unambiguous settings.
Resumo:
This paper analyses the relationship between productive efficiency and online-social-networks (OSN) in Spanish telecommunications firms. A data-envelopment-analysis (DEA) is used and several indicators of business ?social Media? activities are incorporated. A super-efficiency analysis and bootstrapping techniques are performed to increase the model?s robustness and accuracy. Then, a logistic regression model is applied to characterise factors and drivers of good performance in OSN. Results reveal the company?s ability to absorb and utilise OSNs as a key factor in improving the productive efficiency. This paper presents a model for assessing the strategic performance of the presence and activity in OSN.
Resumo:
Durante la actividad diaria, la sociedad actual interactúa constantemente por medio de dispositivos electrónicos y servicios de telecomunicaciones, tales como el teléfono, correo electrónico, transacciones bancarias o redes sociales de Internet. Sin saberlo, masivamente dejamos rastros de nuestra actividad en las bases de datos de empresas proveedoras de servicios. Estas nuevas fuentes de datos tienen las dimensiones necesarias para que se puedan observar patrones de comportamiento humano a grandes escalas. Como resultado, ha surgido una reciente explosión sin precedentes de estudios de sistemas sociales, dirigidos por el análisis de datos y procesos computacionales. En esta tesis desarrollamos métodos computacionales y matemáticos para analizar sistemas sociales por medio del estudio combinado de datos derivados de la actividad humana y la teoría de redes complejas. Nuestro objetivo es caracterizar y entender los sistemas emergentes de interacciones sociales en los nuevos espacios tecnológicos, tales como la red social Twitter y la telefonía móvil. Analizamos los sistemas por medio de la construcción de redes complejas y series temporales, estudiando su estructura, funcionamiento y evolución en el tiempo. También, investigamos la naturaleza de los patrones observados por medio de los mecanismos que rigen las interacciones entre individuos, así como medimos el impacto de eventos críticos en el comportamiento del sistema. Para ello, hemos propuesto modelos que explican las estructuras globales y la dinámica emergente con que fluye la información en el sistema. Para los estudios de la red social Twitter, hemos basado nuestros análisis en conversaciones puntuales, tales como protestas políticas, grandes acontecimientos o procesos electorales. A partir de los mensajes de las conversaciones, identificamos a los usuarios que participan y construimos redes de interacciones entre los mismos. Específicamente, construimos una red para representar quién recibe los mensajes de quién y otra red para representar quién propaga los mensajes de quién. En general, hemos encontrado que estas estructuras tienen propiedades complejas, tales como crecimiento explosivo y distribuciones de grado libres de escala. En base a la topología de estas redes, hemos indentificado tres tipos de usuarios que determinan el flujo de información según su actividad e influencia. Para medir la influencia de los usuarios en las conversaciones, hemos introducido una nueva medida llamada eficiencia de usuario. La eficiencia se define como el número de retransmisiones obtenidas por mensaje enviado, y mide los efectos que tienen los esfuerzos individuales sobre la reacción colectiva. Hemos observado que la distribución de esta propiedad es ubicua en varias conversaciones de Twitter, sin importar sus dimensiones ni contextos. Con lo cual, sugerimos que existe universalidad en la relación entre esfuerzos individuales y reacciones colectivas en Twitter. Para explicar los factores que determinan la emergencia de la distribución de eficiencia, hemos desarrollado un modelo computacional que simula la propagación de mensajes en la red social de Twitter, basado en el mecanismo de cascadas independientes. Este modelo nos permite medir el efecto que tienen sobre la distribución de eficiencia, tanto la topología de la red social subyacente, como la forma en que los usuarios envían mensajes. Los resultados indican que la emergencia de un grupo selecto de usuarios altamente eficientes depende de la heterogeneidad de la red subyacente y no del comportamiento individual. Por otro lado, hemos desarrollado técnicas para inferir el grado de polarización política en redes sociales. Proponemos una metodología para estimar opiniones en redes sociales y medir el grado de polarización en las opiniones obtenidas. Hemos diseñado un modelo donde estudiamos el efecto que tiene la opinión de un pequeño grupo de usuarios influyentes, llamado élite, sobre las opiniones de la mayoría de usuarios. El modelo da como resultado una distribución de opiniones sobre la cual medimos el grado de polarización. Aplicamos nuestra metodología para medir la polarización en redes de difusión de mensajes, durante una conversación en Twitter de una sociedad políticamente polarizada. Los resultados obtenidos presentan una alta correspondencia con los datos offline. Con este estudio, hemos demostrado que la metodología propuesta es capaz de determinar diferentes grados de polarización dependiendo de la estructura de la red. Finalmente, hemos estudiado el comportamiento humano a partir de datos de telefonía móvil. Por una parte, hemos caracterizado el impacto que tienen desastres naturales, como innundaciones, sobre el comportamiento colectivo. Encontramos que los patrones de comunicación se alteran de forma abrupta en las áreas afectadas por la catástofre. Con lo cual, demostramos que se podría medir el impacto en la región casi en tiempo real y sin necesidad de desplegar esfuerzos en el terreno. Por otra parte, hemos estudiado los patrones de actividad y movilidad humana para caracterizar las interacciones entre regiones de un país en desarrollo. Encontramos que las redes de llamadas y trayectorias humanas tienen estructuras de comunidades asociadas a regiones y centros urbanos. En resumen, hemos mostrado que es posible entender procesos sociales complejos por medio del análisis de datos de actividad humana y la teoría de redes complejas. A lo largo de la tesis, hemos comprobado que fenómenos sociales como la influencia, polarización política o reacción a eventos críticos quedan reflejados en los patrones estructurales y dinámicos que presentan la redes construidas a partir de datos de conversaciones en redes sociales de Internet o telefonía móvil. ABSTRACT During daily routines, we are constantly interacting with electronic devices and telecommunication services. Unconsciously, we are massively leaving traces of our activity in the service providers’ databases. These new data sources have the dimensions required to enable the observation of human behavioral patterns at large scales. As a result, there has been an unprecedented explosion of data-driven social research. In this thesis, we develop computational and mathematical methods to analyze social systems by means of the combined study of human activity data and the theory of complex networks. Our goal is to characterize and understand the emergent systems from human interactions on the new technological spaces, such as the online social network Twitter and mobile phones. We analyze systems by means of the construction of complex networks and temporal series, studying their structure, functioning and temporal evolution. We also investigate on the nature of the observed patterns, by means of the mechanisms that rule the interactions among individuals, as well as on the impact of critical events on the system’s behavior. For this purpose, we have proposed models that explain the global structures and the emergent dynamics of information flow in the system. In the studies of the online social network Twitter, we have based our analysis on specific conversations, such as political protests, important announcements and electoral processes. From the messages related to the conversations, we identify the participant users and build networks of interactions with them. We specifically build one network to represent whoreceives- whose-messages and another to represent who-propagates-whose-messages. In general, we have found that these structures have complex properties, such as explosive growth and scale-free degree distributions. Based on the topological properties of these networks, we have identified three types of user behavior that determine the information flow dynamics due to their influence. In order to measure the users’ influence on the conversations, we have introduced a new measure called user efficiency. It is defined as the number of retransmissions obtained by message posted, and it measures the effects of the individual activity on the collective reacixtions. We have observed that the probability distribution of this property is ubiquitous across several Twitter conversation, regardlessly of their dimension or social context. Therefore, we suggest that there is a universal behavior in the relationship between individual efforts and collective reactions on Twitter. In order to explain the different factors that determine the user efficiency distribution, we have developed a computational model to simulate the diffusion of messages on Twitter, based on the mechanism of independent cascades. This model, allows us to measure the impact on the emergent efficiency distribution of the underlying network topology, as well as the way that users post messages. The results indicate that the emergence of an exclusive group of highly efficient users depends upon the heterogeneity of the underlying network instead of the individual behavior. Moreover, we have also developed techniques to infer the degree of polarization in social networks. We propose a methodology to estimate opinions in social networks and to measure the degree of polarization in the obtained opinions. We have designed a model to study the effects of the opinions of a small group of influential users, called elite, on the opinions of the majority of users. The model results in an opinions distribution to which we measure the degree of polarization. We apply our methodology to measure the polarization on graphs from the messages diffusion process, during a conversation on Twitter from a polarized society. The results are in very good agreement with offline and contextual data. With this study, we have shown that our methodology is capable of detecting several degrees of polarization depending on the structure of the networks. Finally, we have also inferred the human behavior from mobile phones’ data. On the one hand, we have characterized the impact of natural disasters, like flooding, on the collective behavior. We found that the communication patterns are abruptly altered in the areas affected by the catastrophe. Therefore, we demonstrate that we could measure the impact of the disaster on the region, almost in real-time and without needing to deploy further efforts. On the other hand, we have studied human activity and mobility patterns in order to characterize regional interactions on a developing country. We found that the calls and trajectories networks present community structure associated to regional and urban areas. In summary, we have shown that it is possible to understand complex social processes by means of analyzing human activity data and the theory of complex networks. Along the thesis, we have demonstrated that social phenomena, like influence, polarization and reaction to critical events, are reflected in the structural and dynamical patterns of the networks constructed from data regarding conversations on online social networks and mobile phones.
Resumo:
Purely data-driven approaches for machine learning present difficulties when data are scarce relative to the complexity of the model or when the model is forced to extrapolate. On the other hand, purely mechanistic approaches need to identify and specify all the interactions in the problem at hand (which may not be feasible) and still leave the issue of how to parameterize the system. In this paper, we present a hybrid approach using Gaussian processes and differential equations to combine data-driven modeling with a physical model of the system. We show how different, physically inspired, kernel functions can be developed through sensible, simple, mechanistic assumptions about the underlying system. The versatility of our approach is illustrated with three case studies from motion capture, computational biology, and geostatistics.
Resumo:
Enterprises are increasingly using a wide range of heterogeneous information systems for executing and governing their business activities. Even if the adoption of service orientation has improved loose coupling and reusability, applications are still isolated data silos whose integration requires complex transformations and mediations. However, by leveraging Linked Data principles those data silos can now be seamlessly integrated, and this opens the door to new data-driven approaches for Enterprise Application Integration (EAI). In this paper we present LDP4j, an open souce Java-based framework for the development of interoperable read-write Linked Data applications, based on the W3C Linked Data Platform (LDP) specification.