64 results for Data-driven analysis
Abstract:
Detecting local differences between groups of connectomes is a great challenge in neuroimaging, because of the large number of tests that have to be performed and the resulting burden of multiplicity correction. Any available information should be exploited to increase the power of detecting true between-group effects. We present an adaptive strategy that exploits the data structure and prior information concerning positive dependence between nodes and connections, without relying on strong assumptions. As a first step, we decompose the brain network, i.e., the connectome, into subnetworks and apply a screening at the subnetwork level. The subnetworks are defined either according to prior knowledge or by applying a data-driven algorithm. Given the results of the screening step, a filtering is performed to seek real differences at the node/connection level. The proposed strategy can be used to control either the family-wise error rate or the false discovery rate in the strong sense. We show the benefit of the proposed strategy by means of different simulations, and we present a real application comparing the connectomes of preschool children and adolescents.
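A minimal sketch of a screen-then-filter strategy of the kind described above, on simulated data: edges are grouped into subnetworks, a screening test is run per subnetwork, and edge-level tests are performed only within the subnetworks that pass screening. The subnetwork definition, test statistics and corrections are placeholders, not the authors' exact procedure.

```python
# Generic two-step screen-and-filter illustration for comparing two groups of
# connectomes; subnetworks, tests and corrections are illustrative choices.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_nodes, n_a, n_b = 20, 30, 30
tri = np.triu_indices(n_nodes, k=1)                       # upper-triangle edges
conn_a = rng.normal(size=(n_a, tri[0].size))              # vectorised connectomes, group A
conn_b = rng.normal(size=(n_b, tri[0].size))              # vectorised connectomes, group B

# Hypothetical subnetworks: edges grouped by node community (-1 = between-module).
communities = rng.integers(0, 4, size=n_nodes)
edge_module = np.where(communities[tri[0]] == communities[tri[1]],
                       communities[tri[0]], -1)

alpha = 0.05
selected_edges = []
modules = np.unique(edge_module)
for m in modules:
    idx = np.flatnonzero(edge_module == m)
    # Screening: test the subnetwork's mean connectivity, Bonferroni over modules.
    _, p_screen = ttest_ind(conn_a[:, idx].mean(axis=1), conn_b[:, idx].mean(axis=1))
    if p_screen < alpha / modules.size:
        # Filtering: edge-level tests restricted to the screened subnetwork
        # (Bonferroni here; an FDR procedure could be used instead).
        _, p_edges = ttest_ind(conn_a[:, idx], conn_b[:, idx], axis=0)
        selected_edges.extend(idx[p_edges < alpha / idx.size])

print("edges flagged:", selected_edges)
```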
Abstract:
BACKGROUND: By analyzing human immunodeficiency virus type 1 (HIV-1) pol sequences from the Swiss HIV Cohort Study (SHCS), we explored whether the prevalence of non-B subtypes reflects domestic transmission or migration patterns. METHODS: Swiss non-B sequences and sequences collected abroad were pooled to construct maximum likelihood trees, which were analyzed for Swiss-specific subepidemics (subtrees including ≥80% Swiss sequences, bootstrap >70%; macroscale analysis) or for evidence of domestic transmission (sequence pairs with genetic distance <1.5%, bootstrap ≥98%; microscale analysis). RESULTS: Of 8287 SHCS participants, 1732 (21%) were infected with non-B subtypes, of which A (n = 328), C (n = 272), CRF01_AE (n = 258), and CRF02_AG (n = 285) were studied further. The macroscale analysis revealed that 21% (A), 16% (C), 24% (CRF01_AE), and 28% (CRF02_AG) belonged to Swiss-specific subepidemics. The microscale analysis identified 26 possible transmission pairs: 3 (12%) including only homosexual Swiss men of white ethnicity; 3 (12%) including homosexual white men from Switzerland and partners from foreign countries; and 10 (38%) involving heterosexual white Swiss men and females of different nationality and predominantly nonwhite ethnicity. CONCLUSIONS: Of all non-B infections diagnosed in Switzerland, <25% could be prevented by domestic interventions. Awareness should be raised among immigrants and among Swiss individuals with partners from high-prevalence countries to contain the spread of non-B subtypes.
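A minimal sketch of the microscale criterion quoted above: candidate domestic transmission pairs are retained when their pairwise genetic distance is below 1.5% and the supporting clade has bootstrap support of at least 98%. The sequences, pair names and bootstrap values are invented placeholders; a real analysis would use a dedicated phylogenetics toolkit on the aligned pol fragments.

```python
# Illustrative filter for the "microscale" transmission-pair criterion.
def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences (gaps ignored)."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical aligned fragments and bootstrap support of each candidate pair's clade.
candidates = [
    ("CH_001", "CH_014", "ATGGCTACGT", "ATGGCTACGT", 99),
    ("CH_002", "CH_027", "ATGGCTACGT", "ATGACTTCGT", 99),
    ("CH_003", "CH_031", "ATGGCTACGT", "ATGGCTACCT", 95),
]

pairs = [(a, b) for a, b, s1, s2, boot in candidates
         if p_distance(s1, s2) < 0.015 and boot >= 98]
print("possible transmission pairs:", pairs)
```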
Abstract:
TEXTABLE is a new open-source visual programming tool for the analysis of textual data. The implications of the software's design in terms of interoperability and flexibility are discussed, as well as the question of its suitability for educational use. A brief introduction to the principles of visual programming for textual data analysis is also provided.
Abstract:
This contribution introduces Data Envelopment Analysis (DEA), a performance measurement technique. DEA helps decision makers in the following ways: (1) By calculating an efficiency score, it indicates whether a firm is efficient or has capacity for improvement; (2) By setting target values for inputs and outputs, it calculates how much input must be decreased or output increased for the firm to become efficient; (3) By identifying the nature of returns to scale, it indicates whether a firm has to decrease or increase its scale (or size) in order to minimise average total cost; (4) By identifying a set of benchmarks, it specifies which other firms' processes should be analysed in order to improve the firm's own practices. This contribution presents the essentials of DEA, alongside a case study that provides an intuitive understanding of its application. It also introduces Win4DEAP, a software package that conducts efficiency analysis based on the DEA methodology. The methodological background of DEA is presented for more demanding readers. Finally, four advanced topics of DEA are treated: adjustment to the environment, preferences, sensitivity analysis and time-series data.
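A minimal sketch of an input-oriented, constant-returns-to-scale DEA model (the classic CCR envelopment form), solved as one linear programme per firm with scipy. The two-input, one-output data are invented for illustration; in practice Win4DEAP or a dedicated DEA package would be used.

```python
# Input-oriented CCR DEA: minimise theta subject to an envelopment of inputs and outputs.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 4.0], [3.0, 2.0], [4.0, 5.0], [5.0, 3.0]])  # inputs (firms x inputs)
Y = np.array([[1.0], [1.0], [1.0], [1.0]])                      # outputs (firms x outputs)
n, m = X.shape
s = Y.shape[1]

for o in range(n):
    # Decision variables: [theta, lambda_1, ..., lambda_n].
    c = np.r_[1.0, np.zeros(n)]
    # Inputs:  sum_j lambda_j * x_ij - theta * x_io <= 0
    A_in = np.c_[-X[o], X.T]
    b_in = np.zeros(m)
    # Outputs: -sum_j lambda_j * y_rj <= -y_ro
    A_out = np.c_[np.zeros(s), -Y.T]
    b_out = -Y[o]
    res = linprog(c, A_ub=np.vstack([A_in, A_out]), b_ub=np.r_[b_in, b_out],
                  bounds=[(None, None)] + [(0, None)] * n, method="highs")
    print(f"firm {o}: efficiency = {res.x[0]:.3f}")
```

An efficiency score of 1 indicates a firm on the estimated frontier; scores below 1 quantify how much all inputs could be contracted proportionally while keeping the observed output.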
Abstract:
Uncertainty quantification of petroleum reservoir models is one of the present challenges, and it is usually approached with a wide range of geostatistical tools linked with statistical optimisation and/or inference algorithms. This paper considers a data-driven approach to modelling uncertainty in spatial predictions. The proposed semi-supervised Support Vector Regression (SVR) model has demonstrated its capability to represent realistic features and to describe the stochastic variability and non-uniqueness of spatial properties. It is able to capture and preserve key spatial dependencies such as connectivity, which is often difficult to achieve with two-point geostatistical models. Semi-supervised SVR is designed to integrate various kinds of conditioning data and to learn dependencies from them. A stochastic semi-supervised SVR model is integrated into a Bayesian framework to quantify uncertainty with multiple models fitted to dynamic observations. The developed approach is illustrated with a reservoir case study. The resulting probabilistic production forecasts are described by uncertainty envelopes.
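A minimal sketch of SVR used as a data-driven spatial predictor: property values at known well locations are learned as a function of coordinates and a map is predicted on a grid. The semi-supervised extension and the Bayesian uncertainty quantification described above are not reproduced here; the data are synthetic.

```python
# Plain (supervised) SVR as a spatial interpolator on synthetic conditioning data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
wells = rng.uniform(0, 1, size=(50, 2))                        # conditioning locations (x, y)
values = (np.sin(3 * wells[:, 0]) + 0.5 * wells[:, 1]
          + rng.normal(scale=0.05, size=50))                   # synthetic reservoir property

model = SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma=5.0).fit(wells, values)

gx, gy = np.meshgrid(np.linspace(0, 1, 25), np.linspace(0, 1, 25))
grid = np.column_stack([gx.ravel(), gy.ravel()])
prediction = model.predict(grid).reshape(gx.shape)             # predicted spatial map
print(prediction.shape)
```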
Abstract:
Among the types of remote sensing acquisitions, optical images are certainly one of the most widely relied-upon data sources for Earth observation. They provide detailed measurements of the electromagnetic radiation reflected or emitted by each pixel in the scene. Through a process termed supervised land-cover classification, this makes it possible to distinguish objects at the surface of our planet automatically yet accurately. In this respect, when producing a land-cover map of the surveyed area, the availability of training examples representative of each thematic class is crucial for the success of the classification procedure. However, in real applications, due to several constraints on the sample collection process, labeled pixels are usually scarce. When analyzing an image for which those key samples are unavailable, a viable solution is to resort to the ground truth data of other previously acquired images. This option is attractive, but several factors such as atmospheric, ground and acquisition conditions can cause radiometric differences between the images, thereby hindering the transfer of knowledge from one image to another. The goal of this Thesis is to supply remote sensing image analysts with suitable processing techniques to ensure a robust portability of classification models across different images. The ultimate purpose is to map the land-cover classes over large spatial and temporal extents with minimal ground information. To overcome, or simply quantify, the observed shifts in the statistical distribution of the spectra of the materials, we study four approaches drawn from the field of machine learning. First, we propose a strategy to intelligently sample the image of interest so as to collect labels only for the most useful pixels. This iterative routine is based on continually evaluating how pertinent the initial training data, which actually belong to a different image, are to the new image. Second, an approach is presented that reduces the radiometric differences among the images by projecting the respective pixels into a common new data space. We analyze a kernel-based feature extraction framework suited for such problems, showing that, after this relative normalization, the cross-image generalization abilities of a classifier are greatly increased. Third, we test a new data-driven measure of distance between probability distributions to assess the distortions caused by differences in acquisition geometry affecting series of multi-angle images. We also gauge the portability of classification models through the sequences. In both exercises, the efficacy of classic physically and statistically based normalization methods is discussed. Finally, we explore a new family of approaches based on sparse representations of the samples to reciprocally convert the data spaces of two images. The projection function bridging the images allows the synthesis of new pixels with more similar characteristics, ultimately facilitating land-cover mapping across images.
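A minimal sketch of one data-driven way to quantify the shift between the spectral distributions of two images: the kernel maximum mean discrepancy (MMD) between their pixel samples. The thesis's own distance measure and normalization pipeline are not reproduced here; the "source" and "target" pixels below are synthetic stand-ins for two acquisitions with a radiometric shift.

```python
# Kernel MMD as an example of a data-driven distance between pixel distributions.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(source, target, gamma=1.0):
    """Squared maximum mean discrepancy between two samples under an RBF kernel."""
    k_ss = rbf_kernel(source, source, gamma=gamma)
    k_tt = rbf_kernel(target, target, gamma=gamma)
    k_st = rbf_kernel(source, target, gamma=gamma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

rng = np.random.default_rng(2)
source = rng.normal(loc=0.0, scale=1.0, size=(500, 4))   # spectra of image 1
target = rng.normal(loc=0.4, scale=1.2, size=(500, 4))   # radiometrically shifted image 2

print(f"squared MMD before any normalization: {mmd2(source, target):.4f}")
```

A relative normalization or feature extraction step that succeeds in aligning the two images should drive such a distance toward the value obtained between two samples of the same image.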
Abstract:
The distribution of mitochondrial control region-sequence polymorphism was investigated in 15 populations of Crocidura russula along an altitudinal gradient in western Switzerland. High-altitude populations are smaller, sparser and appear to undergo frequent bottlenecks. Accordingly, they showed a loss of rare haplotypes, but unexpectedly, were less differentiated than lowland populations. Furthermore, the major haplotypes segregated significantly with altitude. The results were inconsistent with a simple model of drift and dispersal. They suggested instead a role for historical patterns of colonization, or, alternatively, present-day selective forces acting on one of the mitochondrial genes involved in metabolic pathways.
Abstract:
Measuring school efficiency is a challenging task. First, a performance measurement technique has to be selected. Within Data Envelopment Analysis (DEA), one such technique, alternative models have been developed in order to deal with environmental variables. The majority of these models lead to diverging results. Second, the choice of input and output variables to be included in the efficiency analysis is often dictated by data availability. The choice of variables remains an issue even when data are available. As a result, the choice of technique, model and variables is probably, and ultimately, a political judgement. Multi-criteria decision analysis methods can help decision makers to select the most suitable model. The number of selection criteria should remain parsimonious and should not be oriented towards the results of the models, in order to avoid opportunistic behaviour. The selection criteria should also be backed by the literature or by an expert group. Once the most suitable model is identified, the principle of permanence of methods should be applied in order to avoid a change of practices over time. Within DEA, the two-stage model developed by Ray (1991) is the most convincing model allowing for an environmental adjustment. In this model, an efficiency analysis is conducted with DEA, followed by an econometric analysis to explain the efficiency scores. An environmental variable of particular interest, tested in this thesis, is whether a school operates on multiple sites. Results show that being located on more than one site has a negative influence on efficiency. A likely way to mitigate this negative influence would be to improve the use of ICT in school management and teaching. The planning of new schools should also consider the advantages of a single site, which allows a critical size to be reached in terms of pupils and teachers. The fact that underprivileged pupils perform worse than privileged pupils has been public knowledge since Coleman et al. (1966). As a result, underprivileged pupils have a negative influence on school efficiency. This is confirmed by this thesis for the first time in Switzerland. Several countries have developed priority education policies in order to compensate for the negative impact of disadvantaged socioeconomic status on school performance. These policies have failed. As a result, other actions need to be taken. In order to define these actions, one has to identify the social-class differences which explain why disadvantaged children underperform. Childrearing and literacy practices, health characteristics, housing stability and economic security influence pupil achievement. Rather than allocating more resources to schools, policymakers should therefore focus on related social policies. For instance, they could define pre-school, family, health, housing and benefits policies in order to improve the conditions for disadvantaged children.
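A minimal sketch of the second stage of a Ray (1991)-style two-stage analysis: efficiency scores, which could come from a DEA computation such as the earlier sketch, are regressed on environmental variables, for example an indicator for schools operating on multiple sites. The variable names and data below are synthetic and purely illustrative.

```python
# Second-stage regression explaining (synthetic) DEA efficiency scores with
# environmental variables; a negative coefficient suggests an efficiency penalty.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_schools = 60
multi_site = rng.integers(0, 2, size=n_schools)                 # 1 = operates on several sites
share_underprivileged = rng.uniform(0, 0.6, size=n_schools)     # socioeconomic mix proxy
efficiency = (0.9 - 0.05 * multi_site - 0.2 * share_underprivileged
              + rng.normal(scale=0.03, size=n_schools))         # placeholder DEA scores

X = sm.add_constant(np.column_stack([multi_site, share_underprivileged]))
second_stage = sm.OLS(efficiency, X).fit()
print(second_stage.params)   # [intercept, multi-site effect, underprivileged-share effect]
```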
Abstract:
In this paper, we develop a data-driven methodology to characterize the likelihood of orographic precipitation enhancement using sequences of weather radar images and a digital elevation model (DEM). Geographical locations whose topographic characteristics favor repeatable and persistent orographic precipitation, such as stationary cells, upslope rainfall enhancement, and repeated convective initiation, are detected by analyzing the spatial distribution of a set of precipitation cells extracted from radar imagery. Topographic features such as terrain convexity and gradients computed from the DEM at multiple spatial scales, as well as velocity fields estimated from sequences of weather radar images, are used as explanatory factors to describe the occurrence of localized precipitation enhancement. The latter is represented as a binary process by defining a threshold on the number of cell occurrences at particular locations. Both two-class and one-class support vector machine classifiers are tested to separate the presumed orographic cells from the nonorographic ones in the space of contributing topographic and flow features. Site-based validation is carried out to estimate realistic generalization skills of the obtained spatial prediction models. Owing to the high class separability, the decision function of the classifiers can be interpreted as a likelihood or susceptibility of orographic precipitation enhancement. The developed approach can serve as a basis for refining radar-based quantitative precipitation estimates and short-term forecasts, or for generating stochastic precipitation ensembles conditioned on the local topography.
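A minimal sketch of the one-class variant mentioned above: a one-class SVM is fitted on the topographic and flow features of locations with repeated cell occurrences, and its decision function is read as a susceptibility score over all locations. The features (convexity, gradient, flow components) and their values are synthetic placeholders for the DEM- and radar-derived predictors.

```python
# One-class SVM susceptibility map over topographic/flow feature space.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
# Columns: terrain convexity, terrain gradient, u-flow, v-flow (illustrative units).
orographic = rng.normal(loc=[1.0, 0.8, 3.0, 1.0], scale=0.3, size=(200, 4))   # presumed orographic sites
everywhere = rng.normal(loc=[0.0, 0.2, 1.0, 0.0], scale=1.0, size=(1000, 4))  # all candidate locations

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(orographic)
susceptibility = clf.decision_function(everywhere)   # higher = more "orographic-like"
print(susceptibility[:5])
```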
Abstract:
Following recent technological advances, digital image archives have grown both qualitatively and quantitatively at an unprecedented rate. Despite the enormous possibilities they offer, these advances raise new questions about the processing of the masses of data acquired. This question is at the core of this Thesis: problems of processing digital information at very high spatial and/or spectral resolution are addressed using statistical learning approaches, namely kernel methods. This Thesis studies image classification problems, that is, the categorization of pixels into a reduced number of classes reflecting the spectral and contextual properties of the objects they represent. The emphasis is placed on the efficiency of the algorithms as well as on their simplicity, so as to increase their potential for adoption by users. Moreover, the challenge of this Thesis is to remain close to the concrete problems of satellite-image users without losing sight of the interest of the proposed methods for the machine learning community from which they originate. In this sense, this work is deliberately transdisciplinary, maintaining a strong link between the two fields in all the developments proposed. Four models are proposed: the first addresses the problem of high dimensionality and data redundancy with a model that optimizes classification performance by adapting to the particularities of the image. This is made possible by a ranking of the variables (the bands) that is optimized jointly with the base model: in this way, only the variables relevant to solving the problem are used by the classifier. The lack of labeled information and the uncertainty about its relevance to the problem motivate the following two models, based respectively on active learning and semi-supervised methods: the former improves the quality of a training set through direct interaction between the user and the machine, whereas the latter uses unlabeled pixels to improve the description of the available data and the robustness of the model. Finally, the last model considers the more theoretical question of structure among the outputs: integrating this source of information, never considered before in remote sensing, opens new research challenges.
Advanced kernel methods for remote sensing image classification. Devis Tuia, Institut de Géomatique et d'Analyse du Risque, September 2009.
Abstract: The technical developments of recent years have brought the quantity and quality of digital information to an unprecedented level, as enormous archives of satellite images are available to users. However, even if these advances open more and more possibilities in the use of digital imagery, they also raise several problems of storage and processing. The latter is considered in this Thesis: the processing of very high spatial and spectral resolution images is treated with approaches based on data-driven algorithms relying on kernel methods. In particular, the problem of image classification, i.e., the categorization of the image's pixels into a reduced number of classes reflecting spectral and contextual properties, is studied through the different models presented. The emphasis is placed on algorithmic efficiency and on the simplicity of the proposed approaches, to avoid overly complex models that would not be adopted by users. The major challenge of the Thesis is to remain close to concrete remote sensing problems without losing the methodological interest from the machine learning viewpoint: in this sense, this work aims at building a bridge between the machine learning and remote sensing communities, and all the models proposed have been developed keeping in mind the need for such a synergy. Four models are proposed: first, an adaptive model learning the relevant image features is proposed to solve the problem of high dimensionality and collinearity of the image features. This model automatically provides an accurate classifier and a ranking of the relevance of the individual features. The scarcity and unreliability of labeled information are the common root of the second and third models: when confronted with such problems, the user can either construct the labeled set iteratively by direct interaction with the machine or use the unlabeled data to increase the robustness and quality of the data description. Both solutions have been explored, resulting in two methodological contributions based respectively on active learning and semi-supervised learning. Finally, the more theoretical issue of structured outputs is considered in the last model, which, by integrating output similarity into the model, opens new challenges and opportunities for remote sensing image processing.
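A minimal sketch of the kind of active learning loop discussed above: an SVM is retrained while the pixels closest to its decision boundary (smallest margin) are queried for labels. The query heuristic, kernel and synthetic data are illustrative only; the thesis explores several such strategies.

```python
# Margin-based active learning loop with an SVM on synthetic "pixel" features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, size=(500, 3)), rng.normal(1, 1, size=(500, 3))])
y = np.r_[np.zeros(500), np.ones(500)]              # ground truth, playing the role of the user

labeled = (list(rng.choice(500, size=5, replace=False))
           + list(500 + rng.choice(500, size=5, replace=False)))   # small seed set, both classes
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):                                   # five query iterations
    clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X[labeled], y[labeled])
    margins = np.abs(clf.decision_function(X[pool]))
    query = [pool[i] for i in np.argsort(margins)[:10]]   # most uncertain pixels
    labeled.extend(query)                             # the "user" provides their labels
    pool = [i for i in pool if i not in query]

print("training set size after active learning:", len(labeled))
```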
Abstract:
Introduction: Multimorbidity (MM) is currently a major health concern for hospitalized patients, but little is known about the relative importance of MM in the general population. Accordingly, we assessed whether MM could be a good predictor of overall mortality. Method: Data from the population-based CoLaus Study: 3239 participants (1731 women, mean age 50 +/- 9 years) followed for a median time of 5.4 years (range 0.4 to 8.5 years). MM was defined as presenting ≥2 morbidities according to Barnett et al. (27 items, measured data). Survival analysis was conducted using Cox regression. Results: During follow-up, 53 (1.6%) participants died. Participants who died had a higher number of morbidities (2.4 +/- 1.6 vs. 1.9 +/- 1.5, p<0.05) and a higher prevalence of MM (69.8% vs. 55.9%, p<0.05). In bivariate analysis, the presence of MM (defined as a yes/no variable) was significantly associated with overall mortality: relative risk (RR) of 1.84, 95% confidence interval [1.02; 3.31], p<0.05, but this association became non-significant after adjusting for age, gender and smoking: RR = 1.68 [0.93; 3.04], p = 0.09. Similar results were obtained when using the number of morbidities: RR per additional morbidity 1.22 [1.05; 1.44], p<0.02; after adjusting for age, gender and smoking, RR = 1.16 [0.99; 1.37], p = 0.07. Conclusion: Over a short 5-year observation period, measured MM in the general population was associated with overall mortality. This association became borderline significant after multivariate adjustment. These observations will have to be confirmed over a longer follow-up period. The increased mortality in MM patients may require developing specific screening and prevention strategies.
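A minimal sketch of the survival analysis described above: a Cox proportional hazards model relating multimorbidity to mortality, fitted with and without adjustment for age, sex and smoking. The data frame is synthetic and only mimics the structure of the CoLaus variables; the lifelines package is assumed to be available.

```python
# Crude and adjusted Cox models on synthetic follow-up data.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(6)
n = 3239
df = pd.DataFrame({
    "follow_up_years": rng.uniform(0.4, 8.5, size=n),
    "died": rng.binomial(1, 0.016, size=n),
    "multimorbidity": rng.binomial(1, 0.56, size=n),   # >=2 morbidities, yes/no
    "age": rng.normal(50, 9, size=n),
    "female": rng.binomial(1, 0.53, size=n),
    "smoker": rng.binomial(1, 0.25, size=n),
})

crude = CoxPHFitter().fit(df[["follow_up_years", "died", "multimorbidity"]],
                          duration_col="follow_up_years", event_col="died")
adjusted = CoxPHFitter().fit(df, duration_col="follow_up_years", event_col="died")
crude.print_summary()      # exp(coef) column gives the crude hazard ratio for multimorbidity
adjusted.print_summary()   # hazard ratio adjusted for age, sex and smoking
```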
Abstract:
Remorins (REMs) are proteins of unknown function specific to vascular plants. We have used imaging and biochemical approaches and in situ labeling to demonstrate that REM clusters at plasmodesmata and in approximately 70-nm membrane domains, similar to lipid rafts, in the cytosolic leaflet of the plasma membrane. From a manipulation of REM levels in transgenic tomato (Solanum lycopersicum) plants, we show that Potato virus X (PVX) movement is inversely related to REM accumulation. We show that REM can interact physically with the movement protein TRIPLE GENE BLOCK PROTEIN1 from PVX. Based on the localization of REM and its impact on virus macromolecular trafficking, we discuss the potential for lipid rafts to act as functional components in plasmodesmata and the plasma membrane.
Abstract:
We investigate the relevance of morphological operators for the classification of land use in urban scenes using submetric panchromatic imagery. A support vector machine is used for the classification. Six types of filters have been employed: opening and closing, opening and closing by reconstruction, and opening and closing top-hat. The type and scale of the filters are discussed, and a feature selection algorithm called recursive feature elimination is applied to decrease the dimensionality of the input data. The analysis, performed on two QuickBird panchromatic images, showed that simple opening and closing operators are the most relevant for classification at such a high spatial resolution. Moreover, mixed sets combining simple and reconstruction filters provided the best performance. Tests performed on both images, which cover areas characterized by different architectural styles, yielded similar results for both feature selection and classification accuracy, suggesting that the highlighted feature sets generalize well.
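A minimal sketch of such a feature pipeline: multiscale morphological filters are applied to a panchromatic image, each pixel becomes a feature vector, and recursive feature elimination (RFE) with a linear SVM ranks the filters. The test image and the "labels" (a simple brightness threshold) are stand-ins for QuickBird data and real land-use ground truth.

```python
# Morphological features + RFE-SVM filter ranking on a sample image.
import numpy as np
from skimage.data import camera
from skimage.morphology import disk, opening, closing, white_tophat
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

image = camera().astype(float) / 255.0
features, names = [], []
for radius in (3, 7, 11):                     # filter scales (structuring-element radii)
    se = disk(radius)
    for name, op in (("open", opening), ("close", closing), ("tophat", white_tophat)):
        features.append(op(image, se).ravel())
        names.append(f"{name}_{radius}")
X = np.column_stack(features)
y = (image.ravel() > image.mean()).astype(int)        # placeholder land-use labels

idx = np.random.default_rng(7).choice(len(y), size=2000, replace=False)  # subsample pixels
rfe = RFE(SVC(kernel="linear"), n_features_to_select=4).fit(X[idx], y[idx])
print([n for n, keep in zip(names, rfe.support_) if keep])                # retained filters
```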
Abstract:
BACKGROUND: PCR has the potential to detect and precisely quantify specific DNA sequences, but it is not yet often used as a fully quantitative method. A number of data collection and processing strategies have been described for the implementation of quantitative PCR. However, they can be experimentally cumbersome, their relative performances have not been evaluated systematically, and they often remain poorly validated statistically and/or experimentally. In this study, we evaluated the performance of known methods and compared them with newly developed data processing strategies in terms of resolution, precision and robustness. RESULTS: Our results indicate that simple methods that do not rely on an estimation of the efficiency of the PCR amplification may provide reproducible and sensitive data, but that they do not quantify DNA with precision. Other evaluated methods, based on sigmoidal or exponential curve fitting, were generally of both poor resolution and poor precision. A statistical analysis of the parameters that influence efficiency indicated that it depends mostly on the selected amplicon and, to a lesser extent, on the particular biological sample analyzed. Thus, we devised various strategies based on individual or averaged efficiency values, which were used to assess the regulated expression of several genes in response to a growth factor. CONCLUSION: Overall, qPCR data analysis methods differ significantly in their performance, and this analysis identifies methods that provide DNA quantification estimates of high precision, robustness and reliability. These methods allow reliable estimations of relative expression ratios of two-fold or higher, and our analysis provides an estimate of the number of biological samples that have to be analyzed to achieve a given precision.
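A minimal sketch of an efficiency-corrected relative expression ratio of the general kind evaluated above: amplicon-specific efficiencies are combined with the quantification-cycle shifts of a target and a reference gene between a treated and a control sample. The efficiencies and Cq values are invented; the study itself compares several ways of estimating the efficiencies that enter such a formula.

```python
# Efficiency-corrected relative expression ratio (Pfaffl-type), with placeholder values.
def expression_ratio(e_target, e_ref, cq_target_control, cq_target_treated,
                     cq_ref_control, cq_ref_treated):
    """Target-gene fold change divided by reference-gene fold change."""
    return (e_target ** (cq_target_control - cq_target_treated)
            / e_ref ** (cq_ref_control - cq_ref_treated))

# Hypothetical averaged efficiencies (2.0 = perfect doubling per cycle) and Cq values.
ratio = expression_ratio(e_target=1.92, e_ref=1.98,
                         cq_target_control=24.8, cq_target_treated=22.9,
                         cq_ref_control=21.1, cq_ref_treated=21.0)
print(f"relative expression ratio: {ratio:.2f}")
```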
Abstract:
OBJECTIVES: To describe variations in the utilization of dental services by persons aged 50+ from 14 European countries and to identify the extent to which such variations are attributable to differences in oral health need and in the accessibility of dental care. METHODS: We use data from the Survey of Health, Ageing, and Retirement in Europe (SHARE Waves 2 and 3) and estimate a series of multivariate logistic regression models to analyze variations in dental service utilization (overall dental attendance, preventive and/or operative treatment, dental attendance in early life years). RESULTS: Overall dental attendance and the incidence of solely preventive treatment are comparatively high in the Netherlands, Sweden, Denmark, Germany, and Switzerland. In contrast, overall dental attendance is relatively low in Spain, Italy, France, Greece, Poland, and Ireland. Moreover, a high incidence of solely operative treatment is observed in Austria, Italy, and France, whereas in the Netherlands, Sweden, Denmark, Switzerland, and Ireland, the incidence of solely operative treatment is comparatively low. By and large, these variations persist even when controlling for cross-country differences in oral health need and in the accessibility of dental care. CONCLUSIONS: In comparison with other European regions, there is a tendency toward more frequent and more preventive dental treatment of the elderly populations residing in Scandinavia and Western Europe. Such utilization patterns appear only partially attributable to differences in need for and accessibility of dental care.
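A minimal sketch of the kind of multivariate logistic regression used above: dental attendance is modelled as a function of country, an oral health need proxy and an accessibility proxy, and the exponentiated coefficients are read as odds ratios. The variable names and data are synthetic placeholders for the SHARE survey items.

```python
# Logistic regression of dental attendance on country and need/accessibility proxies.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 2000
df = pd.DataFrame({
    "attended_dentist": rng.binomial(1, 0.6, size=n),
    "country": rng.choice(["CH", "DE", "IT", "ES"], size=n),
    "n_missing_teeth": rng.poisson(5, size=n),                 # oral health need proxy
    "supplementary_insurance": rng.binomial(1, 0.4, size=n),   # accessibility proxy
})

model = smf.logit(
    "attended_dentist ~ C(country) + n_missing_teeth + supplementary_insurance",
    data=df).fit(disp=False)
print(np.exp(model.params))   # odds ratios relative to the reference country
```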