919 results for GRAPHICAL LASSO
Abstract:
Graphical displays which show inter-sample distances are important for the interpretation and presentation of multivariate data. Except when the displays are two-dimensional, however, they are often difficult to visualize as a whole. A device, based on multidimensional unfolding, is described for presenting some intrinsically high-dimensional displays in fewer, usually two, dimensions. This goal is achieved by representing each sample by a pair of points, say $R_i$ and $r_i$, so that a theoretical distance between the $i$-th and $j$-th samples is represented twice, once by the distance between $R_i$ and $r_j$ and once by the distance between $R_j$ and $r_i$. Self-distances between $R_i$ and $r_i$ need not be zero. The mathematical conditions for unfolding to exhibit symmetry are established. Algorithms for finding approximate fits, not constrained to be symmetric, are discussed and some examples are given.
Abstract:
In order to interpret the biplot it is necessary to know which points (usually variables) are the important contributors to the solution, and this information is available separately as part of the biplot's numerical results. We propose a new scaling of the display, called the contribution biplot, which incorporates this diagnostic directly into the graphical display, showing visually the important contributors and thus facilitating the biplot interpretation and often simplifying the graphical representation considerably. The contribution biplot can be applied to a wide variety of analyses such as correspondence analysis, principal component analysis, log-ratio analysis and the graphical results of a discriminant analysis/MANOVA, in fact to any method based on the singular-value decomposition. In the contribution biplot one set of points, usually the rows of the data matrix, optimally represents the spatial positions of the cases or sample units, according to some distance measure that usually incorporates some form of standardization unless all data are comparable in scale. The other set of points, usually the columns, is represented by vectors that are related to their contributions to the low-dimensional solution. A fringe benefit is that usually only one common scale for row and column points is needed on the principal axes, thus avoiding the problem of enlarging or contracting the scale of one set of points to make the biplot legible. Furthermore, this version of the biplot also solves the problem in correspondence analysis of low-frequency categories that are located on the periphery of the map, giving the false impression that they are important, when they are in fact contributing minimally to the solution.
Abstract:
This paper analyses and discusses arguments that emerge from a recent discussion about the proper assessment of the evidential value of correspondences observed between the characteristics of a crime stain and those of a sample from a suspect when (i) this latter individual is found as a result of a database search and (ii) remaining database members are excluded as potential sources (because of different analytical characteristics). Using a graphical probability approach (i.e., Bayesian networks), the paper here intends to clarify that there is no need to (i) introduce a correction factor equal to the size of the searched database (i.e., to reduce a likelihood ratio), nor to (ii) adopt a propositional level not directly related to the suspect matching the crime stain (i.e., a proposition of the kind 'some person in (outside) the database is the source of the crime stain' rather than 'the suspect (some other person) is the source of the crime stain'). The present research thus confirms existing literature on the topic that has repeatedly demonstrated that the latter two requirements (i) and (ii) should not be a cause of concern.
Abstract:
Statistical computing when input/output is driven by a Graphical User Interface is considered. A proposal is made for automatic control of computational flow to ensure that only strictly required computations are actually carried out. The computational flow is modeled by a directed graph for implementation in any object-oriented programming language with symbolic manipulation capabilities. A complete implementation example is presented to compute and display frequency-based piecewise linear density estimators such as histograms or frequency polygons.
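The idea of a directed graph that recomputes only what is strictly required can be illustrated with a minimal Python sketch. The `Node` class, its methods and the `histogram` helper are hypothetical names chosen for this illustration and are not taken from the paper, whose own implementation is not described in the abstract. Each node caches its value and is marked stale only when something upstream changes.

```python
class Node:
    """A value in the computational flow; computed nodes cache their result."""

    def __init__(self, func=None, inputs=(), value=None):
        self.func = func                 # None for a source node (user input)
        self.inputs = list(inputs)
        self.dependents = []
        self.value = value
        self.stale = func is not None    # computed nodes start uncomputed
        self.evaluations = 0             # counts how often func actually ran
        for node in self.inputs:
            node.dependents.append(self)

    def set(self, value):
        """Change a source value (e.g. from a GUI control) and flag downstream nodes."""
        self.value = value
        for node in self.dependents:
            node.invalidate()

    def invalidate(self):
        # A stale node always has stale dependents, so recursion can stop early.
        if not self.stale:
            self.stale = True
            for node in self.dependents:
                node.invalidate()

    def get(self):
        """Return the value, recomputing only if an upstream change made it stale."""
        if self.stale:
            self.value = self.func(*(n.get() for n in self.inputs))
            self.evaluations += 1
            self.stale = False
        return self.value


def histogram(sample, bin_width):
    """Bin counts keyed by bin index (assumed helper for the example)."""
    bins = {}
    for x in sample:
        index = int(x // bin_width)
        bins[index] = bins.get(index, 0) + 1
    return bins


data = Node(value=[0.5, 1.2, 1.7, 2.4, 2.6, 3.9])
width = Node(value=1.0)
counts = Node(histogram, [data, width])
mean = Node(lambda xs: sum(xs) / len(xs), [data])
```

Calling `width.set(2.0)` after both results have been displayed marks only `counts` as stale; a later `mean.get()` returns the cached value without recomputation, because the mean does not depend on the bin width.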
Abstract:
A tool for user choice of the local bandwidth function for a kernel density estimate is developed using KDE, a graphical object-oriented package for interactive kernel density estimation written in LISP-STAT. The bandwidth function is a cubic spline, whose knots are manipulated by the user in one window, while the resulting estimate appears in another window. A real data illustration of this method raises concerns, because an extremely large family of estimates is available.
Abstract:
Any electoral system has an electoral formula that converts vote proportions into parliamentary seats. Pre-electoral polls usually focus on estimating vote proportions and then applying the electoral formula to give a forecast of the parliament's composition. We here describe the problems arising from this approach: there is always a bias in the forecast. We study the origin of the bias and some methods to evaluate and to reduce it. We propose some rules to compute the sample size required for a given forecast accuracy. We show by Monte Carlo simulation the performance of the proposed methods using data from Spanish elections in recent years. We also propose graphical methods to visualize how electoral formulae and parliamentary forecasts work (or fail).
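A small Python sketch can make the source of the bias concrete. The abstract does not name a specific electoral formula; the example assumes the D'Hondt rule (the formula used for the Spanish Congress of Deputies), and the function names `dhondt` and `mean_forecast` as well as the vote shares are made up for illustration. Because seat allocation is a step function of the vote shares, the average forecast over many simulated polls need not equal the allocation obtained from the true shares.

```python
import random
from collections import Counter


def dhondt(votes, seats):
    """Allocate seats with the D'Hondt highest-averages rule."""
    alloc = [0] * len(votes)
    for _ in range(seats):
        # The next seat goes to the party with the largest quotient v / (s + 1).
        winner = max(range(len(votes)), key=lambda i: votes[i] / (alloc[i] + 1))
        alloc[winner] += 1
    return alloc


def mean_forecast(true_shares, seats, sample_size, n_polls=2000, seed=0):
    """Average seat forecast over simulated polls of a given sample size."""
    rng = random.Random(seed)
    parties = range(len(true_shares))
    totals = [0] * len(true_shares)
    for _ in range(n_polls):
        # One simulated poll: sample_size respondents drawn from the true shares.
        answers = Counter(rng.choices(parties, weights=true_shares, k=sample_size))
        poll_votes = [answers[i] for i in parties]
        for i, won in enumerate(dhondt(poll_votes, seats)):
            totals[i] += won
    return [t / n_polls for t in totals]


if __name__ == "__main__":
    shares = [0.45, 0.35, 0.15, 0.05]   # hypothetical true vote shares
    print("seats from true shares:", dhondt(shares, 8))
    print("mean poll forecast    :", mean_forecast(shares, 8, sample_size=500))
```

Comparing the two printed lines for several values of `sample_size` gives a rough picture of how the forecast bias depends on the poll size.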
Abstract:
This final degree project (TFC) is a study of graphical password systems, written as an introduction that explains how they work. Finally, a real application of the password system is developed so that tests can be run with users, which will provide objective data.
Abstract:
This work proposes novel network analysis techniques for multivariate time series. We define the network of a multivariate time series as a graph where vertices denote the components of the process and edges denote non-zero long-run partial correlations. We then introduce a two-step LASSO procedure, called NETS, to estimate high-dimensional sparse Long Run Partial Correlation networks. This approach is based on a VAR approximation of the process and allows the long-run linkages to be decomposed into the contributions of the dynamic and contemporaneous dependence relations of the system. The large-sample properties of the estimator are analysed and we establish conditions for consistent selection and estimation of the non-zero long-run partial correlations. The methodology is illustrated with an application to a panel of U.S. blue chips.
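The decomposition mentioned in this abstract can be written out for a VAR($p$) process. The notation below ($A_k$, $\Sigma_\varepsilon$, $C$, $G$, $K_L$) is assumed for this sketch and is not taken from the abstract; it follows from the standard expression for the long-run covariance of a stable VAR.

```latex
% VAR(p) approximation with innovation covariance \Sigma_\varepsilon
y_t = \sum_{k=1}^{p} A_k \, y_{t-k} + \varepsilon_t ,
\qquad \operatorname{Var}(\varepsilon_t) = \Sigma_\varepsilon ,
\qquad C = \Sigma_\varepsilon^{-1} ,
\qquad G = I - \sum_{k=1}^{p} A_k .

% Long-run covariance and its inverse (the long-run concentration matrix)
\Sigma_L = G^{-1} \, \Sigma_\varepsilon \, (G^{-1})^{\top} ,
\qquad
K_L = \Sigma_L^{-1} = G^{\top} C \, G .

% Long-run partial correlation between components i and j
\rho^{L}_{ij} = - \frac{[K_L]_{ij}}{\sqrt{[K_L]_{ii} \, [K_L]_{jj}}} ,
\qquad i \neq j .
```

Under these assumptions an edge between $i$ and $j$ is present exactly when $[K_L]_{ij} \neq 0$, and the factorisation $K_L = G^{\top} C \, G$ separates the dynamic part ($G$, built from the autoregressive matrices) from the contemporaneous part ($C$, the concentration matrix of the innovations).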
Abstract:
The aim of this project is to develop a Matlab-based software application that calculates the surface-wave radio coverage of broadcast radio stations in the Medium Wave (MW) band anywhere in the world. Also, given the location of a transmitting and a receiving station, the software should be able to calculate the electric field that the receiver should receive at that specific site. In the case of several transmitters, the program should search for the existence of Inter-Symbol Interference and calculate the field strength accordingly. The application should ask for the configuration parameters of the transmitting radio station through a Graphical User Interface (GUI) and return the resulting coverage overlaid on a map of the area under study. Several conductivity databases of different countries and a high-resolution elevation database (GLOBE) were used in the development of this project. To calculate the field strength due to groundwave propagation, the ITU GRWAVE program was used, which must be integrated into a Matlab interface so that the application developed can use it.
Abstract:
We present an environment for analysing signals of all kinds with LDB (Local Discriminant Bases) and MLDB (Modified Local Discriminant Bases). The environment uses functions developed within a thesis that is still in progress. Understanding some of these functions requires an advanced knowledge of signal processing. They were drawn from the work of Naoki Saito [3], which served as the starting point for the algorithm of Jose Antonio Soria's unfinished doctoral thesis. The interface accepts new packages and functions. We have left a menu ready for integrating Sinus IV packet transform and Cosine IV packet transform, although others can also be added. The application consists of two interfaces, an Assistant and a main interface. We have also created a window for importing and exporting the desired variables to different environments. To build the application, all window elements were programmed by hand instead of using MATLAB's GUIDE (Graphical User Interface Development Environment), so that it is compatible across the different versions of that program. In total we wrote 73 functions for the main interface (10 of which belong to the import/export window) and 23 for the Assistant. In this work we explain only 6 functions, plus the 3 that create the interfaces, to keep it from becoming excessively long. The functions explained are the most important ones, either because they are used often, because they are the most complicated according to McCabe complexity, or because they are needed for signal processing. Every user input is passed through functions that detect errors in it, for example by removing zeros or non-numeric characters, and by checking that values are integers or lie within their allowed maximum and minimum limits.
Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.
Abstract:
BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, is performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
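The nested cross-validation scheme described under METHODS can be sketched in plain Python. This is a minimal illustration under strong assumptions, not the study's pipeline: it uses only a kNN classifier, tunes only the number of neighbours `k`, works on a small synthetic two-group data set without any batch effect, and all function names are hypothetical. The point it shows is that tuning happens inside each outer training part, so the outer test part is never used for model selection.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def folds(data, n_folds):
    """Yield (train, test) splits; every n_folds-th sample forms the test part."""
    for f in range(n_folds):
        test = data[f::n_folds]
        train = [s for i, s in enumerate(data) if i % n_folds != f]
        yield train, test


def nested_cv(data, ks=(1, 3, 5), outer=5, inner=4):
    """Outer loop estimates performance; inner loop tunes k on the outer training part only."""
    scores = []
    for train, test in folds(data, outer):
        best_k = max(ks, key=lambda k: sum(accuracy(tr, va, k) for tr, va in folds(train, inner)))
        scores.append(accuracy(train, test, best_k))
    return sum(scores) / len(scores)


# Synthetic data: group 1 is shifted on the first feature, the second feature is noise.
rng = random.Random(0)
data = [([rng.gauss(3.0 * (i % 2), 1.0), rng.gauss(0.0, 1.0)], i % 2) for i in range(60)]
rng.shuffle(data)
```

Calling `nested_cv(data)` returns the cross-validated accuracy estimate. In the setting of the study, this estimate would then be compared with the accuracy obtained on independent data, which this sketch does not simulate.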
Abstract:
We present an analysis of the register of all unemployment episodes in the Grand Duchy of Luxembourg over a recent period of 55 months. We apply propensity score matching to account for the systematic differences among the groups of subjects (registrants) and unemployment spells. We devise graphical and tabular summaries for describing the sequences of employment states of the members of the labour force who register at Agence pour le développement de l'emploi, the Luxembourg Public Unemployment Agency. Some employment-related information about them is collected by linking their records to the national register of social security contributions, maintained by Inspection générale de la sécurité sociale. A class of univariate indices for characterising the sequences of labour force states is defined.
Abstract:
Rapid growth in the availability and use of digital documents has prompted the development of instruments to handle them. A most important example of these instruments are digital identifiers, which provide a codification system that allows digital items, usually up to the level of a computer file, to be singled out and located. Digital identifiers make up standardized global systems applied to specific products or areas. They are part of the very many identifiers developed to handle large numbers of items and large amounts of information for transactional purposes, which often have a global span. Digital identifiers include the ubiquitous Global Trade Item Number (GTIN), a code that unequivocally identifies trade items all around the world. The GTIN can take on several configurations depending on its application. These include: EAN-13, EAN-8, EAN-14, and UCC-12. EAN-13 is the code used for retail products in order to facilitate trade at the point of sale; its widely known symbol or graphical form is the EAN/UPC-13 bar code...
Abstract:
SUMMARY: ExpressionView is an R package that provides an interactive graphical environment to explore transcription modules identified in gene expression data. A sophisticated ordering algorithm is used to present the modules with the expression in a visually appealing layout that provides an intuitive summary of the results. From this overview, the user can select individual modules and access biologically relevant metadata associated with them. AVAILABILITY: http://www.unil.ch/cbg/ExpressionView. Screenshots, tutorials and sample data sets can be found on the ExpressionView web site.
Abstract:
The coverage and volume of geo-referenced datasets are extensive and incessantly growing. The systematic capture of geo-referenced information generates large volumes of spatio-temporal data to be analyzed. Clustering and visualization play a key role in the exploratory data analysis and the extraction of knowledge embedded in these data. However, new challenges in visualization and clustering are posed when dealing with the special characteristics of this data, for instance its complex structures, large quantity of samples, variables involved in a temporal context, high dimensionality and large variability in cluster shapes. The central aim of my thesis is to propose new algorithms and methodologies for clustering and visualization, in order to assist the knowledge extraction from spatio-temporal geo-referenced data, thus improving decision-making processes. I present two original algorithms, one for clustering, the Fuzzy Growing Hierarchical Self-Organizing Networks (FGHSON), and the second for exploratory visual data analysis, the Tree-structured Self-organizing Maps Component Planes. In addition, I present methodologies that, combined with FGHSON and the Tree-structured SOM Component Planes, allow the integration of space and time seamlessly and simultaneously in order to extract knowledge embedded in a temporal context. The originality of the FGHSON lies in its capability to reflect the underlying structure of a dataset in a hierarchical fuzzy way. A hierarchical fuzzy representation of clusters is crucial when data include complex structures with large variability of cluster shapes, variances, densities and number of clusters. The most important characteristics of the FGHSON include: (1) It does not require an a priori setup of the number of clusters. (2) The algorithm executes several self-organizing processes in parallel. Hence, when dealing with large datasets the processes can be distributed, reducing the computational cost. (3) Only three parameters are necessary to set up the algorithm. In the case of the Tree-structured SOM Component Planes, the novelty of this algorithm lies in its ability to create a structure that allows the visual exploratory data analysis of large high-dimensional datasets. This algorithm creates a hierarchical structure of Self-Organizing Map Component Planes, arranging similar variables' projections in the same branches of the tree. Hence, similarities in variables' behavior can be easily detected (e.g. local correlations, maximal and minimal values and outliers). Both FGHSON and the Tree-structured SOM Component Planes were applied to several agroecological problems, proving to be very efficient in the exploratory analysis and clustering of spatio-temporal datasets. In this thesis I also tested three soft competitive learning algorithms. Two of them are well-known unsupervised soft competitive algorithms, namely the Self-Organizing Maps (SOMs) and the Growing Hierarchical Self-Organizing Maps (GHSOMs); the third was our original contribution, the FGHSON. Although the algorithms presented here have been used in several areas, to my knowledge there is no work applying and comparing the performance of those techniques on spatio-temporal geospatial data, as is presented in this thesis. I propose original methodologies to explore spatio-temporal geo-referenced datasets through time. Our approach uses time windows to capture temporal similarities and variations by using the FGHSON clustering algorithm. The developed methodologies are used in two case studies. In the first, the objective was to find similar agroecozones through time, and in the second it was to find similar environmental patterns shifted in time. Several results presented in this thesis have led to new contributions to agroecological knowledge, for instance in sugar cane and blackberry production. Finally, in the framework of this thesis we developed several software tools: (1) a Matlab toolbox that implements the FGHSON algorithm, and (2) a program called BIS (Bio-inspired Identification of Similar agroecozones), an interactive graphical user interface tool which integrates the FGHSON algorithm with Google Earth in order to show zones with similar agroecological characteristics.