967 results for Data matrix
                                
Abstract:
Although squamates are commonly found at South American Cenozoic fossil sites, complete skeletal material is rare. Only a few such specimens have been recorded, and most finds are fragmentary skull and mandible material or isolated vertebrae. Among the South American localities that yield fossil vertebrates, the Chichínales Formation stands out for the recent discovery, in its sediments, of a nearly complete skull of a previously unknown teiid lizard. Based on the associated fauna, the age of the formation is defined as Early Miocene (Colhuehuapian). In the present study, a phylogenetic analysis of 39 living and fossil squamate species and 149 osteological characters leads to the conclusion that this material belongs to a new species of the extant genus Callopistes. A detailed morphological description of the fossil, obtained through stereoscopic analysis and high-resolution computed microtomography (CT scan), is also presented. The morphological matrix was analysed with the software TNT version 1.1 under the principle of maximum parsimony, with all characters equally weighted, resulting in four equally parsimonious trees, which were then used to build a strict consensus tree. In all four trees, the new taxon was placed within the family Teiidae as a member of the clade formed by the other living species of Callopistes. However, it was not possible to establish an unequivocal sister-group relationship between the two Callopistes species included in the analysis and the fossil.
The current distribution of the two living species of Callopistes and the locality where the fossil was recovered indicate that the genus had a much wider distribution in the past, reaching cis-Andean Patagonian areas, unlike the high-altitude trans-Andean areas to which the two extant species are restricted.
                                
Abstract:
Biplot analyses based on additive main effects and multiplicative interaction (AMMI) models require complete data matrices, but multi-environment trials frequently have missing data. This thesis proposes new single and multiple imputation methodologies that can be used to analyse unbalanced data in experiments with genotype-by-environment (G×E) interaction. The first is a new extension of the eigenvector cross-validation method (Bro et al., 2008). The second is a new non-parametric algorithm obtained by modifying the single imputation method developed by Yan (2013). Also included is a study that considers imputation systems recently reported in the literature and compares them with the classical procedure recommended for imputation in G×E trials, namely the combination of the Expectation-Maximization algorithm with AMMI models, or EM-AMMI. Finally, generalizations are provided of the single imputation described by Arciniegas-Alarcón et al. (2010), which combines regression with lower-rank approximation of a matrix. All the methodologies are based on the singular value decomposition (SVD) and are therefore free of distributional or structural assumptions. To assess the performance of the new imputation schemes, simulations were carried out on real data sets from different species, with values removed at random in different percentages and the quality of the imputations evaluated with several statistics. It was concluded that the SVD is a useful and flexible tool for building efficient techniques that circumvent the problem of information loss in experimental matrices.
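The general idea of SVD-based imputation can be illustrated with a minimal sketch. It is not any of the specific methods proposed in the thesis: it assumes a plain iterative scheme in which missing cells start at their column means and are then repeatedly replaced by the rank-1 SVD approximation of the filled matrix. Unlike EM-AMMI, it fits no additive main effects. The function names and the toy matrix are hypothetical, and the power iteration assumes a non-zero matrix whose leading singular vector is not orthogonal to the all-ones start vector.

```python
def rank1_approx(M, n_power=100):
    """Rank-1 SVD approximation of M, computed by power iteration on M^T M."""
    rows, cols = len(M), len(M[0])
    v = [1.0] * cols
    for _ in range(n_power):
        u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    # With v the leading right singular vector, (M v) v^T equals sigma * u * v^T.
    u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def svd_impute(X, n_iter=500, tol=1e-10):
    """Fill cells marked None using repeated rank-1 SVD fits (illustrative only)."""
    rows, cols = len(X), len(X[0])
    missing = [(i, j) for i in range(rows) for j in range(cols) if X[i][j] is None]
    filled = [row[:] for row in X]
    # Starting values: mean of the observed entries in each column.
    for j in range(cols):
        observed = [X[i][j] for i in range(rows) if X[i][j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(rows):
            if X[i][j] is None:
                filled[i][j] = col_mean
    for _ in range(n_iter):
        approx = rank1_approx(filled)
        change = 0.0
        for i, j in missing:
            change = max(change, abs(approx[i][j] - filled[i][j]))
            filled[i][j] = approx[i][j]  # only missing cells are updated
        if change < tol:
            break
    return filled


# Hypothetical genotype-by-environment table with one missing cell.
table = [[1, 2, 3],
         [2, 4, 6],
         [3, 6, None]]
completed = svd_impute(table)
```

Because the toy table is exactly rank 1, the imputed cell approaches the value implied by the multiplicative structure. Real trial data would call for a choice of rank and some form of cross-validation.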
                                
Abstract:
This paper discusses a process for graphically viewing and analysing information obtained from a network of urban streets, using an algorithm that ranks the nodes of the network by importance. The basis of this process is to quantify the network information by assigning numerical values to each node. These values are used to construct a data matrix, to which we apply an algorithm that classifies the nodes of the network in order of importance. From this numerical ranking of the nodes, the process finishes with the graphical visualization of the network. An example is shown to illustrate the whole process.
                                
Abstract:
We propose and discuss a new centrality index for urban street patterns represented as networks in geographical space. This centrality measure, which we call ranking-betweenness centrality, combines the idea behind the random-walk betweenness centrality measure and the idea of ranking the nodes of a network produced by an adapted PageRank algorithm. We initially use a PageRank algorithm in which we are able to transform some information about the network that we want to analyse into numerical values. Numerical values summarizing the information are associated with each of the nodes by means of a data matrix. After running the adapted PageRank algorithm, a ranking of the nodes is obtained, according to their importance in the network. This classification is the starting point for applying an algorithm based on the random-walk betweenness centrality. A detailed example of a real urban street network is discussed in order to show how the proposed ranking-betweenness centrality is evaluated, with some comparisons against other classical centrality measures.
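A minimal sketch of how a data matrix can feed a PageRank-style ranking, assuming the per-node information is collapsed into a teleportation (personalization) vector. This is a generic personalized PageRank, not necessarily the adapted algorithm used by the authors. The function name, the attribute weights, the damping factor and the toy network are all hypothetical.

```python
def data_matrix_pagerank(adjacency, data_matrix, weights, damping=0.85,
                         n_iter=200, tol=1e-12):
    """Personalized PageRank whose teleport vector comes from a data matrix.

    adjacency[j][i] > 0 means there is a link from node j to node i.
    data_matrix has one row per node and one column per attribute.
    weights gives the (assumed) relative importance of each attribute.
    """
    n = len(adjacency)
    # Collapse each node's attributes into one score, then normalise to sum 1.
    scores = [sum(w * x for w, x in zip(weights, row)) for row in data_matrix]
    total = sum(scores)
    teleport = [s / total for s in scores]
    out_weight = [sum(row) for row in adjacency]
    rank = [1.0 / n] * n
    for _ in range(n_iter):
        new_rank = [(1.0 - damping) * t for t in teleport]
        for j in range(n):
            if out_weight[j] == 0:
                # Dangling node: spread its rank according to the teleport vector.
                for i in range(n):
                    new_rank[i] += damping * rank[j] * teleport[i]
            else:
                for i in range(n):
                    if adjacency[j][i]:
                        new_rank[i] += damping * rank[j] * adjacency[j][i] / out_weight[j]
        diff = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if diff < tol:
            break
    return rank


# Toy network: three mutually connected street nodes, one attribute per node.
adjacency = [[0, 1, 1],
             [1, 0, 1],
             [1, 1, 0]]
data_matrix = [[1], [1], [4]]   # e.g. number of facilities at each node
ranking = data_matrix_pagerank(adjacency, data_matrix, weights=[1.0])
```

In this symmetric toy network the topology alone cannot separate the nodes, so any difference in the resulting ranking comes from the data matrix.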
                                
Abstract:
Urban researchers and planners are often interested in understanding how economic activities are distributed in urban regions, what forces influence their spatial pattern and how urban structure and functions are mutually dependent. In this paper, we show how an algorithm for ranking the nodes in a network can be used to understand and visualize certain commercial activities of a city. The first part of the method consists of collecting real information about different types of commercial activities at each location in the urban network of the city of Murcia, Spain. Four clearly differentiated commercial activities are studied (restaurants and bars, shops, banks, and supermarkets or department stores), although others could be studied in the same way. The information collected is then quantified by means of a data matrix, which is used as the basis for a PageRank algorithm that produces a ranking of all the nodes in the network according to their significance within it. Finally, we visualize the resulting classification using a colour scale that helps us to represent the business network.
                                
Abstract:
Based on morphological features alone, there is considerable difficulty in identifying the 5 most economically damaging weed species of Sporobolus [viz. S. pyramidalis P. Beauv., S. natalensis (Steud.) Dur and Schinz, S. fertilis (Steud.) Clayton, S. africanus (Poir.) Robyns and Tourney, and S. jacquemontii Kunth.] found in Australia. A polymerase chain reaction (PCR)-based random amplified polymorphic DNA (RAPD) technique was used to create a series of genetic markers that could positively distinguish the 5 major weeds from the other less damaging weedy and native Sporobolus species. In the initial RAPD profiling experiment, using arbitrarily selected primers and involving 12 species of Sporobolus, 12 genetic markers were found that, when used in combination, could consistently distinguish the 5 weedy species from all others. Of these 12 markers, the most diagnostic were UBC51(490) for S. pyramidalis and S. natalensis; UBC43(310,2000,2100) for S. fertilis and S. africanus; and OPA20(850) and UBC43(470) for S. jacquemontii. Species-specific markers could be found only for S. jacquemontii. In an effort to understand why there was difficulty in obtaining species-specific markers for some of the weedy species, a RAPD data matrix was created using 40 RAPD products. These 40 products, amplified by 6 random primers from 45 individuals belonging to 12 species, were then subjected to numerical taxonomy and multivariate system (NTSYS pc version 1.70) analysis. The RAPD similarity matrix generated from the analysis indicated that S. pyramidalis was genetically more similar to S. natalensis than to other species of the 'S. indicus complex'. Similarly, S. jacquemontii was more similar to S. pyramidalis, and S. fertilis was more similar to S. africanus than to other species of the complex. Sporobolus pyramidalis, S. jacquemontii, S. africanus, and S. creber exhibited low within-species genetic diversity, whereas high genetic diversity was observed within S. natalensis, S. fertilis, S. sessilis, S. elongatus, and S. laxus. Cluster analysis placed all of the introduced species (major and minor weedy species) into one major cluster, with S. pyramidalis and S. natalensis in one distinct subcluster and S. fertilis and S. africanus in another. The native species formed separate clusters in the phenograms. The close genetic similarity of S. pyramidalis to S. natalensis, and of S. fertilis to S. africanus, may explain the difficulty in obtaining RAPD species-specific markers. These results are important for the Australian dairy and beef industries and will aid in the development of integrated management strategies for these weeds.
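To make the step from a RAPD data matrix to a similarity matrix concrete, here is a small sketch. It assumes bands are scored as present (1) or absent (0) and that similarity is measured with the Jaccard coefficient; the abstract does not state which coefficient was used, so that choice, the function name and the toy scores are all assumptions.

```python
def jaccard_similarity_matrix(bands):
    """Pairwise Jaccard similarity between individuals.

    bands is a list of rows, one per individual, each a list of 0/1 scores
    (one per RAPD product). Shared absences are ignored by this coefficient.
    """
    n = len(bands)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(1 for a, b in zip(bands[i], bands[j]) if a and b)
            either = sum(1 for a, b in zip(bands[i], bands[j]) if a or b)
            sim[i][j] = sim[j][i] = shared / either if either else 0.0
    return sim


# Hypothetical scores for three individuals and five RAPD products.
bands = [[1, 1, 0, 1, 0],
         [1, 1, 0, 0, 0],
         [0, 0, 1, 0, 1]]
similarity = jaccard_similarity_matrix(bands)
```

A matrix of this kind is the usual input to a clustering step that produces a phenogram.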
                                
Abstract:
This thesis describes the development of an inexpensive and efficient clustering technique for multivariate data analysis. The technique starts from a multivariate data matrix and ends with a graphical representation of the data and a pattern recognition discriminant function. The technique also yields a distances frequency distribution that might be useful in detecting clustering in the data or in estimating parameters that help discriminate between the different populations in the data. The technique can also be used in feature selection. It is essentially a means of discovering data structure by revealing the component parts of the data. The thesis offers three distinct contributions to cluster analysis and pattern recognition techniques. The first contribution is the introduction of a transformation function in the technique of nonlinear mapping. The second contribution is the use of a distances frequency distribution instead of a distances time-sequence in nonlinear mapping. The third contribution is the formulation of a new generalised and normalised error function together with its optimal step size formula for gradient method minimisation. The thesis consists of five chapters. The first chapter is the introduction. The second chapter describes multidimensional scaling as an origin of the nonlinear mapping technique. The third chapter describes the first development step in the technique of nonlinear mapping, namely the introduction of the "transformation function". The fourth chapter describes the second development step of the nonlinear mapping technique, which is the use of a distances frequency distribution instead of a distances time-sequence. The chapter also includes the formulation of the new generalised and normalised error function. Finally, the fifth chapter, the conclusion, evaluates all developments and proposes a new program for cluster analysis and pattern recognition that integrates all the new features.
                                
Abstract:
Quantitative estimation of surface ocean productivity and bottom water oxygen concentration with benthic foraminifera was attempted using 70 samples from equatorial and North Pacific surface sediments. These samples come from a well defined depth range in the ocean, between 2200 and 3200 m, so that depth related factors do not interfere with the estimation. Samples were selected so that foraminifera were well preserved in the sediments and temperature and salinity were nearly uniform (T = 1.5 °C; S = 34.6 per mil). The sample set was also assembled so as to minimize the correlation often seen between surface ocean productivity and bottom water oxygen values (r**2 = 0.23 for prediction purposes in this case). This procedure reduced the chances of spurious results due to correlations between the environmental variables. The samples encompass a range of productivities from about 25 to >300 gC m**-2 yr**-1, and a bottom water oxygen range from 1.8 to 3.5 ml/L. Benthic foraminiferal assemblages were quantified using the >62 µm fraction of the sediments and 46 taxon categories. MANOVA multivariate regression was used to project the faunal matrix onto the two environmental dimensions, using published values for productivity and bottom water oxygen to calibrate this operation. The success of this regression was measured with the multivariate r**2, which was 0.98 for the productivity dimension and 0.96 for the oxygen dimension. These high coefficients indicate that both environmental variables are strongly embedded in the faunal data matrix. Analysis of the beta regression coefficients shows that the environmental signals are carried by groups of taxa which are consistent with previous work characterizing benthic foraminiferal responses to productivity and bottom water oxygen. The results of this study suggest that benthic foraminiferal assemblages can be used for quantitative reconstruction of surface ocean productivity and bottom water oxygen concentrations if suitable surface sediment calibration data sets are developed and appropriate means for detecting no-analog samples are found.
                                
Abstract:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
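The two aggregation steps named above, the median inclusion index and the averaging of per-subset estimates, can be sketched as follows. This is only an illustration: the per-subset feature selection and coefficient estimation are assumed to have been done elsewhere, ties in the median are resolved by excluding the feature (an assumption), and all names and numbers are hypothetical.

```python
from statistics import median


def median_inclusion(selections):
    """Indices of features whose median 0/1 inclusion indicator exceeds 0.5.

    selections holds one 0/1 list per subset (1 = feature selected there).
    With an even number of subsets a tie gives a median of 0.5; such a
    feature is excluded here, which is an assumption of this sketch.
    """
    n_features = len(selections[0])
    return [j for j in range(n_features)
            if median(s[j] for s in selections) > 0.5]


def average_estimates(estimates):
    """Average coefficient vectors estimated independently on each subset."""
    return [sum(column) / len(column) for column in zip(*estimates)]


# Hypothetical selections from three subsets over three candidate features.
selections = [[1, 0, 1],
              [1, 1, 0],
              [1, 0, 1]]
kept = median_inclusion(selections)

# Hypothetical per-subset coefficients for the kept features.
subset_coefficients = [[1.0, 2.0],
                       [3.0, 4.0]]
combined = average_estimates(subset_coefficients)
```

Only the 0/1 inclusion vectors and the short coefficient vectors need to be exchanged between workers, which is consistent with the low-communication goal stated in the abstract.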
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework DEME (DECO-message) by leveraging both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted on a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable.
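The partitioning of a data matrix into row-and-column blocks can be sketched as below. This shows only the bookkeeping: it assumes contiguous, near-equal chunks and leaves out the decorrelation, fitting and aggregation steps. The function names and the toy matrix are hypothetical.

```python
def split_indices(n, parts):
    """Split range(n) into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def partition_blocks(matrix, row_parts, col_parts):
    """Return blocks[r][c]: the sub-matrix for row chunk r and column chunk c."""
    row_chunks = split_indices(len(matrix), row_parts)
    col_chunks = split_indices(len(matrix[0]), col_parts)
    return [[[[matrix[i][j] for j in cols] for i in rows]
             for cols in col_chunks]
            for rows in row_chunks]


# Toy 4 x 6 data matrix whose entry at (i, j) is 10 * i + j.
data = [[10 * i + j for j in range(6)] for i in range(4)]
blocks = partition_blocks(data, row_parts=2, col_parts=3)
# blocks[1][2] holds rows 2-3 and columns 4-5 of the original matrix.
```

Each block can then be handed to a separate worker.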
                                
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
                                
Abstract:
The concentration of six elements (Cd, Cr, Pb, Ni, Zn and Fe) was measured in sixty-seven samples of 100% fruit juice, two soft drink samples, ten samples of juice concentrate and seven samples of the dilution water used in juice processing. The juice samples give a fairly broad picture of the Portuguese market for 100% fruit juices. The soft drinks, concentrates and dilution waters were supplied by two Portuguese juice manufacturers. Elemental concentrations were measured by FAAS and GFAAS, and the Brix degree of the juices was also measured. The factors fruit, fruit percentage, origin, farming method, treatment, packaging, preservation and process were obtained from the manufacturer's information on product labels and by direct contact. The market was characterized in terms of the concentration of these elements, and their dilution was characterized by comparison with reference values for the European market. The degree of association between the various parameters and the final elemental concentration of the juices was measured, and cluster analysis, multiple correspondence analysis and factor analysis were used to restructure the data matrix. From the results obtained, the fruit juices show the following order of magnitude in their elemental concentrations: Cd
                                
Abstract:
The paper catalogues the procedures and steps involved in agroclimatic classification. These vary from conventional descriptive methods to modern computer-based numerical techniques. There are three mutually independent numerical classification techniques, namely ordination, cluster analysis, and minimum spanning tree, and under each technique several forms of grouping exist. The choice of numerical classification procedure differs with the type of data set. In the case of numerical continuous data sets with both positive and negative values, the simplest and least controversial procedures are the unweighted pair group method (UPGMA) and the weighted pair group method (WPGMA) under clustering techniques, with the similarity measure obtained from either the Gower metric or the standardized Euclidean metric. Where the number of attributes is large, these can be reduced to fewer new attributes defined by the principal components or coordinates obtained by an ordination technique. The first few components or coordinates explain the maximum variance in the data matrix. These revised attributes are less affected by noise in the data set. It is possible to check for misclassifications using a minimum spanning tree.
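As an illustration of UPGMA clustering on a standardized Euclidean metric, the sketch below standardizes each attribute and then repeatedly merges the two clusters with the smallest average pairwise distance. It is a plain, unoptimized version written for clarity. The station values are invented, and the code assumes no attribute is constant, since that would give a zero standard deviation.

```python
from statistics import mean, pstdev


def standardize(data):
    """Scale each attribute (column) to zero mean and unit standard deviation."""
    columns = list(zip(*data))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in data]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def upgma(points):
    """Return the merge history as (members_a, members_b, average_distance)."""
    clusters = {i: [i] for i in range(len(points))}
    next_id = len(points)
    merges = []

    def average_distance(c1, c2):
        # UPGMA: unweighted mean of all pairwise distances between members.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        ids = list(clusters)
        pairs = [(a, b) for k, a in enumerate(ids) for b in ids[k + 1:]]
        a, b = min(pairs, key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]))
        height = average_distance(clusters[a], clusters[b])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Hypothetical stations: (mean temperature in deg C, annual rainfall in mm).
stations = [[-5, 100], [-4, 110], [20, 900], [21, 950]]
history = upgma(standardize(stations))
```

Standardizing first keeps the rainfall column, with its much larger numerical range, from dominating the distances. It also handles attributes that take both positive and negative values.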
                                
Abstract:
The main objective of this study is to apply recently developed methods of statistical physics to time series analysis, particularly to electrical induction profiles from oil wells, in order to study the petrophysical similarity of those wells in a spatial distribution. For this, we used the DFA method to determine whether this technique can be used to characterize the fields spatially. After obtaining the DFA values for all wells, we applied clustering analysis. For these tests we used the non-hierarchical method called K-means. Usually based on the Euclidean distance, K-means consists of dividing the elements of a data matrix N into k groups, so that the similarities among elements belonging to different groups are the smallest possible. To test whether a dataset generated by the K-means method, or randomly generated datasets, form spatial patterns, we created the parameter Ω (index of neighborhood). High values of Ω reveal more aggregated data, and low values of Ω show scattered data or data without spatial correlation. We thus concluded that the DFA data from 54 wells are grouped and can be used to characterize spatial fields. Applying a contour level technique, we confirmed the results obtained by K-means, confirming that DFA is effective for spatial analysis.
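The K-means step can be sketched with a minimal Lloyd-style implementation using squared Euclidean distance. To keep the result deterministic, the first k points are used as the initial centroids, which is an assumption of this sketch rather than something stated in the abstract. The values below are invented stand-ins for per-well DFA results.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Basic K-means; returns (labels, centroids)."""
    # Assumption: initialise centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical one-dimensional values, one per well.
wells = [[0.50], [0.90], [0.52], [0.88], [0.51], [0.91]]
labels, centroids = kmeans(wells, k=2)
```

The labels produced here could then be mapped to well coordinates to check for spatial grouping. The index Ω itself is not sketched because the abstract does not define it.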
                                
Abstract:
Summary
                                
Abstract:
Dissolved organic matter (DOM) is a complex mixture of organic compounds, ubiquitous in marine and freshwater systems. Fluorescence spectroscopy, by means of Excitation-Emission Matrices (EEM), has become an indispensable tool to study DOM sources, transport and fate in aquatic ecosystems. However, the statistical treatment of large and heterogeneous EEM data sets still represents an important challenge for biogeochemists. Recently, Self-Organising Maps (SOM) have been proposed as a tool to explore patterns in large EEM data sets. SOM is a pattern recognition method which clusters the input EEMs and reduces their dimensionality without relying on any assumption about the data structure. In this paper, we show how SOM, coupled with a correlation analysis of the component planes, can be used both to explore patterns among samples and to identify individual fluorescence components. We analysed a large and heterogeneous EEM data set, including samples from a river catchment collected under a range of hydrological conditions, along a 60-km downstream gradient, and under the influence of different degrees of anthropogenic impact. According to our results, chemical industry effluents appeared to have unique and distinctive spectral characteristics. On the other hand, river samples collected under flash flood conditions showed homogeneous EEM shapes. The correlation analysis of the component planes suggested the presence of four fluorescence components, consistent with DOM components previously described in the literature. A remarkable strength of this methodology was that outlier samples appeared naturally integrated in the analysis. We conclude that SOM coupled with a correlation analysis procedure is a promising tool for studying large and heterogeneous EEM data sets.
 
                    