20 results for data-driven Stochastic Subspace Identification (SSI-data) in the Biblioteca Digital da Produção Intelectual da Universidade de São Paulo
Abstract:
This article presents an overview of the open-loop subspace identification problem. Several algorithms exist that solve this problem (MOESP, DSR, N4SID, CVA). Building on the MOESP and N4SID methods, the authors present an alternative algorithm for identifying deterministic systems operating in open loop. Two simulated processes, one SISO and one MIMO, are used to demonstrate the performance of this algorithm.
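The abstract does not give the algorithm itself, but the SVD-based realization step shared by MOESP/N4SID-type methods can be illustrated with the classical Ho-Kalman procedure. The sketch below, with hypothetical system matrices, recovers a deterministic state-space model (up to a similarity transform) from simulated impulse-response data; it is a simplified relative of the methods named above, not the authors' algorithm.

```python
import numpy as np

# True system, used only to generate demo data
A_true = np.array([[0.8, 0.2], [0.0, 0.5]])
B_true = np.array([[1.0], [0.5]])
C_true = np.array([[1.0, 0.0]])

# Markov parameters h_k = C A^k B, k = 0 .. 2N-1
N = 10
h = [(C_true @ np.linalg.matrix_power(A_true, k) @ B_true).item() for k in range(2 * N)]

# Block Hankel matrices (H1 is the one-step-shifted version of H0)
H0 = np.array([[h[i + j] for j in range(N)] for i in range(N)])
H1 = np.array([[h[i + j + 1] for j in range(N)] for i in range(N)])

# SVD of H0 reveals the system order (here n = 2)
U, s, Vt = np.linalg.svd(H0)
n = int(np.sum(s > 1e-8 * s[0]))
S = np.diag(np.sqrt(s[:n]))
Obs = U[:, :n] @ S            # extended observability matrix
Ctr = S @ Vt[:n, :]           # extended controllability matrix

A_hat = np.linalg.pinv(Obs) @ H1 @ np.linalg.pinv(Ctr)
B_hat = Ctr[:, :1]            # first block column
C_hat = Obs[:1, :]            # first block row

# Recovered Markov parameters match the originals despite the basis change
print((C_hat @ A_hat @ B_hat).item(), h[1])
```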
Abstract:
Complexity in time series is an intriguing feature of living dynamical systems, with potential use for identification of system state. Although various methods have been proposed for measuring physiologic complexity, uncorrelated time series are often assigned high values of complexity, erroneously classifying them as complex physiological signals. Here, we propose and discuss a method for complex system analysis based on a generalized statistical formalism and surrogate time series. Sample entropy (SampEn) was rewritten, inspired by the Tsallis generalized entropy, as a function of the q parameter (qSampEn). qSDiff curves were calculated, which consist of the differences between the qSampEn of the original and surrogate series. We evaluated qSDiff for 125 real heart rate variability (HRV) dynamics, divided into groups of 70 healthy, 44 congestive heart failure (CHF), and 11 atrial fibrillation (AF) subjects, and for simulated series of stochastic and chaotic processes. The evaluations showed that, for nonperiodic signals, qSDiff curves have a maximum point (qSDiff_max) at q ≠ 1. The values of q where the maximum point occurs and where qSDiff is zero were also evaluated. Only qSDiff_max values were capable of distinguishing the HRV groups (p-values 5.10 × 10⁻³, 1.11 × 10⁻⁷, and 5.50 × 10⁻⁷ for healthy vs. CHF, healthy vs. AF, and CHF vs. AF, respectively), consistent with the concept of physiologic complexity, and they suggest a potential use for chaotic system analysis. [http://dx.doi.org/10.1063/1.4758815]
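A minimal sketch of the idea, assuming the natural reading that qSampEn replaces the logarithm in SampEn = -ln(A/B) with the Tsallis q-logarithm ln_q(x) = (x^(1-q) - 1)/(1-q); the paper's exact definition, surrogate method, and parameters may differ, and all names below are illustrative:

```python
import numpy as np

def sampen_counts(x, m=2, r=0.2):
    """Counts (A, B) of template matches of length m+1 and m
    (Chebyshev distance <= r * std), self-matches excluded."""
    x = np.asarray(x, float)
    tol = r * x.std()
    def count(mm):
        t = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)
        return (np.sum(d <= tol) - len(t)) / 2.0   # off-diagonal pairs only
    return count(m + 1), count(m)

def q_log(x, q):
    return np.log(x) if abs(q - 1) < 1e-12 else (x**(1 - q) - 1) / (1 - q)

def q_sampen(x, q, m=2, r=0.2):
    A, B = sampen_counts(x, m, r)
    return -q_log(A / B, q)        # reduces to classical SampEn as q -> 1

rng = np.random.default_rng(0)
x = np.zeros(500)
for i in range(1, 500):            # toy correlated (AR(1)) series
    x[i] = 0.9 * x[i - 1] + rng.standard_normal()
surr = rng.permutation(x)          # simplest surrogate: random shuffle

qs = np.linspace(0.2, 3.0, 15)
qsdiff = [q_sampen(x, q) - q_sampen(surr, q) for q in qs]
print("q at extremum of qSDiff:", qs[int(np.argmax(np.abs(qsdiff)))])
```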
Abstract:
The reproductive performance of cattle may be influenced by several factors, but mineral imbalances are crucial in terms of direct effects on reproduction. Several studies have shown that elements such as calcium, copper, iron, magnesium, selenium, and zinc are essential for reproduction and can prevent oxidative stress. However, toxic elements such as lead, nickel, and arsenic can have adverse effects on reproduction. In this paper, we applied a simple and fast method of multi-element analysis to bovine semen samples from Zebu and European classes used in reproduction programs and artificial insemination. Samples were analyzed by inductively coupled plasma mass spectrometry (ICP-MS) using aqueous-medium calibration, with samples diluted 1:50 in a solution containing 0.01% (vol/vol) Triton X-100 and 0.5% (vol/vol) nitric acid. Rhodium, iridium, and yttrium were used as the internal standards for the ICP-MS analysis. To develop a reliable method of tracing the class of bovine semen, we used data mining techniques that make it possible to classify unknown samples after checking the differentiation of known-class samples. Based on the determination of 15 elements in 41 samples of bovine semen, 3 machine-learning tools for classification were applied to determine cattle class. Our results demonstrate the potential of the support vector machine (SVM), multilayer perceptron (MLP), and random forest (RF) chemometric tools to identify cattle class. Moreover, the selection tools made it possible to reduce the number of chemical elements needed from 15 to just 8.
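A hedged sketch of the chemometric step using scikit-learn, with synthetic element concentrations standing in for the real ICP-MS measurements (the actual data, labels, and tuning are not given in the abstract):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 41 samples x 15 element concentrations, two classes
rng = np.random.default_rng(1)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(41, 15))
y = np.array([0] * 21 + [1] * 20)   # 0 = Zebu, 1 = European (hypothetical split)
X[y == 1, :4] *= 1.8                # inject a class difference in 4 elements

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                       random_state=0)),
    "RF":  RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```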
Abstract:
A common interest in gene expression data analysis is to identify, from a large pool of candidate genes, the genes that present significant changes in expression levels between a treatment and a control biological condition. Usually, this is done using a statistic and a cutoff value that separate the genes into differentially and nondifferentially expressed. In this paper, we propose a Bayesian approach to identify differentially expressed genes by sequentially calculating credibility intervals from predictive densities, which are constructed using the sampled mean treatment effect from all genes under study, excluding the treatment effect of genes previously identified with statistical evidence for difference. We compare our Bayesian approach with the standard ones based on the t-test and modified t-tests via a simulation study, using the small sample sizes that are common in gene expression data analysis. The results provide evidence that the proposed approach performs better than the standard ones, especially for cases with mean differences and increases in treatment variance relative to control variance. We also apply the methodologies to a well-known publicly available data set on the Escherichia coli bacterium.
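One plausible reading of the sequential procedure, sketched with simulated effects: repeatedly build a predictive density from the not-yet-flagged genes, flag the most extreme gene lying outside its credibility interval, and repeat. The normal predictive form and the 99% level are illustrative assumptions, not the paper's exact model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_genes = 1000
effects = rng.normal(0.0, 1.0, n_genes)   # treatment-minus-control mean per gene
effects[:30] += 4.0                       # 30 truly differentially expressed genes

remaining = set(range(n_genes))
flagged = []
while True:
    idx = np.array(sorted(remaining))
    pool = effects[idx]
    # Credibility interval from the predictive density of the remaining pool
    lo, hi = stats.norm.interval(0.99, loc=pool.mean(), scale=pool.std(ddof=1))
    outside = idx[(effects[idx] < lo) | (effects[idx] > hi)]
    if len(outside) == 0:
        break
    # Flag the most extreme gene, drop it from the pool, rebuild the interval
    extreme = outside[np.argmax(np.abs(effects[outside] - pool.mean()))]
    flagged.append(int(extreme))
    remaining.discard(int(extreme))

print(len(flagged), "genes flagged as differentially expressed")
```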
Abstract:
Background: One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using the microarray technique. Due to important differences in the probabilistic models of the microarray and SAGE technologies, it is important to develop suitable techniques for selecting specific genes from SAGE measurements. Results: A new framework is proposed to select specific genes that distinguish different biological states based on the analysis of SAGE data. The framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify, among all gene groups selected by the strong-genes method, those that distinguish the different states most reliably. A score that takes into account both the credibility and the bolstered error values is proposed in order to rank the considered gene groups. Results obtained using SAGE data from gliomas are presented, corroborating the introduced methodology. Conclusion: The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis, and this information is incorporated in the described methodology. The introduced method is suitable for identifying signature genes that lead to a good separation of the biological states using SAGE, and it may be adapted for other counting methods such as Massively Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of the genes identified by the proposed method may be useful for generating classifiers.
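For a linear classifier with spherical Gaussian bolstering kernels, the bolstered resubstitution error has a closed form: each training point contributes the Gaussian mass lying on the wrong side of the hyperplane. A minimal sketch on synthetic data (the kernel width sigma is fixed here for simplicity; the method derives it from the data):

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = LinearDiscriminantAnalysis().fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_.item()

# Signed distance to the hyperplane, oriented toward each point's true class
dist = (X @ w + b) / np.linalg.norm(w)
signed = np.where(y == 1, dist, -dist)

sigma = 0.5                                   # bolstering kernel width (assumed)
bolstered_error = norm.cdf(-signed / sigma).mean()   # mass on the wrong side
resub_error = (clf.predict(X) != y).mean()           # plain resubstitution error
print(bolstered_error, resub_error)
```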
Abstract:
With the increasing production of information from e-government initiatives, there is also a need to transform a large volume of unstructured data into useful information for society. All this information should be easily accessible and made available in a meaningful and effective way in order to achieve semantic interoperability in electronic government services, which is a challenge to be pursued by governments around the world. Our aim is to discuss the context of e-Government Big Data and to present a framework to promote semantic interoperability through the automatic generation of ontologies from unstructured information found on the Internet. We propose the use of fuzzy mechanisms to deal with natural language terms and present related work in this area. The results achieved in this study consist of the architectural definition and the major components and requirements that compose the proposed framework. With this, it is possible to take advantage of the large volume of information generated by e-Government initiatives and use it to benefit society.
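As a toy illustration of "fuzzy mechanisms for natural language terms" (the framework itself is architectural and not specified at code level), one can assign graded memberships between free-text terms and ontology concepts; the concepts and terms below are hypothetical:

```python
from difflib import SequenceMatcher

# Hypothetical ontology concepts and free-text terms from e-government pages
concepts = ["tax payment", "driver licence renewal", "building permit"]
terms = ["pay taxes", "renewing a drivers license", "permits for building"]

def membership(term, concept):
    """Fuzzy membership degree in [0, 1] via string similarity — a crude
    stand-in for the linguistic fuzzy mechanisms the framework proposes."""
    return SequenceMatcher(None, term.lower(), concept.lower()).ratio()

for t in terms:
    best = max(concepts, key=lambda c: membership(t, c))
    print(f"{t!r} -> {best!r} (mu = {membership(t, best):.2f})")
```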
Abstract:
Background: A current challenge in gene annotation is to define gene function in the context of a network of relationships rather than for single genes. The inference of gene networks (GNs) has emerged as an approach to better understand the biology of the system and to study how its components interact with each other while keeping their functions stable. In general, however, there is not sufficient data to accurately recover GNs from expression levels alone, leading to the curse of dimensionality, in which the number of variables exceeds the number of samples. One way to mitigate this problem is to integrate biological data instead of using only expression profiles in the inference process. The use of several types of biological information in inference methods has increased significantly in recent years, in order to better recover the connections between genes and reduce false positives. What makes this strategy so interesting is the possibility of confirming known connections through the included biological data, as well as discovering new relationships between genes when the expression data are observed. Although several works on data integration have increased the performance of network inference methods, the real contribution of each type of biological information to the obtained improvement is not clear. Methods: We propose a methodology to include biological information in an inference algorithm in order to assess the prediction gain of using biological information and expression profiles together. We also evaluated and compared the gain of adding four types of biological information: (a) protein-protein interaction, (b) Rosetta stone fusion proteins, (c) KEGG, and (d) KEGG+GO. Results and conclusions: This work presents a first comparison of the gain from using prior biological information in the inference of GNs for a eukaryote (P. falciparum). Our results indicate that information based on direct interaction yields a larger gain than data about less specific relationships such as GO or KEGG. Also, as expected, the results show that the use of biological information is a very important approach for improving the inference. We also compared the gain in the inference of the global network with that for only the hubs; the results indicate that the use of biological information can improve the identification of the most connected proteins.
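A minimal sketch of one way such an integration can work: an expression-based similarity score that is boosted for gene pairs supported by prior biological evidence. The blending rule and the weight alpha are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes, n_samples = 30, 12        # few samples: the dimensionality problem
expr = rng.standard_normal((n_genes, n_samples))
expr[1] = expr[0] + 0.3 * rng.standard_normal(n_samples)  # gene 1 follows gene 0

# Hypothetical prior: pairs supported by PPI / Rosetta stone / KEGG evidence
prior_pairs = {(0, 1), (2, 7)}

score = np.abs(np.corrcoef(expr))  # expression-only similarity
np.fill_diagonal(score, 0.0)
alpha = 0.3                        # weight given to the biological prior
for i, j in prior_pairs:
    score[i, j] = score[j, i] = (1 - alpha) * score[i, j] + alpha * 1.0

edges = np.argwhere(np.triu(score, 1) > 0.8)   # thresholded network
print(edges)
```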
Abstract:
Each plasma physics laboratory has a proprietary control and data acquisition scheme, which usually differs from one laboratory to another. This means that each laboratory has its own way of controlling the experiment and retrieving data from the database. Fusion research relies to a great extent on international collaboration, and these private systems make it difficult to follow the work remotely. The TCABR data analysis and acquisition system has been upgraded to support a joint research programme using remote participation technologies. The choice of MDSplus (Model Driven System plus) is justified by the fact that it is widely used, so scientists from different institutions may use the same system in different experiments on different tokamaks without needing to know how each machine handles its acquisition and data analysis. Another important point is that MDSplus has a library system that allows communication between different programming languages (Java, Fortran, C, C++, Python) and programs such as MATLAB, IDL, and OCTAVE. In the case of the TCABR tokamak, interfaces (the subject of this paper) were developed between the system already in use and MDSplus, instead of adopting MDSplus at all stages, from control and data acquisition to data analysis. This was done to preserve a complex system already in operation, which would otherwise take a long time to migrate. This implementation also allows new components to be added that use MDSplus fully at all stages.
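For illustration, remote access through the MDSplus Python thin client typically looks like the sketch below; the host name, shot number, and node paths are hypothetical, since the actual TCABR configuration is not given here:

```python
# Minimal MDSplus thin-client sketch (hypothetical host and node names)
from MDSplus import Connection

conn = Connection("mdsplus.example.org")      # remote MDSplus data server
conn.openTree("tcabr", 12345)                 # tree name and shot number (assumed)

signal = conn.get(r"\PLASMA_CURRENT").data()  # node path is an assumption
time = conn.get(r"dim_of(\PLASMA_CURRENT)").data()

print(signal.shape, time.shape)
```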
Abstract:
Introduction: Widespread screening programs have prompted a decrease in prostate cancer stage at diagnosis, and active surveillance is an option for patients who may harbor clinically insignificant prostate cancer (IPC). Pathologists include the possibility of an IPC in their reports based on the Gleason score and tumor volume. This study determined the accuracy of pathological data in the identification of IPC in radical prostatectomy (RP) specimens. Materials and Methods: Of 592 radical prostatectomy specimens examined in our laboratory from 2001 to 2010, 20 patients harbored IPC and exhibited biopsy findings suggestive of IPC. These biopsy features served as the criteria to define patients with potentially insignificant tumor in this population. The results of the prostate biopsies and surgical specimens of the 592 patients were compared. Results: The twenty patients who had IPC in both the biopsy and the RP specimen were considered true positive cases. All patients were divided into groups based on their diagnoses following RP: true positives (n = 20), false positives (n = 149), true negatives (n = 421), and false negatives (n = 2). The accuracy of the pathological data alone for the prediction of IPC was 91.4%, the sensitivity was 91%, and the specificity was 74%. Conclusion: The identification of IPC using pathological data exclusively is accurate, and pathologists should suggest this in their reports to help surgeons, urologists, and radiotherapists decide the best treatment for their patients.
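The reported sensitivity and specificity follow directly from the stated confusion matrix; a quick arithmetic check:

```python
# Worked arithmetic from the reported counts (TP=20, FP=149, TN=421, FN=2)
TP, FP, TN, FN = 20, 149, 421, 2

sensitivity = TP / (TP + FN)   # 20/22  ~ 0.909 -> the reported 91%
specificity = TN / (TN + FP)   # 421/570 ~ 0.739 -> the reported 74%
print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```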
Abstract:
This work proposes a method for data clustering based on complex networks theory. A data set is represented as a network by considering different metrics to establish the connection between each pair of objects. The clusters are obtained using five community detection algorithms. The network-based clustering approach is applied to two real-world databases and two sets of artificially generated data. The obtained results suggest that the exponential of the Minkowski distance is the most suitable metric for quantifying the similarities between pairs of objects, and that the community identification method based on greedy optimization provides the best cluster solution. We compare the network-based clustering approach with some traditional clustering algorithms and verify that it provides the lowest classification error rate.
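A minimal sketch of the pipeline with networkx and a standard data set: similarities from the exponential of the Minkowski distance, a thresholded weighted graph, and greedy modularity optimization for community detection. The sparsification threshold is our choice; the paper may construct the network differently:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.datasets import load_iris

X = load_iris().data
p = 2                                       # Minkowski exponent (p = 2: Euclidean)
d = np.linalg.norm(X[:, None, :] - X[None, :, :], ord=p, axis=2)

W = np.exp(-d)                              # exponential of the Minkowski distance
np.fill_diagonal(W, 0.0)
W[W < np.percentile(W, 90)] = 0.0           # keep only the strongest links (a choice)

G = nx.from_numpy_array(W)                  # zero entries do not become edges
communities = greedy_modularity_communities(G, weight="weight")
print([len(c) for c in communities])
```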
Abstract:
We present a new catalogue of galaxy triplets derived from the Sloan Digital Sky Survey (SDSS) Data Release 7. The identification of systems was performed considering galaxies brighter than Mr = -20.5 and imposing constraints over the projected distances, radial velocity differences of neighbouring galaxies and isolation. To improve the identification of triplets, we employed a data pixelization scheme, which allows us to handle large amounts of data as in the SDSS photometric survey. Using spectroscopic and photometric data in the redshift range 0.01 ≤ z ≤ 0.40, we obtain 5901 triplet candidates. We have used a mock catalogue to analyse the completeness and contamination of our methods. The results show a high level of completeness (~80 per cent) and low contamination (~5 per cent). By using photometric and spectroscopic data, we have also addressed the effects of fibre collisions in the spectroscopic sample. We have defined an isolation criterion considering the distance of the triplet brightest galaxy to the closest neighbour cluster, to describe a global environment, as well as the galaxies within a fixed aperture around the triplet brightest galaxy, to measure the local environment. The final catalogue comprises 1092 isolated triplets of galaxies in the redshift range 0.01 ≤ z ≤ 0.40. Our results show that photometric redshifts provide very useful information, allowing us to complete the sample of nearby systems whose detection is affected by fibre collisions, as well as extending the detection of triplets to large distances, where spectroscopic redshifts are not available.
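The pixelization idea can be sketched as a simple (ra, dec) grid that restricts neighbour searches to adjacent cells, turning a global O(N²) search into a local one. The flat-sky small-angle treatment and all numbers below are illustrative, not the catalogue's actual scheme:

```python
import numpy as np
from collections import defaultdict

# Toy pixelization: bin galaxies on a (ra, dec) grid so candidate neighbours
# are searched only in adjacent cells instead of over the whole catalogue.
rng = np.random.default_rng(5)
ra = rng.uniform(0, 90, 100_000)     # degrees (synthetic positions)
dec = rng.uniform(-30, 30, 100_000)

cell = 0.1                           # cell size ~ the linking length, degrees
grid = defaultdict(list)
for i, (a, d) in enumerate(zip(ra, dec)):
    grid[(int(a / cell), int(d / cell))].append(i)

def neighbours(i, r_max=0.05):
    """Galaxies within r_max degrees of galaxy i (flat-sky approximation)."""
    ca, cd = int(ra[i] / cell), int(dec[i] / cell)
    out = []
    for da in (-1, 0, 1):
        for dd in (-1, 0, 1):
            for j in grid.get((ca + da, cd + dd), []):
                if j != i and np.hypot(ra[j] - ra[i], dec[j] - dec[i]) < r_max:
                    out.append(j)
    return out

print(len(neighbours(0)))
```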
Abstract:
This paper presents a method for transforming the information of an engineering geological map into useful information for non-specialists involved in land-use planning. The method consists of classifying the engineering geological units in terms of land-use capability and identifying the legal and geological restrictions that apply in the study area. Both types of information are then superimposed on the land use, and a map of conflict areas is created. The analysis of these data leads to the identification of existing and forthcoming land-use conflicts and enables the proposal of planning measures on a regional and local scale. The map for regional planning was compiled at a 1:50,000 scale and encompasses the whole municipal land area, where uses are mainly rural. The map for local planning was compiled at a 1:10,000 scale and encompasses the urban area. Most of the classification and map operations used spatial analyst tools available in the Geographical Information System. The regional studies showed that the greater part of Analândia's territory presents appropriate land uses. The local-scale studies indicate that the majority of the densely occupied urban areas are on suitable land. Although the situation is in general positive, municipal policies should address the identified and expected land-use conflicts so that it can be further improved.
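The overlay logic can be sketched as a toy raster operation: a cell is a conflict area when the current land use demands more than the unit's capability allows, or when a legal restriction applies. The 3x3 grids and coding below are invented for illustration; the actual work uses GIS spatial-analyst tools on real maps:

```python
import numpy as np

capability = np.array([[3, 3, 2],
                       [2, 1, 1],
                       [3, 2, 1]])              # 3 = high capability, 1 = low
land_use   = np.array([[1, 2, 3],
                       [3, 3, 1],
                       [1, 1, 1]])              # 3 = intensive use (e.g. urban)
legal      = np.array([[0, 0, 1],
                       [0, 1, 0],
                       [0, 0, 0]], dtype=bool)  # legally protected areas

# Conflict wherever use intensity exceeds capability or a restriction applies
conflict = (land_use > capability) | legal
print(conflict.astype(int))
```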
Abstract:
Multi-element analysis of honey samples was carried out with the aim of developing a reliable method of tracing the origin of honey. Forty-two chemical elements were determined (Al, Cu, Pb, Zn, Mn, Cd, Tl, Co, Ni, Rb, Ba, Be, Bi, U, V, Fe, Pt, Pd, Te, Hf, Mo, Sn, Sb, P, La, Mg, I, Sm, Tb, Dy, Sd, Th, Pr, Nd, Tm, Yb, Lu, Gd, Ho, Er, Ce, Cr) by inductively coupled plasma mass spectrometry (ICP-MS). Then, three machine-learning tools for classification and two for attribute selection were applied in order to show that data mining tools can be used to find the region where honey originated. Our results clearly demonstrate the potential of the Support Vector Machine (SVM), Multilayer Perceptron (MLP) and Random Forest (RF) chemometric tools for honey origin identification. Moreover, the selection tools allowed a reduction from 42 trace element concentrations to only 5.
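A hedged sketch of the attribute-selection step, using random forest feature importances as one possible selection tool (the abstract does not name the two tools actually used) on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: honey samples x 42 element concentrations, 3 origin regions
rng = np.random.default_rng(6)
X = rng.lognormal(size=(90, 42))
y = np.repeat([0, 1, 2], 30)                 # hypothetical region labels
X[y == 1, 5] *= 2.0                          # make a few elements informative
X[y == 2, 11] *= 3.0

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
top5 = np.argsort(rf.feature_importances_)[::-1][:5]
print("most discriminative element indices:", top5)
```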
Abstract:
Stochastic methods based on time-series modeling combined with geostatistics can be useful tools to describe the variability of water-table levels in time and space and to account for uncertainty. Water-level monitoring networks can give information about the dynamics of the aquifer domain in both dimensions. Time-series modeling is an elegant way to treat monitoring data without the complexity of physically based mechanistic models. Time-series model predictions can be interpolated spatially, with the spatial differences in water-table dynamics determined by the spatial variation in the system properties and the temporal variation driven by the dynamics of the inputs into the system. An integration of stochastic methods is presented, based on time-series modeling and geostatistics, as a framework to predict water levels for decision making in groundwater management and land-use planning. The methodology is applied in a case study in a Guarani Aquifer System (GAS) outcrop area located in the southeastern part of Brazil. Communication of results in a clear and understandable form, via simulated scenarios, is discussed as an alternative when translating scientific knowledge into applications of stochastic hydrogeology in large aquifers with limited monitoring network coverage, like the GAS.
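A compact sketch of the two-step idea with synthetic data: a simple AR(1) time-series model per well, followed by spatial interpolation of the per-well predictions. Gaussian-process regression with an RBF kernel stands in for kriging (the two are formally close); the real study's time-series models and variograms are richer:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(7)

# Step 1 (sketch): an AR(1) time-series model per monitoring well
def ar1_forecast(levels):
    z = np.asarray(levels, float)
    phi = np.dot(z[:-1], z[1:]) / np.dot(z[:-1], z[:-1])  # least-squares AR(1)
    return phi * z[-1]

wells_xy = rng.uniform(0, 10, (15, 2))                    # synthetic coordinates (km)
levels = [np.cumsum(rng.normal(0, 0.1, 120)) + 50 for _ in range(15)]
predicted = np.array([ar1_forecast(z) for z in levels])

# Step 2: spatial interpolation of the per-well predictions (kriging stand-in)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), normalize_y=True)
gp.fit(wells_xy, predicted)
grid = np.array([[x, y] for x in np.linspace(0, 10, 5)
                        for y in np.linspace(0, 10, 5)])
surface, sd = gp.predict(grid, return_std=True)           # map with uncertainty
print(surface.round(1))
```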
Abstract:
Context. Be stars are rapidly rotating stars with a circumstellar decretion disk. They usually undergo pressure and/or gravity pulsation modes excited by the kappa-mechanism, i.e. an effect of the opacity of iron-peak elements in the envelope of the star. In the Milky Way, p-modes are observed in stars that are hotter than or equal to the B3 spectral type, while g-modes are observed at the B2 spectral type and cooler. Aims. We observed a B0IVe star, HD 51452, with the high-precision, high-cadence photometric CoRoT satellite and the high-resolution, ground-based HARPS and SOPHIE spectrographs to study its pulsations in great detail. We also used the lower-resolution spectra available in the BeSS database. Methods. We analyzed the CoRoT and spectroscopic data with several methods: CLEAN-NG, FREQFIND, and a sliding-window method. We also analyzed spectral quantities, such as the violet over red (V/R) emission variations, to obtain information about the variation of the circumstellar environment. We calculated a stellar structure model with the ESTER code to test the various interpretations of the results. Results. We detect 189 frequencies of variation in the CoRoT light curve in the range between 0 and 4.5 c d⁻¹. The main frequencies are also recovered in the spectroscopic data. In particular, we find that HD 51452 undergoes gravito-inertial modes that are not in the domain of those excited by the kappa-mechanism. We propose that these are stochastic modes excited in the convective zones and that at least some of them are a multiplet of r-modes (i.e. subinertial modes mainly driven by the Coriolis acceleration). Stochastically excited gravito-inertial modes had never been observed in any star, and theory predicted that their very low amplitudes would be undetectable even with CoRoT. We suggest that the amplitudes are enhanced in HD 51452 because of the very rapid stellar rotation. In addition, we find that the amplitude variations of these modes are related to the occurrence of minor outbursts. Conclusions. Thanks to CoRoT data, we have detected a new kind of pulsation in HD 51452: stochastically excited gravito-inertial modes, probably due to its very rapid rotation. These modes are probably also present in other rapidly rotating hot Be stars.
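The named frequency-analysis codes (CLEAN-NG, FREQFIND) are specialized tools, but the basic step they perform — finding periodicities in an unevenly sampled light curve — can be sketched with a standard Lomb-Scargle periodogram from astropy; the toy signal and frequencies below are invented:

```python
import numpy as np
from astropy.timeseries import LombScargle

# Toy light curve: two sinusoids plus noise, unevenly sampled in time
rng = np.random.default_rng(8)
t = np.sort(rng.uniform(0, 60, 4000))            # days
flux = (0.003 * np.sin(2 * np.pi * 1.23 * t)
        + 0.002 * np.sin(2 * np.pi * 3.7 * t)
        + 0.001 * rng.standard_normal(t.size))

freq = np.linspace(0.01, 4.5, 20_000)            # search 0-4.5 cycles per day
power = LombScargle(t, flux).power(freq)
print("strongest frequency: %.2f c/d" % freq[np.argmax(power)])
```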