931 results for large data sets
Abstract:
The current model of the Brazilian electric sector provides equal conditions to all agents and reduces the role of the State in the sector. This model forces the sector's companies to continually improve the quality of their product and, as a requirement for this goal, to make more effective use of the enormous amount of operational data stored in databases, generated by the operation of their electric systems, which have hydroelectric plants (Usinas Hidrelétricas, UHE) as their main source of power generation. One of the main tools for managing these plants is Supervisory Control and Data Acquisition (SCADA) systems. Thus, the immense amount of data accumulated in databases by SCADA systems, very likely containing relevant information, must be processed to discover relationships and patterns, and thereby help in understanding many important operational aspects and in evaluating the performance of electric power systems. Knowledge Discovery in Databases (KDD) is the process of identifying, in large data sets, patterns that are valid, novel, useful, and understandable, in order to improve the understanding of a problem or a decision-making procedure. Data Mining is the step within KDD that extracts useful information from large databases. In this scenario, the present work proposes to carry out data-mining experiments on the data generated by SCADA systems at hydroelectric plants, in order to produce relevant information that supports the planning, operation, maintenance, and safety of these plants and helps establish a culture of data mining applied to them.
Abstract:
Using the equivalent-layer technique to interpolate potential-field data makes it possible to take into account that the anomaly, gravimetric or magnetic, being interpolated is a harmonic function. However, the technique's computational application is restricted to surveys with a small number of data, since it requires solving a least-squares problem whose order equals that number. To make the equivalent-layer technique feasible for surveys with a large number of data, we developed the concept of equivalent observations and the EGTG method, which, respectively, reduce the computer memory demand and optimize the evaluation of the inner products inherent to solving the least-squares problems. Basically, the concept of equivalent observations consists of selecting some observations, among all the original observations, such that the least-squares fit that adjusts the selected observations automatically fits (within a pre-established tolerance criterion) all the remaining ones that were not chosen. The selected observations are called equivalent observations and the remaining ones are called redundant observations. This corresponds to partitioning the original linear system into two linear systems of smaller order: the first containing only the equivalent observations and the second only the redundant observations, such that the least-squares solution obtained from the first linear system is also the solution of the second. This procedure makes it possible to fit all the sampled data using only the equivalent observations (rather than all the original observations), which reduces the number of operations and the computer memory usage.
The EGTG method consists, first, of identifying the inner product as a discrete integration of a known analytic integral and, second, of replacing the discrete integration with the evaluation of the analytic integral. This method should be applied whenever evaluating the analytic integral requires fewer computations than evaluating the discrete integral. To determine the equivalent observations, we developed two iterative algorithms named DOE and DOEg. The first algorithm identifies the equivalent observations of the linear system as a whole, while the second identifies them in disjoint subsystems of the original linear system. Each iteration of the DOEg algorithm consists of one application of the DOE algorithm to a partition of the original linear system. In interpolation, the DOE algorithm provides an interpolating surface that fits all the data, allowing global interpolation. The DOEg algorithm, on the other hand, optimizes local interpolation, since it employs only the equivalent observations, in contrast to existing local-interpolation algorithms, which employ all the observations. The interpolation methods based on the equivalent-layer technique and on the minimum-curvature method were compared with respect to their ability to recover the true anomaly values during interpolation. The tests used synthetic data (produced by prismatic source models) from which the interpolated values on a regular grid were obtained. These interpolated values were compared with the theoretical values, computed from the source model on the same grid, allowing the efficiency of each interpolation method in recovering the true anomaly values to be assessed. In all tests performed, the equivalent-layer method recovered the true anomaly value more faithfully than the minimum-curvature method.
Particularly in undersampling situations, the minimum-curvature method proved unable to recover the true anomaly value at the places where the anomaly exhibited more pronounced curvature. For data acquired at different heights the minimum-curvature method showed its worst performance, unlike the equivalent-layer method, which performed interpolation and leveling simultaneously. Using the DOE algorithm, it was possible to apply the equivalent-layer technique to the (global) interpolation of the 3137 free-air anomaly data from part of the Equant-2 marine survey and the 4941 total-field magnetic anomaly data from part of the Carauari-Norte aeromagnetic survey. The numbers of equivalent observations identified in each case were 294 and 299, respectively. Using the DOEg algorithm, we optimized the (local) interpolation of all the data from both surveys. None of these interpolations would have been possible without the concept of equivalent observations. The ratio between the CPU time (running the programs in the same memory space) spent by the minimum-curvature method and by the equivalent layer (global interpolation) was 1:31. For local interpolation this ratio was practically 1:1.
Abstract:
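The equivalent-observations selection described in this abstract can be illustrated with a toy model. The Python sketch below swaps the equivalent layer for a simple straight-line least-squares fit, so the function names, the greedy growth strategy, and the tolerance are illustrative only, not the authors' DOE implementation:

```python
# Greedy sketch of the equivalent-observations idea: fit a model to a
# small subset of observations and grow the subset until the fit also
# explains every remaining (redundant) observation within a tolerance.
# A straight line y = a + b*x stands in for the equivalent layer here.

def fit_line(pts):
    """Closed-form least squares for y = a + b*x."""
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def select_equivalent(obs, tol):
    """Return the indices of the selected 'equivalent' observations."""
    selected = [0, len(obs) - 1]           # start from the two endpoints
    while True:
        a, b = fit_line([obs[i] for i in selected])
        # find the worst-fit observation among those not yet selected
        worst, err = None, tol
        for i, (x, y) in enumerate(obs):
            if i in selected:
                continue
            e = abs(y - (a + b * x))
            if e > err:
                worst, err = i, e
        if worst is None:                  # all redundant obs fit within tol
            return sorted(selected)
        selected.append(worst)

# 100 samples of y = 2 + 0.5*x; only a handful end up "equivalent"
data = [(x, 2.0 + 0.5 * x) for x in range(100)]
eq = select_equivalent(data, tol=1e-6)
print(len(eq), "equivalent observations out of", len(data))
```

The pay-off mirrors the abstract: the final least-squares fit uses only the selected subset, yet it reproduces every redundant observation within the tolerance.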
Graduate Program in Genetics and Animal Breeding - FCAV
Abstract:
It is thought that speciation in phytophagous insects is often due to colonization of novel host plants, because radiations of plant and insect lineages are typically asynchronous. Recent phylogenetic comparisons have supported this model of diversification for both insect herbivores and specialized pollinators. An exceptional case where contemporaneous plant-insect diversification might be expected is the obligate mutualism between fig trees (Ficus species, Moraceae) and their pollinating wasps (Agaonidae, Hymenoptera). The ubiquity and ecological significance of this mutualism in tropical and subtropical ecosystems has long intrigued biologists, but the systematic challenge posed by >750 interacting species pairs has hindered progress toward understanding its evolutionary history. In particular, taxon sampling and analytical tools have been insufficient for large-scale cophylogenetic analyses. Here, we sampled nearly 200 interacting pairs of fig and wasp species from across the globe. Two supermatrices were assembled: on average, wasps had sequences from 77% of 6 genes (5.6 kb), figs had sequences from 60% of 5 genes (5.5 kb), and overall 850 new DNA sequences were generated for this study. We also developed a new analytical tool, Jane 2, for event-based phylogenetic reconciliation analysis of very large data sets. Separate Bayesian phylogenetic analyses for figs and fig wasps under relaxed molecular clock assumptions indicate Cretaceous diversification of crown groups and contemporaneous divergence for nearly half of all fig and pollinator lineages. Event-based cophylogenetic analyses further support the codiversification hypothesis. Biogeographic analyses indicate that the present-day distribution of fig and pollinator lineages is consistent with a Eurasian origin and subsequent dispersal, rather than with Gondwanan vicariance.
Overall, our findings indicate that the fig-pollinator mutualism represents an extreme case among plant-insect interactions of coordinated dispersal and long-term codiversification.
Abstract:
This paper describes informatics for cross-sample analysis with comprehensive two-dimensional gas chromatography (GCxGC) and high-resolution mass spectrometry (HRMS). GCxGC-HRMS analysis produces large data sets that are rich with information, but highly complex. The size of the data and volume of information requires automated processing for comprehensive cross-sample analysis, but the complexity poses a challenge for developing robust methods. The approach developed here analyzes GCxGC-HRMS data from multiple samples to extract a feature template that comprehensively captures the pattern of peaks detected in the retention-times plane. Then, for each sample chromatogram, the template is geometrically transformed to align with the detected peak pattern and generate a set of feature measurements for cross-sample analyses such as sample classification and biomarker discovery. The approach avoids the intractable problem of comprehensive peak matching by using a few reliable peaks for alignment and peak-based retention-plane windows to define comprehensive features that can be reliably matched for cross-sample analysis. The informatics are demonstrated with a set of 18 samples from breast-cancer tumors, each from different individuals, six each for Grades 1-3. The features allow classification that matches grading by a cancer pathologist with 78% success in leave-one-out cross-validation experiments. The HRMS signatures of the features of interest can be examined for determining elemental compositions and identifying compounds.
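The template-alignment step above lends itself to a toy illustration. The sketch below is a hypothetical one-dimensional reduction (real GCxGC alignment works in the two-dimensional retention plane and uses geometric transforms richer than a constant shift); all peak values and function names are made up:

```python
# Sketch of template-based alignment in one retention dimension:
# a few reliable peaks give a global shift estimate; comprehensive
# features are then matched through retention-time windows rather
# than by intractable peak-by-peak matching.

def estimate_shift(template_peaks, sample_peaks):
    """Average offset between matched reliable peaks."""
    return sum(s - t for t, s in zip(template_peaks, sample_peaks)) / len(template_peaks)

def match_features(windows, peaks, shift):
    """Assign each detected peak to its aligned feature window."""
    assignments = {}
    for lo, hi in windows:
        lo, hi = lo + shift, hi + shift    # transform window into the sample frame
        assignments[(lo, hi)] = [p for p in peaks if lo <= p <= hi]
    return assignments

# Reliable anchor peaks drift by +0.5 s in this (made-up) sample
template_anchors = [10.0, 50.0, 90.0]
sample_anchors = [10.5, 50.5, 90.5]
shift = estimate_shift(template_anchors, sample_anchors)

windows = [(19.0, 21.0), (39.0, 41.0)]     # feature windows in template frame
peaks = [20.6, 40.4, 70.0]                 # detected peaks in sample frame
print(match_features(windows, peaks, shift))
```

Only the anchors need reliable matching; the windows then carry all other features across samples, which is the point of the approach.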
Abstract:
Deep tissue imaging has become state of the art in biology, but now the problem is to quantify spatial information in a global, organ-wide context. Although access to the raw data is no longer a limitation, the computational tools to extract biologically useful information out of these large data sets are still catching up. In many cases, to understand the mechanism behind a biological process, where molecules or cells interact with each other, it is mandatory to know their mutual positions. We illustrate this principle here with the immune system. Although the general functions of lymph nodes as immune sentinels are well described, many cellular and molecular details governing the interactions of lymphocytes and dendritic cells remain unclear to date and prevent an in-depth mechanistic understanding of the immune system. We imaged ex vivo lymph nodes isolated from both wild-type and transgenic mice lacking key factors for dendritic cell positioning and used software written in MATLAB to determine the spatial distances between the dendritic cells and the internal high endothelial vascular network. This allowed us to quantify the spatial localization of the dendritic cells in the lymph node, which is a critical parameter determining the effectiveness of an adaptive immune response.
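The distance measurement described above (done in MATLAB by the authors) reduces to a nearest-neighbour query; a minimal Python sketch with made-up 2-D coordinates:

```python
# Minimal sketch of the distance measurement described above: for each
# dendritic cell, find the distance to the closest sampled point of the
# high endothelial vessel network (both reduced to made-up 2-D points).

import math

def nearest_network_distance(cell, network_points):
    """Euclidean distance from one cell to the closest network point."""
    return min(math.dist(cell, p) for p in network_points)

# Hypothetical coordinates (in micrometres)
vessel_points = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
cells = [(5.0, 3.0), (18.0, 4.0)]

distances = [nearest_network_distance(c, vessel_points) for c in cells]
print(distances)
```

In practice the vessel network would be densely sampled (or represented as a segmented surface) and a spatial index would replace the brute-force minimum, but the quantity computed is the same.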
Abstract:
In this research the supportive role of the family in coping with everyday problems was studied using two large data sets. The results show the importance of the structural aspect of social support. Mapping individual preferences to support referents showed the crucial role of spouse and parents in solving everyday problems. The individual choices of particular support referents could be fairly accurately predicted from knowledge of the composition of the family, in both categorical regression and logit models. By contrast, a wide range of socioeconomic, social, and demographic indicators predicted the criterion variable far less accurately. Residence in small cities and indicators of extreme occupational strata were particularly predictive of the choice of support referent. The supportive role of the family was also traced in the personal projects of young adults, which were seen as ecological, natural, and dynamic middle-level units of analysis of personality. Different aspects of personal projects, including reliance on social support referents, turned out to be highly interrelated. On the one hand, expectations of support were determined by the content of the project; on the other, expected social support also influenced the content of the project. Sivuha sees this as one of the ways others can enter self-structures.
Abstract:
This article explores societal culture as an antecedent of public service motivation. Culture can be a major factor in developing an institution-based theory of public service motivation. In the field of organization theory, culture is considered a fundamental factor for explaining organization behavior. But our review of the literature reveals that culture has not been fully integrated into public service motivation theory or carefully investigated in this research stream. This study starts to fill this gap in the literature by using institutionalism and social-identity theory to predict how the sub-national Germanic and Latin cultures of Switzerland, which are measured through the mother tongues of public employees and the regional locations of public offices, affect their levels of public service motivation. Our analysis centers on two large data sets of federal and municipal employees, and produces evidence that culture has a consistent impact on public service motivation. The results show that Swiss German public employees have a significantly higher level of public service motivation on the whole, while Swiss French public employees have a significantly lower level overall. Implications for theory development and future research are discussed.
Abstract:
While ecological monitoring and biodiversity assessment programs are widely implemented and relatively well developed to survey and monitor the structure and dynamics of populations and communities in many ecosystems, quantitative assessment and monitoring of genetic and phenotypic diversity that is important to understand evolutionary dynamics is only rarely integrated. As a consequence, monitoring programs often fail to detect changes in these key components of biodiversity until after major loss of diversity has occurred. The extensive efforts in ecological monitoring have generated large data sets of unique value to macro-scale and long-term ecological research, but the insights gained from such data sets could be multiplied by the inclusion of evolutionary biological approaches. We argue that the lack of process-based evolutionary thinking in ecological monitoring means a significant loss of opportunity for research and conservation. Assessment of genetic and phenotypic variation within and between species needs to be fully integrated to safeguard biodiversity and the ecological and evolutionary dynamics in natural ecosystems. We illustrate our case with examples from fishes and conclude with examples of ongoing monitoring programs and provide suggestions on how to improve future quantitative diversity surveys.
Abstract:
The new computing paradigm known as cognitive computing attempts to imitate the human capabilities of learning, problem solving, and considering things in context. To do so, an application (a cognitive system) must learn from its environment (e.g., by interacting with various interfaces). These interfaces can run the gamut from sensors to humans to databases. Accessing data through such interfaces allows the system to conduct cognitive tasks that can support humans in decision-making or problem-solving processes. Cognitive systems can be integrated into various domains (e.g., medicine or insurance). For example, a cognitive system in cities can collect data, can learn from various data sources and can then attempt to connect these sources to provide real time optimizations of subsystems within the city (e.g., the transportation system). In this study, we provide a methodology for integrating a cognitive system that allows data to be verbalized, making the causalities and hypotheses generated from the cognitive system more understandable to humans. We abstract a city subsystem—passenger flow for a taxi company—by applying fuzzy cognitive maps (FCMs). FCMs can be used as a mathematical tool for modeling complex systems built by directed graphs with concepts (e.g., policies, events, and/or domains) as nodes and causalities as edges. As a verbalization technique we introduce the restriction-centered theory of reasoning (RCT). RCT addresses the imprecision inherent in language by introducing restrictions. Using this underlying combinatorial design, our approach can handle large data sets from complex systems and make the output understandable to humans.
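As a rough illustration of how an FCM iterates, here is a minimal Python sketch; the three-concept taxi model, its weights, and the update rule details are invented for illustration and are not taken from the study:

```python
# Minimal fuzzy cognitive map (FCM) sketch: concepts are nodes with
# activation values in [0, 1]; signed edge weights encode causalities;
# the state is iterated through a sigmoid squashing function.

import math

def sigmoid(x, steepness=1.0):
    return 1.0 / (1.0 + math.exp(-steepness * x))

def fcm_step(state, weights):
    """One synchronous FCM update: A(t+1) = f(A(t) + W^T A(t))."""
    n = len(state)
    return [sigmoid(state[j] + sum(weights[i][j] * state[i] for i in range(n)))
            for j in range(n)]

# Toy taxi-flow model (weights are illustrative, not from the paper):
# concept 0 = rain, 1 = taxi demand, 2 = average waiting time
W = [[0.0, 0.7, 0.0],    # rain increases demand
     [0.0, 0.0, 0.6],    # demand increases waiting time
     [0.0, 0.0, 0.0]]

state = [1.0, 0.5, 0.2]
for _ in range(10):              # iterate toward a fixed point
    state = fcm_step(state, W)
print([round(a, 3) for a in state])
```

The converged activations are what a verbalization layer such as RCT would then translate into restricted natural-language statements ("if rain is high, waiting time is moderately high").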
Abstract:
Intransitive competition networks, those in which there is no single best competitor, may ensure species coexistence. However, their frequency and importance in maintaining diversity in real-world ecosystems remain unclear. We used two large data sets from drylands and agricultural grasslands to assess: (1) the generality of intransitive competition, (2) intransitivity–richness relationships and (3) effects of two major drivers of biodiversity loss (aridity and land-use intensification) on intransitivity and species richness. Intransitive competition occurred in > 65% of sites and was associated with higher species richness. Intransitivity increased with aridity, partly buffering its negative effects on diversity, but was decreased by intensive land use, enhancing its negative effects on diversity. These contrasting responses likely arise because intransitivity is promoted by temporal heterogeneity, which is enhanced by aridity but may decline with land-use intensity. We show that intransitivity is widespread in nature and increases diversity, but it can be lost with environmental homogenisation.
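Intransitivity of a competition network can be checked directly on a pairwise dominance matrix. A minimal sketch follows; the species and outcomes are hypothetical, and real analyses estimate dominance probabilistically rather than from a 0/1 matrix:

```python
# Minimal sketch of detecting intransitive (rock-paper-scissors) loops
# in a pairwise competition matrix: dominance[i][j] = 1 if species i
# outcompetes species j. A 3-cycle i > j > k > i makes the network
# intransitive, i.e. there is no single best competitor.

from itertools import permutations

def is_intransitive(dominance):
    """True if some triple of species forms a competitive loop."""
    n = len(dominance)
    return any(dominance[i][j] and dominance[j][k] and dominance[k][i]
               for i, j, k in permutations(range(n), 3))

# Hypothetical 3-species loop: A beats B, B beats C, C beats A
loop = [[0, 1, 0],
        [0, 0, 1],
        [1, 0, 0]]
# Strict hierarchy: A beats B and C, B beats C
hierarchy = [[0, 1, 1],
             [0, 0, 1],
             [0, 0, 0]]
print(is_intransitive(loop), is_intransitive(hierarchy))
```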
Abstract:
Currently several thousands of objects are being tracked in the MEO and GEO regions through optical means. The problem faced in this framework is that of Multiple Target Tracking (MTT). In this context, both the correct associations among the observations and the orbits of the objects have to be determined. The complexity of the MTT problem is defined by its dimension S. The number S corresponds to the number of fences involved in the problem. Each fence consists of a set of observations where each observation belongs to a different object. The S ≥ 3 MTT problem is an NP-hard combinatorial optimization problem. There are two general ways to solve this. One way is to seek the optimum solution, which can be achieved by applying a branch-and-bound algorithm. When using these algorithms the problem has to be greatly simplified to keep the computational cost at a reasonable level. Another option is to approximate the solution by using meta-heuristic methods. These methods aim to efficiently explore the different possible combinations so that a reasonable result can be obtained with a reasonable computational effort. To this end several population-based meta-heuristic methods are implemented and tested on simulated optical measurements. With the advent of improved sensors and a heightened interest in the problem of space debris, it is expected that the number of tracked objects will grow by an order of magnitude in the near future. This research aims to provide a method that can treat the correlation and orbit determination problems simultaneously, and is able to efficiently process large data sets with minimal manual intervention.
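To make the combinatorial structure concrete, the following sketch brute-forces a tiny S = 3 association problem; the cost tensor is invented, standing in for orbit-fit residuals. The (n!)² enumeration illustrates why exact search becomes infeasible and meta-heuristics are needed at realistic sizes:

```python
# Brute-force sketch of the S = 3 association problem: one observation
# per object in each of three fences; find the assignment of triples
# that minimizes a total association cost. Costs here are made-up
# residuals, standing in for orbit-determination fit errors.

from itertools import permutations

def best_association(cost):
    """cost[i][j][k]: cost of linking obs i (fence 1), j (fence 2),
    k (fence 3). Returns (total_cost, list of (i, j, k) triples)."""
    n = len(cost)
    best = (float("inf"), None)
    # fence 1 order fixed w.l.o.g.; enumerate permutations of fences 2 and 3
    for perm2 in permutations(range(n)):
        for perm3 in permutations(range(n)):
            triples = [(i, perm2[i], perm3[i]) for i in range(n)]
            total = sum(cost[i][j][k] for i, j, k in triples)
            if total < best[0]:
                best = (total, triples)
    return best

# Two objects, hand-made costs with an obvious optimum on the diagonal
cost = [[[0.1, 9.0], [9.0, 9.0]],
        [[9.0, 9.0], [9.0, 0.2]]]
total, triples = best_association(cost)
print(total, triples)
```

Even this exact search over two fences of permutations scales as (n!)², so for thousands of tracked objects only approximate, e.g. population-based, methods remain practical.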
Abstract:
Currently several thousands of objects are being tracked in the MEO and GEO regions through optical means. The problem faced in this framework is that of Multiple Target Tracking (MTT). In this context, both the correct associations among the observations and the orbits of the objects have to be determined. The complexity of the MTT problem is defined by its dimension S, where S stands for the number of 'fences' used in the problem; each fence consists of a set of observations that all originate from different targets. For a dimension of S ≥ 3 the MTT problem becomes NP-hard. As of now no algorithm exists that can solve an NP-hard problem in an optimal manner within a reasonable (polynomial) computation time. However, there are algorithms that can approximate the solution with a realistic computational effort. To this end an Elitist Genetic Algorithm is implemented to approximately solve the S ≥ 3 MTT problem in an efficient manner. Its complexity is studied and it is found that an approximate solution can be obtained in polynomial time. With the advent of improved sensors and a heightened interest in the problem of space debris, it is expected that the number of tracked objects will grow by an order of magnitude in the near future. This research aims to provide a method that can treat the correlation and orbit determination problems simultaneously, and is able to efficiently process large data sets with minimal manual intervention.
Abstract:
A number of analyses of large data sets have suggested that the reading achievement gap between African American and White U.S. students is negligible or small at school entry, but widens substantially during the school years because African American students show slower rates of growth in elementary and secondary school. Identifying when and why gaps occur, therefore, is an important research endeavor. In addition, being able to predict which African American children are most likely to fall behind can contribute to efforts to close the achievement gap. This paper analyzes first-grade and third-grade data on African American and White children in Massachusetts who all were identified in first grade as struggling readers and enrolled in Reading Recovery—an individualized intervention. All the children were low-income and attending urban schools. Using Observation Survey data from first grade and MCAS Reading data from third grade, we found that the African American and White students made equal average progress while in first grade, but by the end of third grade showed a large gap in MCAS proficiency rates. We discuss the results in terms of school quality, reading development, dialect issues, testing formats, and the need to provide long-term support to vulnerable learners.
Abstract:
Logistic regression is one of the most important tools in the analysis of epidemiological and clinical data. Such data often contain missing values for one or more variables. Common practice is to eliminate all individuals for whom any information is missing. This deletion approach does not make efficient use of available information and often introduces bias. Two methods were developed to estimate logistic regression coefficients for mixed dichotomous and continuous covariates including partially observed binary covariates. The data were assumed missing at random (MAR). One method (PD) used the predictive distribution as a weight to average the logistic regressions performed on all possible values of the missing observations, and the second method (RS) used a variant of a resampling technique. Seven additional methods were compared with these two approaches in a simulation study: (1) analysis based on only the complete cases; (2) substituting the mean of the observed values for the missing value; (3) an imputation technique based on the proportions of observed data; (4) regressing the partially observed covariates on the remaining continuous covariates; (5) regressing the partially observed covariates on the remaining continuous covariates conditional on the response variable; (6) regressing the partially observed covariates on the remaining continuous covariates and the response variable; and (7) the EM algorithm. Both proposed methods showed smaller standard errors (s.e.) for the coefficient involving the partially observed covariate and for the other coefficients as well. However, both methods, especially PD, are computationally demanding; thus, for the analysis of large data sets with partially observed covariates, further refinement of these approaches is needed.
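The PD idea of averaging over the possible values of a missing binary covariate can be sketched as a weighted record expansion; the constant predictive probability below is a stand-in for illustration, not the method's actual predictive distribution, and the field names are invented:

```python
# Sketch of the predictive-distribution (PD) idea: a record with a
# missing binary covariate is expanded into two weighted pseudo-records,
# one per possible value, weighted by the predictive probability of that
# value. A downstream *weighted* logistic regression would then consume
# these (record, weight) pairs.

def expand_missing(records, p_covariate_one):
    """records: list of dicts with 'y', 'x1' (continuous) and 'x2'
    (binary, possibly None). Returns (record, weight) pairs."""
    expanded = []
    for r in records:
        if r["x2"] is None:
            p = p_covariate_one(r)           # predictive P(x2 = 1 | rest)
            expanded.append((dict(r, x2=1), p))
            expanded.append((dict(r, x2=0), 1.0 - p))
        else:
            expanded.append((r, 1.0))        # fully observed: weight 1
    return expanded

data = [{"y": 1, "x1": 2.3, "x2": 1},
        {"y": 0, "x1": 1.1, "x2": None}]

# Stand-in predictive model: a constant marginal probability
pairs = expand_missing(data, lambda r: 0.4)
print([(p["x2"], w) for p, w in pairs])
```

The expansion doubles only the incomplete records, which hints at why PD becomes computationally demanding when many covariate values are missing across a large data set.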