27 resultados para Data pre-processing
em Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Resumo:
Due to the imprecise nature of biological experiments, biological data is often characterized by the presence of redundant and noisy data. This may be due to errors that occurred during data collection, such as contaminations in laboratorial samples. It is the case of gene expression data, where the equipments and tools currently used frequently produce noisy biological data. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. This evaluation analyzes the effectiveness of the techniques investigated in removing noisy data, measured by the accuracy obtained by different Machine Learning classifiers over the pre-processed data.
Resumo:
Today several different unsupervised classification algorithms are commonly used to cluster similar patterns in a data set based only on its statistical properties. Specially in image data applications, self-organizing methods for unsupervised classification have been successfully applied for clustering pixels or group of pixels in order to perform segmentation tasks. The first important contribution of this paper refers to the development of a self-organizing method for data classification, named Enhanced Independent Component Analysis Mixture Model (EICAMM), which was built by proposing some modifications in the Independent Component Analysis Mixture Model (ICAMM). Such improvements were proposed by considering some of the model limitations as well as by analyzing how it should be improved in order to become more efficient. Moreover, a pre-processing methodology was also proposed, which is based on combining the Sparse Code Shrinkage (SCS) for image denoising and the Sobel edge detector. In the experiments of this work, the EICAMM and other self-organizing models were applied for segmenting images in their original and pre-processed versions. A comparative analysis showed satisfactory and competitive image segmentation results obtained by the proposals presented herein. (C) 2008 Published by Elsevier B.V.
Resumo:
In this paper, we propose a method based on association rule-mining to enhance the diagnosis of medical images (mammograms). It combines low-level features automatically extracted from images and high-level knowledge from specialists to search for patterns. Our method analyzes medical images and automatically generates suggestions of diagnoses employing mining of association rules. The suggestions of diagnosis are used to accelerate the image analysis performed by specialists as well as to provide them an alternative to work on. The proposed method uses two new algorithms, PreSAGe and HiCARe. The PreSAGe algorithm combines, in a single step, feature selection and discretization, and reduces the mining complexity. Experiments performed on PreSAGe show that this algorithm is highly suitable to perform feature selection and discretization in medical images. HiCARe is a new associative classifier. The HiCARe algorithm has an important property that makes it unique: it assigns multiple keywords per image to suggest a diagnosis with high values of accuracy. Our method was applied to real datasets, and the results show high sensitivity (up to 95%) and accuracy (up to 92%), allowing us to claim that the use of association rules is a powerful means to assist in the diagnosing task.
Resumo:
This work proposes a method based on both preprocessing and data mining with the objective of identify harmonic current sources in residential consumers. In addition, this methodology can also be applied to identify linear and nonlinear loads. It should be emphasized that the entire database was obtained through laboratory essays, i.e., real data were acquired from residential loads. Thus, the residential system created in laboratory was fed by a configurable power source and in its output were placed the loads and the power quality analyzers (all measurements were stored in a microcomputer). So, the data were submitted to pre-processing, which was based on attribute selection techniques in order to minimize the complexity in identifying the loads. A newer database was generated maintaining only the attributes selected, thus, Artificial Neural Networks were trained to realized the identification of loads. In order to validate the methodology proposed, the loads were fed both under ideal conditions (without harmonics), but also by harmonic voltages within limits pre-established. These limits are in accordance with IEEE Std. 519-1992 and PRODIST (procedures to delivery energy employed by Brazilian`s utilities). The results obtained seek to validate the methodology proposed and furnish a method that can serve as alternative to conventional methods.
Resumo:
This study presents a solid-like finite element formulation to solve geometric non-linear three-dimensional inhomogeneous frames. To achieve the desired representation, unconstrained vectors are used instead of the classic rigid director triad; as a consequence, the resulting formulation does not use finite rotation schemes. High order curved elements with any cross section are developed using a full three-dimensional constitutive elastic relation. Warping and variable thickness strain modes are introduced to avoid locking. The warping mode is solved numerically in FEM pre-processing computational code, which is coupled to the main program. The extra calculations are relatively small when the number of finite elements. with the same cross section, increases. The warping mode is based on a 2D free torsion (Saint-Venant) problem that considers inhomogeneous material. A scheme that automatically generates shape functions and its derivatives allow the use of any degree of approximation for the developed frame element. General examples are solved to check the objectivity, path independence, locking free behavior, generality and accuracy of the proposed formulation. (C) 2009 Elsevier B.V. All rights reserved.
Resumo:
The amount of textual information digitally stored is growing every day. However, our capability of processing and analyzing that information is not growing at the same pace. To overcome this limitation, it is important to develop semiautomatic processes to extract relevant knowledge from textual information, such as the text mining process. One of the main and most expensive stages of the text mining process is the text pre-processing stage, where the unstructured text should be transformed to structured format such as an attribute-value table. The stemming process, i.e. linguistics normalization, is usually used to find the attributes of this table. However, the stemming process is strongly dependent on the language in which the original textual information is given. Furthermore, for most languages, the stemming algorithms proposed in the literature are computationally expensive. In this work, several improvements of the well know Porter stemming algorithm for the Portuguese language, which explore the characteristics of this language, are proposed. Experimental results show that the proposed algorithm executes in far less time without affecting the quality of the generated stems.
Resumo:
Maltose-binding protein is the periplasmic component of the ABC transporter responsible for the uptake of maltose/maltodextrins. The Xanthomonas axonopodis pv. citri maltose-binding protein MalE has been crystallized at 293 Kusing the hanging-drop vapour-diffusion method. The crystal belonged to the primitive hexagonal space group P6(1)22, with unit-cell parameters a = 123.59, b = 123.59, c = 304.20 angstrom, and contained two molecules in the asymetric unit. It diffracted to 2.24 angstrom resolution.
Resumo:
In eukaryotes, pre-rRNA processing depends on a large number of nonribosomal trans-acting factors that form intriguingly organized complexes. One of the early stages of pre-rRNA processing includes formation of the two intermediate complexes pre-40S and pre-60S, which then form the mature ribosome subunits. Each of these complexes contains specific pre-rRNAs, ribosomal proteins and processing factors. The yeast nucleolar protein Nop53p has previously been identified in the pre-60S complex and shown to affect pre-rRNA processing by directly binding to 5.8S rRNA, and to interact with Nop17p and Nip7p, which are also involved in this process. Here we show that Nop53p binds 5.8S rRNA co-transcriptionally through its N-terminal region, and that this protein portion can also partially complement growth of the conditional mutant strain Delta nop53/GAL:NOP53. Nop53p interacts with Rrp6p and activates the exosome in vitro. These results indicate that Nop53p may recruit the exosome to 7S pre-rRNA for processing. Consistent with this observation and similar to the observed in exosome mutants, depletion of Nop53p leads to accumulation of polyadenylated pre-rRNAs.
Cwc24p, a novel Saccharomyces cerevisiae nuclear ring finger protein, affects pre-snoRNA U3 splicing
Resumo:
U3 snoRNA is transcribed from two intron-containing genes in yeast, snR17A and snR17B. Although the assembly of the U3 snoRNP has not been precisely determined, at least some of the core box C/D proteins are known to bind pre-U3 co-transcriptionally, thereby affecting splicing and 3 `-end processing of this snoRNA. We identified the interaction between the box C/D assembly factor Nop17p and Cwc24p, a novel yeast RING finger protein that had been previously isolated in a complex with the splicing factor Cef1p. Here we show that, consistent with the protein interaction data, Cwc24p localizes to the cell nucleus, and its depletion leads to the accumulation of both U3 pre-snoRNAs. U3 snoRNA is involved in the early cleavages of 35 S pre-rRNA, and the defective splicing of pre-U3 detected in cells depleted of Cwc24p causes the accumulation of the 35 S precursor rRNA. These results led us to the conclusion that Cwc 24p is involved in pre-U3 snoRNA splicing, indirectly affecting pre-rRNA processing.
Resumo:
In eukaryotes, pre-rRNA processing depends on a large number of nonribosomal trans-acting factors that form intriguingly organized complexes. Two intermediate complexes, pre-40S and pre-60S, are formed at the early stages of 35S pre-rRNA processing and give rise to the mature ribosome subunits. Each of these complexes contains specific pre-rRNAs, some ribosomal proteins and processing factors. The novel yeast protein Utp25p has previously been identified in the nucleolus, an indication that this protein could be involved in ribosome biogenesis. Here we show that Utp25p interacts with the SSU processome proteins Sas10p and Mpp10p, and affects 18S rRNA maturation. Depletion of Utp25p leads to accumulation of the pre-rRNA 35S and the aberrant rRNA 23S, and to a severe reduction in 40S ribosomal subunit levels. Our results indicate that Utp25p is a novel SSU processome subunit involved in pre-40S maturation.
Resumo:
The Shwachman-Bodian-Diamond syndrome protein (SBDS) is a member of a highly conserved protein family of not well understood function, with putative orthologues found in different organisms ranging from Archaea, yeast and plants to vertebrate animals. The yeast orthologue of SBDS, Sdo1p, has been previously identified in association with the 60S ribosomal subunit and is proposed to participate in ribosomal recycling. Here we show that Sdo1p interacts with nucleolar rRNA processing factors and ribosomal proteins, indicating that it might bind the pre-60S complex and remain associated with it during processing and transport to the cytoplasm. Corroborating the protein interaction data, Sdo1p localizes to the nucleus and cytoplasm and co-immunoprecipitates precursors of 60S and 40S subunits, as well as the mature rRNAs. Sdo1p binds RNA directly, suggesting that it may associate with the ribosomal subunits also through RNA interaction. Copyright (C) 2009 John Wiley & Sons, Ltd.
Resumo:
Geographic Data Warehouses (GDW) are one of the main technologies used in decision-making processes and spatial analysis, and the literature proposes several conceptual and logical data models for GDW. However, little effort has been focused on studying how spatial data redundancy affects SOLAP (Spatial On-Line Analytical Processing) query performance over GDW. In this paper, we investigate this issue. Firstly, we compare redundant and non-redundant GDW schemas and conclude that redundancy is related to high performance losses. We also analyze the issue of indexing, aiming at improving SOLAP query performance on a redundant GDW. Comparisons of the SB-index approach, the star-join aided by R-tree and the star-join aided by GiST indicate that the SB-index significantly improves the elapsed time in query processing from 25% up to 99% with regard to SOLAP queries defined over the spatial predicates of intersection, enclosure and containment and applied to roll-up and drill-down operations. We also investigate the impact of the increase in data volume on the performance. The increase did not impair the performance of the SB-index, which highly improved the elapsed time in query processing. Performance tests also show that the SB-index is far more compact than the star-join, requiring only a small fraction of at most 0.20% of the volume. Moreover, we propose a specific enhancement of the SB-index to deal with spatial data redundancy. This enhancement improved performance from 80 to 91% for redundant GDW schemas.
Resumo:
This paper describes a new food classification which assigns foodstuffs according to the extent and purpose of the industrial processing applied to them. Three main groups are defined: unprocessed or minimally processed foods (group 1), processed culinary and food industry ingredients (group 2), and ultra-processed food products (group 3). The use of this classification is illustrated by applying it to data collected in the Brazilian Household Budget Survey which was conducted in 2002/2003 through a probabilistic sample of 48,470 Brazilian households. The average daily food availability was 1,792 kcal/person being 42.5% from group 1 (mostly rice and beans and meat and milk), 37.5% from group 2 (mostly vegetable oils, sugar, and flours), and 20% from group 3 (mostly breads, biscuits, sweets, soft drinks, and sausages). The share of group 3 foods increased with income, and represented almost one third of all calories in higher income households. The impact of the replacement of group 1 foods and group 2 ingredients by group 3 products on the overall quality of the diet, eating patterns and health is discussed.
Resumo:
O preá do semiárido nordestino (Galea spixii Wagler, 1831) é um roedor pertencente à família Caviidae. Pouca literatura é encontrada sobre essa espécie em relação a sua morfologia e seu comportamento ambiental e reprodutivo. Com o objetivo de entender a morfologia geral, em foco, na inervação do membro pélvico dessa espécie, neste trabalho, foi explorado o nervo isquiático, o qual é o maior de todos os nervos do organismo. Foram utilizados 10 preás (cinco machos e cinco fêmeas) que vieram a óbito por causas naturais, oriundos do Centro de Multiplicação de Animais Silvestres da Universidade Federal Rural do Semiárido (CEMAS/UFERSA). Os animais foram fixados após o óbito em solução aquosa de formaldeído 10% e, após 48 horas de imersão nessa solução, foram dissecados para exposição do nervo isquiático. Dessa forma, os dados obtidos foram compilados em tabelas e expressos em desenhos esquemáticos e fotografias. Os pares de nervos isquiáticos originaram-se de raízes ventrais de L6L7S1 (70%) e de L7S1S2 (30%) e distribuíram-se pelos músculos glúteo profundo, bíceps femural, semitendinoso e semimembranoso.
Resumo:
Orthodox teaching and practice on nutrition and health almost always focuses on nutrients, or else on foods and drinks. Thus, diets that are high in folate and in green leafy vegetables are recommended, whereas diets high in saturated fat and in full-fat milk and other dairy products are not recommended. Food guides such as the US Food Guide Pyramid are designed to encourage consumption of healthier foods, by which is usually meant those higher in vitamins, minerals and other nutrients seen as desirable.What is generally overlooked in such approaches, which currently dominate official and other authoritative information and education programmes, and also food and nutrition public health policies, is food processing. It is now generally acknowledged that the current pandemic of obesity and related chronic diseases has as one of its important causes increased consumption of convenience including pre-prepared foods(1,2). However, the issue of food processing is largely ignored or minimised in education and information about food, nutrition and health, and also in public health policies.A short commentary cannot be comprehensive, and a general proposal such as that made here is bound to have some problems and exceptions. Also, the social, cultural, economic and environmental consequences of food processing are not discussed here. Readers comments and queries are invited