858 results for Data pre-processing
Abstract:
To date, big data applications have focused on the store-and-process paradigm. In this paper we describe an initiative to deal with big data applications for continuous streams of events. In many emerging applications, the volume of data being streamed is so large that the traditional ‘store-then-process’ paradigm is either not suitable or too inefficient. Moreover, soft real-time requirements might severely limit the engineering solutions. Many scenarios fit this description. In network security for cloud data centres, for instance, very high volumes of IP packets and events from sensors at firewalls, network switches, routers and servers need to be analyzed to detect attacks in minimal time, in order to limit the effect of the malicious activity on the IT infrastructure. Similarly, in the fraud department of a credit card company, payment requests must be processed online, as quickly as possible, to provide meaningful results in real time. An ideal system would detect fraud during the authorization process, which lasts hundreds of milliseconds, and deny the payment authorization, minimizing the damage to the user and the credit card company.
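The contrast with ‘store-then-process’ can be sketched in a few lines of Python. This is a toy illustration, not the paper's system: a stream is scored event by event with only a bounded window of state, never storing the full stream; the window size and threshold factor are arbitrary example values.

```python
from collections import deque

def score_stream(events, window=5, factor=3.0):
    """Flag events whose amount far exceeds the recent moving average.

    Illustrative sketch of processing events as they arrive: only a
    bounded window of state is kept. Window size and threshold factor
    are made-up example parameters, not values from the paper.
    """
    recent = deque(maxlen=window)   # bounded state: last `window` amounts
    for amount in events:
        if recent and amount > factor * (sum(recent) / len(recent)):
            yield amount, True      # suspicious: flag during authorization
        else:
            yield amount, False
        recent.append(amount)       # update state; older events are discarded

# Toy payment amounts: the 300 stands out against the recent average.
flags = list(score_stream([10, 12, 11, 300, 13]))
```

Each decision is made within one pass over the event, which is what makes sub-second authorization-time detection feasible.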
Abstract:
The microarray technique is rather powerful, as it allows testing of up to thousands of genes at a time, but this produces an overwhelming set of data files containing huge amounts of data, which are quite difficult to pre-process, separate, classify and correlate in order to extract interesting conclusions. Modern machine learning, data mining and clustering techniques based on information theory are needed to read and interpret the information content buried in those large data sets. The Independent Component Analysis (ICA) method can be used to correct data affected by corruption processes or to filter out the uncorrectable data, and clustering methods can then group similar genes or classify samples. In this paper a hybrid approach is used to obtain a two-way unsupervised clustering of corrected microarray data.
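As a rough illustration of this kind of pipeline (not the paper's specific hybrid method), the sketch below runs ICA on a toy gene-by-sample matrix and then clusters the genes on the recovered components with scikit-learn; the data, component count, and cluster count are all made up for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for a microarray matrix: 100 genes x 20 samples,
# two gene groups with opposite expression trends plus noise.
group_a = rng.normal(0, 1, (50, 20)) + np.linspace(-2, 2, 20)
group_b = rng.normal(0, 1, (50, 20)) - np.linspace(-2, 2, 20)
X = np.vstack([group_a, group_b])

# Step 1: ICA separates independent expression components, a rough
# analogue of isolating corruption/noise from signal.
ica = FastICA(n_components=5, random_state=0)
sources = ica.fit_transform(X)          # genes x components

# Step 2: cluster genes on the ICA components (one "way" of a two-way
# clustering; samples could be clustered analogously on X.T).
gene_labels = KMeans(n_clusters=2, n_init=10,
                     random_state=0).fit_predict(sources)
```

Clustering samples on the transposed matrix with the same two steps would complete the two-way clustering.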
Abstract:
This paper proposes an optimization relaxation approach based on the analogue Hopfield Neural Network (HNN) for cluster refinement of pre-classified Polarimetric Synthetic Aperture Radar (PolSAR) image data. We consider the initial classification provided by the maximum-likelihood classifier based on the complex Wishart distribution, which is then supplied to the HNN optimization approach. The goal is to improve the classification results obtained by the Wishart approach. The classification improvement is verified by computing a cluster separability coefficient and a measure of homogeneity within the clusters. During the HNN optimization process, for each iteration and for each pixel, two consistency coefficients are computed, taking into account two types of relations between the pixel under consideration and its corresponding neighbors. Based on these coefficients and on the information coming from the pixel itself, the pixel under study is re-classified. Different experiments are carried out to verify that the proposed approach outperforms other strategies, achieving the best results in terms of separability together with a trade-off with homogeneity that preserves relevant structures in the image. The performance is also measured in terms of central processing unit (CPU) time.
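A heavily simplified stand-in for this kind of relaxation refinement can be sketched as follows. It replaces the paper's Hopfield energy and two consistency coefficients with a plain self-plus-neighbour vote, so it only illustrates the general idea of re-classifying each pixel from its own label and its neighbours' labels; the weights and 4-neighbourhood are illustrative assumptions.

```python
import numpy as np

def refine_labels(labels, iterations=3, self_weight=1.0):
    """Iteratively re-classify each pixel from its 4-neighbours' labels.

    Simplified illustration of relaxation labelling: each pixel adopts
    the label with the highest (neighbour + weighted self) vote. Not the
    paper's HNN formulation.
    """
    labels = labels.copy()
    h, w = labels.shape
    n_classes = int(labels.max()) + 1
    for _ in range(iterations):
        new = labels.copy()
        for i in range(h):
            for j in range(w):
                votes = np.zeros(n_classes)
                votes[labels[i, j]] += self_weight      # pixel's own evidence
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        votes[labels[ni, nj]] += 1.0    # neighbour consistency
                new[i, j] = int(votes.argmax())
        labels = new
    return labels

# A noisy 2-class label image: one isolated misclassified pixel
# is smoothed away by the neighbour votes.
img = np.zeros((5, 5), dtype=int)
img[2, 2] = 1
refined = refine_labels(img)
```

The smoothing-versus-structure tension visible even here is the separability/homogeneity trade-off the paper measures.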
Abstract:
Due to advances in information technology in general, and in databases in particular, data storage devices are becoming cheaper and data processing speed is increasing. As a result, organizations tend to store large volumes of data holding great potential information. Decision Support Systems (DSS) try to use the stored data to obtain valuable information for organizations. In this paper, we use both data models and use cases to represent the functionality of data processing in DSS, following Software Engineering processes. We propose a methodology for developing DSS in the analysis phase, with respect to data-processing modeling. As a starting point, we have used a data model adapted to the semantics involved in multidimensional databases, or data warehouses (DW). We have also taken an algorithm that provides all the possible ways to automatically cross-check multidimensional model data. Using the above, we propose diagrams and descriptions of use cases, which can be considered patterns representing the DSS functionality with regard to the processing of the DW data on which the DSS are based. We highlight the reusability and automation benefits that can be achieved, and we believe this study can serve as a guide in the development of DSS.
Abstract:
Due to the relative transparency of its embryos and larvae, the zebrafish is an ideal model organism for bioimaging approaches in vertebrates. Novel microscope technologies allow the imaging of developmental processes in unprecedented detail, and they enable the use of complex image-based read-outs for high-throughput/high-content screening. Such applications can easily generate Terabytes of image data, the handling and analysis of which becomes a major bottleneck in extracting the targeted information. Here, we describe the current state of the art in computational image analysis in the zebrafish system. We discuss the challenges encountered when handling high-content image data, especially with regard to data quality, annotation, and storage. We survey methods for preprocessing image data for further analysis, and describe selected examples of automated image analysis, including the tracking of cells during embryogenesis, heartbeat detection, identification of dead embryos, recognition of tissues and anatomical landmarks, and quantification of behavioral patterns of adult fish. We review recent examples for applications using such methods, such as the comprehensive analysis of cell lineages during early development, the generation of a three-dimensional brain atlas of zebrafish larvae, and high-throughput drug screens based on movement patterns. Finally, we identify future challenges for the zebrafish image analysis community, notably those concerning the compatibility of algorithms and data formats for the assembly of modular analysis pipelines.
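A minimal example of the preprocessing steps mentioned above (smoothing, segmentation, object counting), using SciPy on a synthetic frame. Real zebrafish pipelines are far more elaborate; the threshold and object shapes here are arbitrary illustrative choices.

```python
import numpy as np
from scipy import ndimage

# Toy stand-in for one high-content screening frame: two bright
# "objects" on a dark background (real input would be a microscope image).
frame = np.zeros((20, 20))
frame[2:6, 2:6] = 1.0
frame[12:17, 10:15] = 0.8

# Minimal pipeline of the kind surveyed: denoise, threshold, then
# label connected components and measure their sizes.
smoothed = ndimage.gaussian_filter(frame, sigma=1.0)   # denoise
binary = smoothed > 0.3                                # segment (ad-hoc threshold)
labeled, n_objects = ndimage.label(binary)             # connected components
sizes = ndimage.sum(binary, labeled, range(1, n_objects + 1))
```

Counting and measuring labelled objects per frame is the elementary building block behind read-outs such as dead-embryo identification or cell tracking.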
Abstract:
In recent years, there has been an increase in the amount of real-time data being generated. Sensors attached to things are transforming how we interact with our environment. Extracting meaningful information from these streams of data is essential for some application areas, and it requires processing systems that scale to varying conditions in data sources, complex queries, and system failures. This paper describes ongoing research on the development of a scalable RDF streaming engine.
Abstract:
With the rise of Cloud Computing, data-processing applications have seen a surge in demand, and achieving greater efficiency in data centers has therefore become important. The goal of this work is to obtain tools for analyzing the feasibility and profitability of designing data centers specialized for data processing, with adapted architectures, cooling systems, and so on. Some data-processing applications benefit from software architectures, while for others a hardware architecture may be more efficient. Since software systems such as XPregel already achieve very good results in graph processing, this project develops a hardware architecture in VHDL, implementing Google's PageRank algorithm in a scalable way. This algorithm was chosen because it may be more efficient in a hardware architecture, owing to specific characteristics described below. PageRank orders pages by their relevance on the web using graph theory: each web page is a vertex of a graph, and the links between pages are the edges of that graph. The project first analyzes the state of the art. The implementation in XPregel, a graph-processing system, is assumed to be among the most efficient, so that implementation is studied. However, because XPregel processes graph algorithms in general, it does not take into account certain characteristics of the PageRank algorithm, so its implementation is not optimal. In PageRank, storing every piece of data a given vertex sends is an unnecessary waste of memory, since all the messages a vertex sends are identical to one another and equal to its PageRank.
The VHDL design takes this characteristic of the algorithm into account, avoiding storing identical messages multiple times. PageRank was chosen for implementation in VHDL because current operating-system architectures do not scale adequately, and the aim is to evaluate whether a different architecture yields better results. The design is built from scratch, using the automatically generated ROM IP core from Xilinx (a VHDL development environment). Four types of modules are planned so that processing can be done in parallel. The XPregel structure is simplified in order to exploit the aforementioned particularity of PageRank, which XPregel does not take full advantage of. The code is then written with a scalable structure, since the computation involves millions of web pages. Next, the code is synthesized and tested on an FPGA. The final step is an evaluation of the implementation and of possible improvements in power consumption.
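The property the project exploits, that every message a vertex sends equals its rank divided by its out-degree, can be seen in a small software sketch of power-iteration PageRank (shown here in Python for illustration; the project itself targets VHDL). The graph and parameters are made up for the example.

```python
import numpy as np

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on an adjacency list.

    Note the key property: each vertex contributes one value,
    rank/out-degree, to all its targets, so per-edge messages never
    need to be stored individually.
    """
    n = len(links)
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        new = np.full(n, (1.0 - damping) / n)
        for src, targets in enumerate(links):
            if targets:
                share = rank[src] / len(targets)   # one value per vertex
                for dst in targets:
                    new[dst] += damping * share
            else:
                new += damping * rank[src] / n     # dangling node spreads evenly
        rank = new
    return rank

# 4-page toy web graph: page 0 is linked to by every other page.
links = [[1], [0], [0], [0, 1]]
ranks = pagerank(links)
```

Because `share` is the same for every outgoing edge of a vertex, a hardware pipeline only needs to store one rank value per vertex, which is precisely the memory saving the project targets.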
Abstract:
Nowadays, devices that monitor the health of structures consume a lot of power and need a lot of time to acquire, process, and send information about the structure to the main processing unit. To reduce this time, fast electronic devices are starting to be used to accelerate the processing. In this paper, some hardware algorithms implemented in a programmable logic device are described. The goal of this implementation is to accelerate the processing and to reduce the amount of information that has to be sent. By reaching this goal, the time the processor needs to treat all the information is reduced, and so the power consumption is reduced as well.
Abstract:
The conserved CDC5 family of Myb-related proteins performs an essential function in cell cycle control at G2/M. Although c-Myb and many Myb-related proteins act as transcription factors, herein, we implicate CDC5 proteins in pre-mRNA splicing. Mammalian CDC5 colocalizes with pre-mRNA splicing factors in the nuclei of mammalian cells, associates with core components of the splicing machinery in nuclear extracts, and interacts with the spliceosome throughout the splicing reaction in vitro. Furthermore, genetic depletion of the homolog of CDC5 in Saccharomyces cerevisiae, CEF1, blocks the first step of pre-mRNA processing in vivo. These data provide evidence that eukaryotic cells require CDC5 proteins for pre-mRNA splicing.
Abstract:
Three small nucleolar RNAs (snoRNAs), E1, E2 and E3, have been described that have unique sequences and interact directly with unique segments of pre-rRNA in vivo. In this report, injection of antisense oligodeoxynucleotides into Xenopus laevis oocytes was used to target the specific degradation of these snoRNAs. Specific disruptions of pre-rRNA processing were then observed, which were reversed by injection of the corresponding in vitro-synthesized snoRNA. Degradation of each of these three snoRNAs produced a unique rRNA maturation phenotype. E1 RNA depletion shut down 18S rRNA formation, without overaccumulation of 20S pre-rRNA. After E2 RNA degradation, production of 18S rRNA and 36S pre-rRNA stopped, and 38S pre-rRNA accumulated, without overaccumulation of 20S pre-rRNA. E3 RNA depletion induced the accumulation of 36S pre-rRNA. This suggests that each of these snoRNAs plays a different role in pre-rRNA processing and indicates that E1 and E2 RNAs are essential for 18S rRNA formation. The available data support the proposal that these snoRNAs are at least involved in pre-rRNA processing at the following pre-rRNA cleavage sites: E1 at the 5′ end and E2 at the 3′ end of 18S rRNA, and E3 at or near the 5′ end of 5.8S rRNA.
Abstract:
Previous studies showed that components implicated in pre-rRNA processing, including U3 small nucleolar (sno)RNA, fibrillarin, nucleolin, and proteins B23 and p52, accumulate in perichromosomal regions and in numerous mitotic cytoplasmic particles, termed nucleolus-derived foci (NDF) between early anaphase and late telophase. The latter structures were analyzed for the presence of pre-rRNA by fluorescence in situ hybridization using probes for segments of pre-rRNA with known half-lives. The NDF did not contain the short-lived 5′-external transcribed spacer (ETS) leader segment upstream from the primary processing site in 47S pre-rRNA. However, the NDF contained sequences from the 5′-ETS core, 18S, internal transcribed spacer 1 (ITS1), and 28S segments and also had detectable, but significantly reduced, levels of the 3′-ETS sequence. Northern analyses showed that in mitotic cells, the latter sequences were present predominantly in 45S-46S pre-rRNAs, indicating that high-molecular weight processing intermediates are preserved during mitosis. Two additional essential processing components were also found in the NDF: U8 snoRNA and hPop1 (a protein component of RNase MRP and RNase P). Thus, the NDF appear to be large complexes containing partially processed pre-rRNA associated with processing components in which processing has been significantly suppressed. The NDF may facilitate coordinated assembly of postmitotic nucleoli.
Abstract:
Phosphoinositide signal transduction pathways in nuclei use enzymes that are indistinguishable from their cytosolic analogues. We demonstrate that distinct phosphatidylinositol phosphate kinases (PIPKs), the type I and type II isoforms, are concentrated in nuclei of mammalian cells. The cytosolic and nuclear PIPKs display comparable activities toward the substrates phosphatidylinositol 4-phosphate and phosphatidylinositol 3-phosphate. Indirect immunofluorescence revealed that these kinases were associated with distinct subnuclear domains, identified as “nuclear speckles,” which also contained pre-mRNA processing factors. A pool of nuclear phosphatidylinositol bisphosphate (PIP2), the product of these kinases, was also detected at these same sites by monoclonal antibody staining. The localization of PIPKs and PIP2 to speckles is dynamic in that both PIPKs and PIP2 reorganize along with other speckle components upon inhibition of mRNA transcription. Because PIPKs have roles in the production of most phosphatidylinositol second messengers, these findings demonstrate that phosphatidylinositol signaling pathways are localized at nuclear speckles. Surprisingly, the PIPKs and PIP2 are not associated with invaginations of the nuclear envelope or any nuclear membrane structure. The putative absence of membranes at these sites suggests novel mechanisms for the generation of phosphoinositides within these structures.
Abstract:
We have examined the distribution of RNA transcription and processing factors in the amphibian oocyte nucleus or germinal vesicle. RNA polymerase I (pol I), pol II, and pol III occur in the Cajal bodies (coiled bodies) along with various components required for transcription and processing of the three classes of nuclear transcripts: mRNA, rRNA, and pol III transcripts. Among these components are transcription factor IIF (TFIIF), TFIIS, splicing factors, the U7 small nuclear ribonucleoprotein particle, the stem–loop binding protein, SR proteins, cleavage and polyadenylation factors, small nucleolar RNAs, nucleolar proteins that are probably involved in pre-rRNA processing, and TFIIIA. Earlier studies and data presented here show that several of these components are first targeted to Cajal bodies when injected into the oocyte and only subsequently appear in the chromosomes or nucleoli, where transcription itself occurs. We suggest that pol I, pol II, and pol III transcription and processing components are preassembled in Cajal bodies before transport to the chromosomes and nucleoli. Most components of the pol II transcription and processing pathway that occur in Cajal bodies are also found in the many hundreds of B-snurposomes in the germinal vesicle. Electron microscopic images show that B-snurposomes consist primarily, if not exclusively, of 20- to 30-nm particles, which closely resemble the interchromatin granules described from sections of somatic nuclei. We suggest the name pol II transcriptosome for these particles to emphasize their content of factors involved in synthesis and processing of mRNA transcripts. We present a model in which pol I, pol II, and pol III transcriptosomes are assembled in the Cajal bodies before export to the nucleolus (pol I), to the B-snurposomes and eventually to the chromosomes (pol II), and directly to the chromosomes (pol III). 
The key feature of this model is the preassembly of the transcription and processing machinery into unitary particles. An analogy can be made between ribosomes and transcriptosomes, ribosomes being unitary particles involved in translation and transcriptosomes being unitary particles for transcription and processing of RNA.
Abstract:
We describe the use of singular value decomposition in transforming genome-wide expression data from genes × arrays space to reduced diagonalized “eigengenes” × “eigenarrays” space, where the eigengenes (or eigenarrays) are unique orthonormal superpositions of the genes (or arrays). Normalizing the data by filtering out the eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Sorting the data according to the eigengenes and eigenarrays gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. After normalization and sorting, the significant eigengenes and eigenarrays can be associated with observed genome-wide effects of regulators, or with measured samples, in which these regulators are overactive or underactive, respectively.
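A minimal sketch of this decomposition with NumPy on a synthetic expression matrix: the rows of Vt play the role of eigengenes (patterns across arrays), the columns of U the eigenarrays (patterns across genes), and normalization amounts to truncating to the significant components. The data and the choice to keep a single component are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy expression matrix: 30 genes x 8 arrays, one dominant rank-1
# pattern (a gene group responding across arrays) plus noise.
pattern = np.outer(rng.normal(size=30), np.linspace(-1, 1, 8))
X = pattern + 0.1 * rng.normal(size=(30, 8))

# SVD: rows of Vt are the "eigengenes", columns of U the "eigenarrays".
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Fraction of overall expression captured by each eigengene; filtering
# out noise components means keeping only the significant ones.
fractions = s**2 / np.sum(s**2)
k = 1                                  # keep only the dominant eigengene
X_filtered = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Sorting genes by their projection onto the retained eigengenes (columns of `U`) then groups genes of similar regulation, as described above.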