872 results for heterogeneous data sources
Abstract:
In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due to the several advantages provided by the Cloud such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully.
I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has been traditionally used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud.
The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them onto distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
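To make the neighborhood-centric idea concrete, the following is a minimal single-machine sketch in Python. It is not NSCALE's API: the function names `k_hop_neighborhood` and `induced_edge_count` and the toy graph are assumptions made purely for illustration. It extracts the k-hop neighborhood of a vertex by breadth-first search and then runs a small computation on the whole extracted subgraph, which is the kind of access a program limited to single-vertex state cannot do directly.

```python
from collections import deque


def k_hop_neighborhood(adj, center, k):
    """Return the set of vertices within k hops of `center`, found by BFS."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen


def induced_edge_count(adj, vertices):
    """Count undirected edges whose two endpoints both lie in `vertices`."""
    return sum(
        1
        for u in vertices
        for v in adj.get(u, ())
        if v in vertices and u < v  # count each undirected edge once
    )


# Hypothetical toy graph stored as an undirected adjacency map.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

ego_a = k_hop_neighborhood(adj, "a", 1)      # 1-hop ego network of "a"
ego_a_edges = induced_edge_count(adj, ego_a)  # edges inside that ego network
```

A distributed system would additionally have to decide where each extracted subgraph is placed in memory; this sketch leaves that out entirely.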
Abstract:
Abstract and Summary of Thesis: Background: Individuals with Major Mental Illness (such as schizophrenia and bipolar disorder) experience increased rates of physical health comorbidity compared to the general population. They also experience inequalities in access to certain aspects of healthcare. This ultimately leads to premature mortality. Studies detailing patterns of physical health comorbidity are limited by their definitions of comorbidity, single disease approach to comorbidity and by the study of heterogeneous groups. To date the investigation of possible sources of healthcare inequalities experienced by individuals with Major Mental Illness (MMI) is relatively limited. Moreover studies detailing the extent of premature mortality experienced by individuals with MMI vary both in terms of the measure of premature mortality reported and age of the cohort investigated, limiting their generalisability to the wider population. Therefore local and national data can be used to describe patterns of physical health comorbidity, investigate possible reasons for health inequalities and describe mortality rates. These findings will extend existing work in this area. Aims and Objectives: To review the relevant literature regarding: patterns of physical health comorbidity, evidence for inequalities in physical healthcare and evidence for premature mortality for individuals with MMI. To examine the rates of physical health comorbidity in a large primary care database and to assess for evidence for inequalities in access to healthcare using both routine primary care prescribing data and incentivised national Quality and Outcome Framework (QOF) data. Finally to examine the rates of premature mortality in a local context with a particular focus on cause of death across the lifespan and effect of International Classification of Disease Version 10 (ICD 10) diagnosis and socioeconomic status on rates and cause of death. 
Methods: A narrative review of the literature surrounding patterns of physical health comorbidity, the evidence for inequalities in physical healthcare and premature mortality in MMI was undertaken. Rates of physical health comorbidity and multimorbidity in schizophrenia and bipolar disorder were examined using a large primary care dataset (Scottish Programme for Improving Clinical Effectiveness in Primary Care (SPICE)). Possible inequalities in access to healthcare were investigated by comparing patterns of prescribing in individuals with MMI and comorbid physical health conditions with prescribing rates in individuals with physical health conditions without MMI using SPICE data. Potential inequalities in access to health promotion advice (in the form of smoking cessation) and prescribing of Nicotine Replacement Therapy (NRT) were also investigated using SPICE data. Possible inequalities in access to incentivised primary healthcare were investigated using National Quality and Outcome Framework (QOF) data. Finally a pre-existing case register (Glasgow Psychosis Clinical Information System (PsyCIS)) was linked to Scottish Mortality data (available from the Scottish Government Website) to investigate rates and primary cause of death in individuals with MMI. Rate and primary cause of death were compared to the local population and impact of age, socioeconomic status and ICD 10 diagnosis (schizophrenia vs. bipolar disorder) were investigated. Results: Analysis of the SPICE data found that sixteen out of the thirty two common physical comorbidities assessed, occurred significantly more frequently in individuals with schizophrenia. In individuals with bipolar disorder fourteen occurred more frequently. 
The most prevalent chronic physical health conditions in individuals with schizophrenia and bipolar disorder were: viral hepatitis (Odds Ratios (OR) 3.99 95% Confidence Interval (CI) 2.82-5.64 and OR 5.90 95% CI 3.16-11.03 respectively), constipation (OR 3.24 95% CI 3.01-3.49 and OR 2.84 95% CI 2.47-3.26 respectively) and Parkinson’s disease (OR 3.07 95% CI 2.43-3.89 and OR 2.52 95% CI 1.60-3.97 respectively). Both groups had significantly increased rates of multimorbidity compared to controls: in the schizophrenia group OR for two comorbidities was 1.37 95% CI 1.29-1.45 and in the bipolar disorder group OR was 1.34 95% CI 1.20-1.49. In the studies investigating inequalities in access to healthcare there was evidence of: under-recording of cardiovascular-related conditions for example in individuals with schizophrenia: OR for Atrial Fibrillation (AF) was 0.62 95% CI 0.52 - 0.73, for hypertension 0.71 95% CI 0.67 - 0.76, for Coronary Heart Disease (CHD) 0.76 95% CI 0.69 - 0.83 and for peripheral vascular disease (PVD) 0.83 95% CI 0.72 - 0.97. Similarly in individuals with bipolar disorder OR for AF was 0.56 95% CI 0.41-0.78, for hypertension 0.69 95% CI 0.62 - 0.77 and for CHD 0.77 95% CI 0.66 - 0.91. There was also evidence of less intensive prescribing for individuals with schizophrenia and bipolar disorder who had comorbid hypertension and CHD compared to individuals with hypertension and CHD who did not have schizophrenia or bipolar disorder. Rate of prescribing of statins for individuals with schizophrenia and CHD occurred significantly less frequently than in individuals with CHD without MMI (OR 0.67 95% CI 0.56-0.80). Rates of prescribing of 2 or more anti-hypertensives were lower in individuals with CHD and schizophrenia and CHD and bipolar disorder compared to individuals with CHD without MMI (OR 0.66 95% CI 0.56-0.78 and OR 0.55 95% CI 0.46-0.67, respectively). 
Smoking was more common in individuals with MMI compared to individuals without MMI (OR 2.53 95% CI 2.44-2.63) and was particularly increased in men (OR 2.83 95% CI 2.68-2.98). Rates of ex-smoking and non-smoking were lower in individuals with MMI (OR 0.79 95% CI 0.75-0.83 and OR 0.50 95% CI 0.48-0.52 respectively). However, recorded rates of smoking cessation advice in smokers with MMI were significantly lower than the recorded rates of smoking cessation advice in smokers with diabetes (88.7% vs. 98.0%, p<0.001), smokers with CHD (88.9% vs. 98.7%, p<0.001) and smokers with hypertension (88.3% vs. 98.5%, p<0.001) without MMI. The odds ratio of NRT prescription was also significantly lower in smokers with MMI without diabetes compared to smokers with diabetes without MMI (OR 0.75 95% CI 0.69-0.81). Similar findings were found for smokers with MMI without CHD compared to smokers with CHD without MMI (OR 0.34 95% CI 0.31-0.38) and smokers with MMI without hypertension compared to smokers with hypertension without MMI (OR 0.71 95% CI 0.66-0.76). At a national level, payment and population achievement rates for the recording of body mass index (BMI) in MMI were significantly lower than the payment and population achievement rates for BMI recording in diabetes throughout the whole of the UK combined: payment rate 92.7% (Inter Quartile Range (IQR) 89.3-95.8) vs. 95.5% IQR 93.3-97.2, p<0.001 and population achievement rate 84.0% IQR 76.3-90.0 vs. 92.5% IQR 89.7-94.9, p<0.001 and for each country individually: for example in Scotland payment rate was 94.0% IQR 91.4-97.2 vs. 96.3% IQR 94.3-97.8, p<0.001. Exception rate was significantly higher for the recording of BMI in MMI than the exception rate for BMI recording in diabetes for the UK combined: 7.4% IQR 3.3-15.9 vs. 2.3% IQR 0.9-4.7, p<0.001 and for each country individually. For example in Scotland exception rate in MMI was 11.8% IQR 5.4-19.3 compared to 3.5% IQR 1.9-6.1 in diabetes.
Similar findings were found for Blood Pressure (BP) recording: across the whole of the UK payment and population achievement rates for BP recording in MMI were also significantly reduced compared to payment and population achievement rates for the recording of BP in chronic kidney disease (CKD): payment rate 94.1% IQR 90.9-97.1 vs. 97.8% IQR 96.3-98.9, p<0.001 and population achievement rate 87.0% IQR 81.3-91.7 vs. 97.1% IQR 95.5-98.4, p<0.001. Exception rates again were significantly higher for the recording of BP in MMI compared to CKD (6.4% IQR 3.0-13.1 vs. 0.3% IQR 0.0-1.0, p<0.001). There was also evidence of differences in rates of recording of BMI and BP in MMI across the UK. BMI and BP recording in MMI were significantly lower in Scotland compared to England (BMI: -1.5% 99% CI -2.7 to -0.3%, p<0.001 and BP: -1.8% 99% CI -2.7 to -0.9%, p<0.001). In contrast, rates of BMI and BP recording in diabetes and CKD were similar in Scotland compared to England (BMI: -0.5 99% CI -1.0 to 0.05, p=0.004 and BP: 0.02 99% CI -0.2 to 0.3, p=0.797). Data from the PsyCIS cohort showed an increase in Standardised Mortality Ratios (SMR) across the lifespan for individuals with MMI compared to the local Glasgow and wider Scottish populations (Glasgow SMR 1.8 95% CI 1.6-2.0 and Scotland SMR 2.7 95% CI 2.4-3.1). Increasing socioeconomic deprivation was associated with an increased overall rate of death in MMI (350.3 deaths/10,000 population/5 years in the least deprived quintile compared to 794.6 deaths/10,000 population/5 years in the most deprived quintile). No significant difference in rate of death for individuals with schizophrenia compared with bipolar disorder was reported (6.3% vs. 4.9%, p=0.086), but primary cause of death varied, with higher rates of suicide in individuals with bipolar disorder (22.4% vs. 11.7%, p=0.04).
Discussion: Local and national datasets can be used for epidemiological study to inform local practice and complement existing national and international studies. While the strengths of this thesis include the large data sets used and therefore their likely representativeness to the wider population, some limitations largely associated with using secondary data sources are acknowledged. While this thesis has confirmed evidence of increased physical health comorbidity and multimorbidity in individuals with MMI, it is likely that these findings represent a significant under reporting and likely under recognition of physical health comorbidity in this population. This is likely due to a combination of patient, health professional and healthcare system factors and requires further investigation. Moreover, evidence of inequality in access to healthcare in terms of: physical health promotion (namely smoking cessation advice), recording of physical health indices (BMI and BP), prescribing of medications for the treatment of physical illness and prescribing of NRT has been found at a national level. While significant premature mortality in individuals with MMI within a Scottish setting has been confirmed, more work is required to further detail and investigate the impact of socioeconomic deprivation on cause and rate of death in this population. It is clear that further education and training is required for all healthcare staff to improve the recognition, diagnosis and treatment of physical health problems in this population with the aim of addressing the significant premature mortality that is seen. Conclusions: Future work lies in the challenge of designing strategies to reduce health inequalities and narrow the gap in premature mortality reported in individuals with MMI. 
Models of care that allow a much more integrated approach to diagnosing, monitoring and treating both the physical and mental health of individuals with MMI, particularly in areas of social and economic deprivation may be helpful. Strategies to engage this “hard to reach” population also need to be developed. While greater integration of psychiatric services with primary care and with specialist medical services is clearly vital the evidence on how best to achieve this is limited. While the National Health Service (NHS) is currently undergoing major reform, attention needs to be paid to designing better ways to improve the current disconnect between primary and secondary care. This should then help to improve physical, psychological and social outcomes for individuals with MMI.
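For readers unfamiliar with the statistic reported throughout the results above, the following Python sketch shows how an unadjusted odds ratio and its 95% Wald confidence interval can be computed from a 2x2 table. The counts are invented for illustration and are not taken from the thesis, and the function name `odds_ratio_ci` is hypothetical. The abstract does not state how its odds ratios were estimated, so this should not be read as the thesis's method.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(a=30, b=70, c=15, d=85)
print(f"OR {or_value:.2f} 95% CI {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1 corresponds to a difference that is statistically significant at the matching level, which is how the intervals in the results can be read.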
Abstract:
The decision-making process in university libraries is of great importance; however, it is complicated by the large number of data sources and the large volumes of data to be analyzed. University libraries are accustomed to producing and collecting a great deal of information about their data and services. Common data sources are the output of internal systems, online portals and catalogs, quality assessments and surveys. Unfortunately, these data sources are only partially used for decision making because of the wide variety of formats and standards, as well as the lack of efficient methods and integration tools. This thesis project presents the analysis, design and implementation of a Data Warehouse, an integrated decision-making system for the Centro de Documentación Juan Bautista Vázquez. First, the requirements and the data analysis are presented following a methodology that incorporates key elements influencing a library decision, including process analysis, estimated quality, relevant information and user interaction. Next, the architecture and design of the Data Warehouse are proposed, together with its implementation, which supports data integration, processing and storage. Finally, the stored data are analyzed using analytical processing tools and Bibliomining techniques, helping the administrators of the documentation center make optimal decisions about its resources and services.
Abstract:
Background: High-density tiling arrays and new sequencing technologies are generating rapidly increasing volumes of transcriptome and protein-DNA interaction data. Visualization and exploration of this data is critical to understanding the regulatory logic encoded in the genome by which the cell dynamically affects its physiology and interacts with its environment. Results: The Gaggle Genome Browser is a cross-platform desktop program for interactively visualizing high-throughput data in the context of the genome. Important features include dynamic panning and zooming, keyword search and open interoperability through the Gaggle framework. Users may bookmark locations on the genome with descriptive annotations and share these bookmarks with other users. The program handles large sets of user-generated data using an in-process database and leverages the facilities of SQL and the R environment for importing and manipulating data. A key aspect of the Gaggle Genome Browser is interoperability. By connecting to the Gaggle framework, the genome browser joins a suite of interconnected bioinformatics tools for analysis and visualization with connectivity to major public repositories of sequences, interactions and pathways. To this flexible environment for exploring and combining data, the Gaggle Genome Browser adds the ability to visualize diverse types of data in relation to its coordinates on the genome. Conclusions: Genomic coordinates function as a common key by which disparate biological data types can be related to one another. In the Gaggle Genome Browser, heterogeneous data are joined by their location on the genome to create information-rich visualizations yielding insight into genome organization, transcription and its regulation and, ultimately, a better understanding of the mechanisms that enable the cell to dynamically respond to its environment.
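The idea that genomic coordinates act as a common key can be illustrated with a small Python sketch. This is not code from the Gaggle Genome Browser: the function names, the tuple layouts and the sample records are assumptions made for illustration. Two unrelated data sets, annotated features and measured signals, are joined solely on whether their coordinate ranges overlap on the same sequence.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def join_by_location(features, signals):
    """Attach to each named feature the signal values that overlap it.

    features: iterable of (name, sequence, start, end)
    signals:  iterable of (sequence, start, end, value)
    """
    joined = {}
    for name, seq, start, end in features:
        joined[name] = [
            value
            for s_seq, s_start, s_end, value in signals
            if s_seq == seq and overlaps(start, end, s_start, s_end)
        ]
    return joined


# Hypothetical records: gene annotations and probe intensities.
genes = [("geneA", "chr1", 100, 500), ("geneB", "chr1", 900, 1200)]
probes = [
    ("chr1", 450, 510, 2.5),
    ("chr1", 950, 1000, 0.7),
    ("chr2", 100, 200, 9.9),  # different sequence, so it joins to nothing
]

by_gene = join_by_location(genes, probes)
```

The nested scan is quadratic and only suitable for small inputs; a tool handling large data sets would more plausibly use an indexed store, such as the in-process database mentioned above.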
Abstract:
Objective: To illustrate methodological issues involved in estimating dietary trends in populations using data obtained from various sources in Australia in the 1980s and 1990s. Methods: Estimates of absolute and relative change in consumption of selected food items were calculated using national data published annually on the national food supply for 1982-83 to 1992-93 and responses to food frequency questions in two population based risk factor surveys in 1983 and 1994 in the Hunter Region of New South Wales, Australia. The validity of estimated food quantities obtained from these inexpensive sources at the beginning of the period was assessed by comparison with data from a national dietary survey conducted in 1983 using 24 h recall. Results: Trend estimates from the food supply data and risk factor survey data were in good agreement for increases in consumption of fresh fruit, vegetables and breakfast food and decreases in butter, margarine, sugar and alcohol. Estimates for trends in milk, eggs and bread consumption, however, were inconsistent. Conclusions: Both data sources can be used for monitoring progress towards national nutrition goals based on selected food items provided that some limitations are recognized. While data collection methods should be consistent over time they also need to allow for changes in the food supply (for example the introduction of new varieties such as low-fat dairy products). From time to time the trends derived from these inexpensive data sources should be compared with data derived from more detailed and quantitative estimates of dietary intake.
Abstract:
There are two main types of data sources of income distributions in China: household survey data and grouped data. Household survey data are typically available for isolated years and individual provinces. In comparison, aggregate or grouped data are typically available more frequently and usually have national coverage. In principle, grouped data allow investigation of the change of inequality over longer, continuous periods of time, and the identification of patterns of inequality across broader regions. Nevertheless, a major limitation of grouped data is that only mean (average) income and income shares of quintile or decile groups of the population are reported. Directly using grouped data reported in this format is equivalent to assuming that all individuals in a quintile or decile group have the same income. This potentially distorts the estimate of inequality within each region. The aim of this paper is to apply an improved econometric method designed to use grouped data to study income inequality in China. A generalized beta distribution is employed to model income inequality in China at various levels and periods of time. The generalized beta distribution is more general and flexible than the lognormal distribution that has been used in past research, and also relaxes the assumption of a uniform distribution of income within quintile and decile groups of populations. The paper studies the nature and extent of inequality in rural and urban China over the period 1978 to 2002. Income inequality in the whole of China is then modeled using a mixture of province-specific distributions. The estimated results are used to study the trends in national inequality, and to discuss the empirical findings in the light of economic reforms, regional policies, and globalization of the Chinese economy.
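The limitation described above, that using grouped data directly treats everyone in a quintile or decile as having the same income, can be seen in a short Python sketch. The income shares below are invented for illustration and the function name `grouped_gini` is hypothetical. The function computes the Gini coefficient of the piecewise-linear Lorenz curve implied by equal-sized groups; because it ignores all inequality within groups, the result is a lower bound on the true coefficient. The sketch does not implement the paper's generalized beta approach, which is what relaxes this assumption.

```python
def grouped_gini(shares):
    """Gini coefficient from income shares of equal-sized population groups.

    Assumes every individual within a group has the group's mean income,
    so the Lorenz curve is piecewise linear between group boundaries.
    """
    total = sum(shares)
    n = len(shares)
    cumulative = 0.0
    area = 0.0  # area under the Lorenz curve, summed trapezoid by trapezoid
    for share in sorted(shares):  # the Lorenz curve orders groups poorest first
        previous = cumulative
        cumulative += share / total
        area += (previous + cumulative) / (2 * n)
    return 1 - 2 * area


# Hypothetical quintile income shares (poorest to richest).
quintile_shares = [0.05, 0.10, 0.15, 0.25, 0.45]
gini_lower_bound = grouped_gini(quintile_shares)
```

With perfectly equal shares the function returns zero, and it rises as income concentrates in the top groups.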
Abstract:
Background: With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide reproducible and comparable results needed for a global scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available, but this wealth of data remains underused and is frequently poorly annotated since no user-friendly tool exists to analyze and explore it. Results: PHYLOViZ is platform-independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole genome sequence approaches, and associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available. Conclusions: PHYLOViZ is user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net.
Abstract:
This dissertation addresses the construction of a data warehouse for AdClick, a company operating in digital marketing. Digital marketing is a type of marketing that uses digital communication media for the same purpose as traditional methods, namely promoting goods, businesses and services and acquiring new customers. Several digital marketing strategies exist for achieving these goals, most notably organic traffic and paid traffic. Organic traffic is characterized by marketing actions that involve no costs inherent to promotion and/or the acquisition of potential customers. Paid traffic, in turn, requires investment in campaigns capable of driving and attracting new customers. The dissertation begins with a review of the state of the art in business intelligence and data warehousing and presents their main advantages for companies. Business intelligence systems are necessary because companies today hold large volumes of information-rich data that can only be properly exploited by making use of the capabilities of these systems. Accordingly, the first step in developing a business intelligence system is to concentrate all the data in a single integrated system capable of supporting decision making. This is where the construction of the data warehouse comes in, as the single system best suited to this type of requirement. In this dissertation, the data sources that will feed the data warehouse were surveyed and the company's existing business processes were put into context. The construction of the data warehouse then began, with the creation of the dimensions and fact tables, the definition of the processes for extracting and loading data into the data warehouse, and the creation of the various views.
Regarding the impact of this dissertation, the partner company gains several business advantages from the implementation of the data warehouse and of the ETL processes for loading all of its information sources. These advantages include the centralization of information, greater flexibility for managers in how they access information, and the processing of data so that information can be extracted from them.
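The building blocks named in this abstract (a dimension table, a fact table, a load step and a view) can be sketched with Python's built-in `sqlite3` module. Every table, column and campaign name below is hypothetical and chosen only to echo the organic versus paid traffic distinction; nothing here reflects the actual AdClick schema or ETL tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: one row per campaign, with descriptive attributes.
    CREATE TABLE dim_campaign (
        campaign_id  INTEGER PRIMARY KEY,
        name         TEXT UNIQUE,
        traffic_type TEXT
    );
    -- Fact: one row per campaign per day, holding the measures.
    CREATE TABLE fact_clicks (
        campaign_id INTEGER REFERENCES dim_campaign(campaign_id),
        day         TEXT,
        clicks      INTEGER,
        cost        REAL
    );
    -- View: a reporting query that managers can read directly.
    CREATE VIEW v_cost_by_traffic AS
        SELECT d.traffic_type, SUM(f.clicks) AS clicks, SUM(f.cost) AS cost
        FROM fact_clicks f JOIN dim_campaign d USING (campaign_id)
        GROUP BY d.traffic_type;
    """
)

# Hypothetical rows as they might arrive from a source system.
raw_rows = [
    ("spring_sale", "paid", "2024-01-01", 120, 30.0),
    ("blog_posts", "organic", "2024-01-01", 80, 0.0),
    ("spring_sale", "paid", "2024-01-02", 100, 25.0),
]

# Load step: upsert the dimension row, look up its key, then insert the fact.
for name, traffic_type, day, clicks, cost in raw_rows:
    conn.execute(
        "INSERT OR IGNORE INTO dim_campaign (name, traffic_type) VALUES (?, ?)",
        (name, traffic_type),
    )
    (campaign_id,) = conn.execute(
        "SELECT campaign_id FROM dim_campaign WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO fact_clicks VALUES (?, ?, ?, ?)",
        (campaign_id, day, clicks, cost),
    )

summary = conn.execute(
    "SELECT traffic_type, clicks, cost FROM v_cost_by_traffic ORDER BY traffic_type"
).fetchall()
```

Keeping descriptive attributes in the dimension and only keys and measures in the fact table is what lets a single view answer questions across all loaded sources.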
Abstract:
Dissertation presented to obtain the Ph.D. degree in Bioinformatics
Abstract:
During the last few years, many research efforts have been made to improve the design of ETL (Extract-Transform-Load) systems. ETL systems are considered very time-consuming, error-prone and complex, involving several participants from different knowledge domains. ETL processes are among the most important components of a data warehousing system and are strongly influenced by the complexity of business requirements and by their change and evolution. These aspects influence not only the structure of a data warehouse but also the structures of the data sources involved. To minimize the negative impact of such variables, we propose the use of ETL patterns to build specific ETL packages. In this paper, we formalize this approach using BPMN (Business Process Model and Notation) for modelling more conceptual ETL workflows, mapping them to real execution primitives through the use of a domain-specific language that allows for the generation of specific instances that can be executed in a commercial ETL tool.
Abstract:
OpenAIRE supports the European Commission Open Access policy by providing an infrastructure for researchers to comply with the European Union Open Access mandate. The current OpenAIRE infrastructure and services, resulting from the OpenAIRE and OpenAIREplus FP7 projects, build on Open Access research results from a wide range of repositories and other data sources: institutional or thematic publication repositories, Open Access journals, data repositories, Current Research Information Systems and aggregators. (...)
Abstract:
We explore the determinants of usage of six different types of health care services, using the Medical Expenditure Panel Survey data, years 1996-2000. We apply a number of models for univariate count data, including semiparametric, semi-nonparametric and finite mixture models. We find that the complexity of the model that is required to fit the data well depends upon the way in which the data is pooled across sexes and over time, and upon the characteristics of the usage measure. Pooling across time and sexes is almost always favored, but when more heterogeneous data is pooled it is often the case that a more complex statistical model is required.
Abstract:
As life expectancy continues to rise, the prevalence of chronic conditions is increasing in our society. However, we do not know if the extra years of life gained are being spent with disability and illness, or in good health. Furthermore, it is unclear if all groups in society experience their extra years of life in the same way. This report examines patterns of health expectancies across the island of Ireland, examining any North-South and socio-economic differences as well as looking at differences in data sources. The older population (aged 65 or over) on the island of Ireland is growing and becoming a larger percentage of the total population. Republic of Ireland Census 2011 revealed that 12% of the RoI population was aged 65 or over (CSO, 2012), and Northern Ireland Census 2011 revealed that 13% of the NI population was aged 65 or over (NISRA, 2012). By 2041 the population aged 65 or over is projected to reach 22% in RoI and 24% in NI (McGill, 2010). It is unclear, however, if this increasing longevity will be enjoyed equally by all strata of society.
Abstract:
Background: The variety of DNA microarray formats and datasets presently available offers an unprecedented opportunity to perform insightful comparisons of heterogeneous data. Cross-species studies, in particular, have the power of identifying conserved, functionally important molecular processes. Validation of discoveries can now often be performed in readily available public data, which frequently requires cross-platform studies. Cross-platform and cross-species analyses require matching probes on different microarray formats. This can be achieved using the information in microarray annotations and additional molecular biology databases, such as orthology databases. Although annotations and other biological information are stored using modern database models (e.g. relational), they are very often distributed and shared as tables in text files, i.e. flat file databases. This common flat database format thus provides a simple and robust solution to flexibly integrate various sources of information and a basis for the combined analysis of heterogeneous gene expression profiles. Results: We provide annotationTools, a Bioconductor-compliant R package to annotate microarray experiments and integrate heterogeneous gene expression profiles using annotation and other molecular biology information available as flat file databases. First, annotationTools contains a specialized set of functions for mining this widely used database format in a systematic manner. It thus offers a straightforward solution for annotating microarray experiments.
Second, building on these basic functions and relying on the combination of information from several databases, it provides tools to easily perform cross-species analyses of gene expression data. Here, we present two example applications of annotationTools that are of direct relevance for the analysis of heterogeneous gene expression profiles, namely a cross-platform mapping of probes and a cross-species mapping of orthologous probes using different orthology databases. We also show how to perform an explorative comparison of disease-related transcriptional changes in human patients and in a genetic mouse model. Conclusion: The R package annotationTools provides a simple solution to handle microarray annotation and orthology tables, as well as other flat molecular biology databases. Thereby, it allows easy integration and analysis of heterogeneous microarray experiments across different technological platforms or species.
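The cross-platform probe matching described above can be sketched in a few lines of Python. This is not the annotationTools API, which is an R package: the function names, the column headers `probe` and `gene_id`, and the inline tables are all assumptions made for illustration. Two tab-separated annotation tables are read as flat files, and probes from one platform are matched to probes on the other through a shared gene identifier.

```python
import csv
import io
from collections import defaultdict


def load_probe_to_gene(tsv_text):
    """Read a tab-separated annotation table into {probe: gene_id}.

    Probes with an empty gene identifier are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["probe"]: row["gene_id"] for row in reader if row["gene_id"]}


def map_probes(source, target):
    """For each source probe, list the target probes annotated to the same gene."""
    probes_by_gene = defaultdict(list)
    for probe, gene in target.items():
        probes_by_gene[gene].append(probe)
    return {
        probe: sorted(probes_by_gene.get(gene, []))
        for probe, gene in source.items()
    }


# Hypothetical flat-file annotation tables for two array platforms.
platform_a = "probe\tgene_id\nA_1\t1001\nA_2\t1002\nA_3\t\n"
platform_b = "probe\tgene_id\nB_7\t1001\nB_8\t1001\nB_9\t1003\n"

mapping = map_probes(
    load_probe_to_gene(platform_a),
    load_probe_to_gene(platform_b),
)
```

A cross-species mapping would add one more lookup in the middle, translating each gene identifier through an orthology table before matching.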
Abstract:
The analysis of multi-modal and multi-sensor images is nowadays of paramount importance for Earth Observation (EO) applications. There exist a variety of methods that aim at fusing the different sources of information to obtain a compact representation of such datasets. However, for change detection existing methods are often unable to deal with heterogeneous image sources and very few consider possible nonlinearities in the data. Additionally, the availability of labeled information is very limited in change detection applications. For these reasons, we present the use of a semi-supervised kernel-based feature extraction technique. It incorporates a manifold regularization accounting for the geometric distribution and jointly addressing the small sample problem. An exhaustive example using Landsat 5 data illustrates the potential of the method for multi-sensor change detection.