863 results for Data sources detection
Abstract:
Background: With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide the reproducible and comparable results needed for global-scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available, but this wealth of data remains underused and is frequently poorly annotated, since no user-friendly tool exists to analyze and explore it. Results: PHYLOViZ is platform-independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole-genome sequencing approaches, together with associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available. Conclusions: PHYLOViZ is user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net.
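The minimum spanning tree idea behind this kind of analysis can be sketched in a few lines. This is an illustrative toy, not PHYLOViZ's or goeBURST's actual code: isolates are MLST allelic profiles, distance is the number of differing loci (Hamming distance), and Prim's algorithm links closest profiles first. The isolate names and profiles are hypothetical.

```python
# Illustrative sketch (not PHYLOViZ's implementation): a minimum
# spanning tree over MLST allelic profiles under Hamming distance.

def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def mst_edges(profiles):
    """Prim's algorithm over the complete Hamming-distance graph.

    `profiles` maps isolate name -> tuple of allele numbers.
    Returns (distance, u, v) edges forming a spanning tree.
    """
    names = list(profiles)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        best = min(
            (hamming(profiles[u], profiles[v]), u, v)
            for u in in_tree for v in names if v not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges

# Three hypothetical isolates typed at 7 loci.
profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 1, 1, 1, 1, 1, 2),  # single-locus variant of ST1
    "ST3": (1, 2, 3, 1, 1, 1, 2),  # closer to ST2 than to ST1
}
print(mst_edges(profiles))
```

Single-locus variants end up directly linked, which is the grouping logic the eBURST family of methods builds on.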
Abstract:
This paper presents the SmartClean tool, whose purpose is to detect and correct data quality problems (DQPs). Compared with existing tools, SmartClean has one main advantage: the user does not need to specify the execution sequence of the data cleaning operations. To that end, an execution sequence was developed, and the problems are manipulated (i.e., detected and corrected) following that sequence. The sequence also supports the incremental execution of the operations. In this paper, the underlying architecture of the tool is presented and its components are described in detail. The validity of the tool, and consequently of the architecture, is demonstrated through a case study. Although SmartClean has cleaning capabilities at all other levels, only those related to the attribute value level are described in this paper.
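The fixed-execution-sequence idea can be sketched as follows. This is a minimal, hypothetical illustration of the concept in the abstract, not SmartClean's architecture: attribute-level cleaning operations run in a predefined order, so the caller never specifies one.

```python
# Minimal sketch of a fixed execution sequence for attribute-level
# cleaning operations (hypothetical operations, not SmartClean's).

def strip_whitespace(value):
    return value.strip() if isinstance(value, str) else value

def normalize_case(value):
    return value.lower() if isinstance(value, str) else value

def fill_missing(value, default="unknown"):
    return default if value in ("", None) else value

# The fixed sequence: syntactic fixes first, then semantic defaults.
EXECUTION_SEQUENCE = [strip_whitespace, normalize_case, fill_missing]

def clean_record(record):
    """Apply every operation, in order, to each attribute value."""
    cleaned = dict(record)
    for op in EXECUTION_SEQUENCE:
        cleaned = {k: op(v) for k, v in cleaned.items()}
    return cleaned

print(clean_record({"name": "  Alice ", "city": None}))
# {'name': 'alice', 'city': 'unknown'}
```

Ordering matters: trimming before the missing-value check ensures that a value consisting only of spaces would still be treated as missing.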
Abstract:
This dissertation addresses the problem of building a data warehouse for AdClick, a company operating in digital marketing. Digital marketing is a type of marketing that uses digital communication channels with the same purpose as the traditional method: promoting goods, businesses and services and attracting new customers. There are several digital marketing strategies aimed at these goals, most notably organic traffic and paid traffic. Organic traffic is characterized by marketing actions that involve no costs for the promotion and/or acquisition of potential customers, whereas paid traffic requires investment in campaigns capable of driving and attracting new customers. The dissertation first reviews the state of the art in business intelligence and data warehousing and presents their main advantages for companies. Business intelligence systems are needed because companies today hold large volumes of information-rich data that can only be properly explored using the capabilities of these systems. Accordingly, the first step in developing a business intelligence system is to concentrate all data in a single integrated system capable of supporting decision making, and the data warehouse is precisely that system. In this dissertation, the data sources that will feed the data warehouse were surveyed and the contextualization of the company's existing business processes was begun. The data warehouse was then built: the dimensions and fact tables were created, the processes for extracting and loading data into the data warehouse were defined, and the various views were created.
Regarding the impact of this dissertation, the partner company gains several business advantages from the implementation of the data warehouse and of the ETL processes that load all information sources. Among these advantages are the centralization of information, more flexibility for managers in how they access information, and data processing that makes it possible to extract information from the data.
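The dimension and fact-table loading described above can be sketched as follows. The tables, channels and values are hypothetical illustrations of a star-schema load, not AdClick's actual schema or ETL code.

```python
# Illustrative star-schema load: a dimension gets surrogate keys,
# and the fact table references the dimension by those keys.
campaigns = [  # hypothetical source rows: (campaign, channel, clicks, cost)
    ("spring_sale", "paid", 120, 30.0),
    ("blog_post", "organic", 80, 0.0),
]

# Dimension: assign a surrogate key per distinct channel.
dim_channel = {}
for _, channel, _, _ in campaigns:
    dim_channel.setdefault(channel, len(dim_channel) + 1)

# Fact table: one row per campaign, keyed into the dimension.
fact_traffic = [
    {"channel_key": dim_channel[ch], "campaign": name,
     "clicks": clicks, "cost": cost}
    for name, ch, clicks, cost in campaigns
]
print(dim_channel)
```

Separating the channel attribute into its own dimension is what later lets views aggregate paid versus organic traffic without rescanning the raw sources.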
Abstract:
Dissertation presented to obtain the Ph.D. degree in Bioinformatics
Abstract:
Introduction Leprosy remains a relevant public health issue in Brazil. The objective of this study was to analyze the spatial distribution of new cases of leprosy and to detect areas with higher risk of disease in the City of Vitória. Methods This was an ecological study of the spatial distribution of leprosy in the City of Vitória, State of Espírito Santo, between 2005 and 2009. The data came from the available records of the State Health Secretary of Espírito Santo. Global and local empirical Bayesian methods were used in the spatial analysis to estimate leprosy risk and to smooth the random fluctuation of the detection coefficients. Results The study used thematic maps to illustrate that leprosy is distributed heterogeneously between the neighborhoods and that it is possible to identify areas with a high risk of disease. A Pearson correlation coefficient of 0.926 (p = 0.001) for the local method indicated highly correlated coefficients. The Moran index was calculated to evaluate correlations between the incidences of adjoining districts. Conclusions We identified the spatial contexts with the highest incidence rates of leprosy in Vitória during the studied period. The results contribute to the knowledge of the spatial distribution of leprosy in the City of Vitória, which can help establish more cost-effective control strategies, because they indicate specific regions and priority planning activities that can interfere with the transmission chain.
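Global empirical Bayesian smoothing of area-level rates can be sketched as below. This follows the commonly used Marshall-style global estimator (shrinking unstable small-population rates toward the overall mean); it illustrates the general technique named in the abstract, not the authors' exact computation, and the case counts are invented.

```python
# Sketch of global empirical Bayesian rate smoothing: raw rates are
# shrunk toward the global mean, most strongly where populations are
# small and rates therefore fluctuate randomly.
def eb_smooth(cases, pops):
    """Return smoothed rates for per-area case counts and populations."""
    rates = [c / p for c, p in zip(cases, pops)]
    m = sum(cases) / sum(pops)                 # global mean rate
    n_bar = sum(pops) / len(pops)              # mean population
    # Population-weighted between-area variance of the raw rates.
    s2 = sum(p * (r - m) ** 2 for p, r in zip(pops, rates)) / sum(pops)
    a = max(s2 - m / n_bar, 0.0)               # estimated prior variance
    smoothed = []
    for r, p in zip(rates, pops):
        w = a / (a + m / p) if (a + m / p) > 0 else 0.0
        smoothed.append(w * r + (1 - w) * m)   # shrink toward the mean
    return smoothed

# Two hypothetical neighborhoods with equal populations.
print(eb_smooth([10, 50], [1000, 1000]))
```

Each smoothed rate lands between the neighborhood's raw rate and the city-wide mean, which is exactly the fluctuation-damping effect the abstract refers to.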
Abstract:
During the last few years, many research efforts have been made to improve the design of ETL (Extract-Transform-Load) systems. ETL systems are considered very time-consuming, error-prone and complex, involving several participants from different knowledge domains. ETL processes are one of the most important components of a data warehousing system and are strongly influenced by the complexity of business requirements and by their change and evolution. These aspects influence not only the structure of a data warehouse but also the structures of the data sources involved. To minimize the negative impact of such variables, we propose the use of ETL patterns to build specific ETL packages. In this paper, we formalize this approach using BPMN (Business Process Model and Notation) for modelling more conceptual ETL workflows, mapping them to real execution primitives through the use of a domain-specific language that allows for the generation of specific instances that can be executed in a commercial ETL tool.
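The pattern-to-instance idea can be sketched as follows. This is a hedged illustration of what instantiating a reusable ETL pattern might look like; the pattern name and parameters are hypothetical and do not reflect the paper's DSL.

```python
# Sketch: an ETL pattern as a parameterized template that is
# instantiated into a concrete, executable task (hypothetical
# "surrogate key lookup" pattern, not the authors' notation).
def make_surrogate_key_lookup(dim):
    """Instantiate the pattern for one dimension table (a dict)."""
    next_key = [max(dim.values(), default=0)]

    def lookup(natural_key):
        # Known key: return it; unknown key: allocate the next surrogate.
        if natural_key not in dim:
            next_key[0] += 1
            dim[natural_key] = next_key[0]
        return dim[natural_key]

    return lookup

customers = {}
lookup = make_surrogate_key_lookup(customers)
print([lookup("ana"), lookup("rui"), lookup("ana")])  # [1, 2, 1]
```

The same template, bound to different dimensions, yields distinct executable tasks, which is the reuse that makes pattern-based ETL packages attractive.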
Abstract:
OpenAIRE supports the European Commission Open Access policy by providing an infrastructure for researchers to comply with the European Union Open Access mandate. The current OpenAIRE infrastructure and services, resulting from the OpenAIRE and OpenAIREplus FP7 projects, build on Open Access research results from a wide range of repositories and other data sources: institutional or thematic publication repositories, Open Access journals, data repositories, Current Research Information Systems and aggregators. (...)
Abstract:
The MAP-i Doctoral Programme in Informatics, of the Universities of Minho, Aveiro and Porto
Abstract:
As life expectancy continues to rise, the prevalence of chronic conditions is increasing in our society. However, we do not know if the extra years of life gained are being spent with disability and illness, or in good health. Furthermore, it is unclear if all groups in society experience their extra years of life in the same way. This report examines patterns of health expectancies across the island of Ireland, examining any North-South and socio-economic differences as well as looking at differences in data sources. The older population (aged 65 or over) on the island of Ireland is growing and becoming a larger percentage of the total population. The Republic of Ireland Census 2011 revealed that 12% of the RoI population was aged 65 or over (CSO, 2012), and the Northern Ireland Census 2011 revealed that 13% of the NI population was aged 65 or over (NISRA, 2012). By 2041 the population aged 65 or over is projected to reach 22% in RoI and 24% in NI (McGill, 2010). It is unclear, however, if this increasing longevity will be enjoyed equally by all strata of society.
Abstract:
Background: Systematic approaches for identifying proteins involved in different types of cancer are needed. Experimental techniques such as microarrays are being used to characterize cancer, but validating their results can be a laborious task. Computational approaches are used to prioritize genes putatively involved in cancer, usually by further analyzing experimental data. Results: We implemented a systematic method using the PIANA software that predicts cancer involvement of genes by integrating heterogeneous datasets. Specifically, we produced lists of genes likely to be involved in cancer by relying on: (i) protein-protein interactions; (ii) differential expression data; and (iii) structural and functional properties of cancer genes. The integrative approach that combines multiple sources of data obtained positive predictive values ranging from 23% (on a list of 811 genes) to 73% (on a list of 22 genes), outperforming the use of any of the data sources alone. We analyze a list of 20 cancer gene predictions, finding that most of them have been recently linked to cancer in the literature. Conclusion: Our approach to identifying and prioritizing candidate cancer genes can be used to produce lists of genes likely to be involved in cancer. Our results suggest that differential expression studies yielding high numbers of candidate cancer genes can be filtered using protein interaction networks.
Abstract:
Context There are no evidence syntheses available to guide clinicians on when to titrate antihypertensive medication after initiation. Objective To model the blood pressure (BP) response after initiating antihypertensive medication. Data sources Electronic databases including Medline, Embase, the Cochrane Register and reference lists up to December 2009. Study selection Trials that initiated antihypertensive medication as single therapy in hypertensive patients who were either drug naive or had a placebo washout from previous drugs. Data extraction Office BP measurements at a minimum of two-weekly intervals for a minimum of 4 weeks. An asymptotic approach model of BP response was assumed, and non-linear mixed effects modelling was used to calculate model parameters. Results Eighteen trials that recruited 4168 patients met the inclusion criteria. The time to reach 50% of the maximum estimated BP lowering effect was 1 week (systolic 0.91 weeks, 95% CI 0.74 to 1.10; diastolic 0.95, 0.75 to 1.15). Models incorporating drug class as a source of variability did not improve fit of the data. Incorporating the presence of a titration schedule improved model fit for both systolic and diastolic pressure. Titration increased both the predicted maximum effect and the time taken to reach 50% of the maximum (systolic 1.2 vs 0.7 weeks; diastolic 1.4 vs 0.7 weeks). Conclusions Estimates of the maximum efficacy of antihypertensive agents can be made early after starting therapy. This knowledge will guide clinicians in deciding when a newly started antihypertensive agent is likely to be effective or not at controlling BP.
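An asymptotic-approach response model of this kind can be written down directly: the effect approaches a maximum exponentially, so half the maximum is reached at t50 = ln(2)/k. The t50 of 0.91 weeks comes from the abstract; the maximum effect below is an illustrative value, not a result from the paper.

```python
# Sketch of an asymptotic-approach BP response model:
# effect(t) = E_max * (1 - exp(-k * t)), with k = ln(2) / t50.
import math

def bp_effect(t_weeks, e_max, t50):
    """Predicted BP reduction t_weeks after starting a drug."""
    k = math.log(2) / t50
    return e_max * (1 - math.exp(-k * t_weeks))

e_max = 10.0   # hypothetical maximum systolic drop, mmHg
t50 = 0.91     # weeks to half-maximal effect (from the abstract)

# At t = t50 the model gives exactly half the maximum effect.
assert abs(bp_effect(t50, e_max, t50) - e_max / 2) < 1e-9
print(bp_effect(4, e_max, t50))  # close to e_max by week 4
```

With a 1-week half-time, the effect is nearly complete within a month, which is why the authors argue maximum efficacy can be estimated early.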
Abstract:
According to the Centers for Disease Control and Prevention, unintentional injury is the fifth leading cause of death for all age groups and the leading cause of death for people from 1 to 44 years of age in the United States, while homicide remains the second leading cause of death for 15 to 24 year olds (CDC, 2006). In 2004, there were approximately 144,000 deaths due to unintentional injuries in the US, 53% of which were people over 45 years of age (CDC, 2004). With 20,322 suicides and 13,170 homicides, intentional injury deaths affect mostly people under 45 years old. On average, there are 1,150 unintentional deaths per year in Iowa. In 2004, 37% of unintentional deaths were due to motor vehicle accidents (MTVCC), occurring across all age ranges, and 30% were due to falls, which involved persons over 65 years of age 82% of the time (IDPH Health Stat Div., 2004). The most debilitating outcome of injury is traumatic brain injury, which is characterized by the irreversibility of its damage, long-term effects on quality of life, and healthcare costs. The latest data available from the CDC estimate that, nationally, 50,000 people with traumatic brain injury (TBI) die each year; three times as many are hospitalized and more than twenty times as many are released from emergency room (ER) departments (CDC, 2006). Besides the TBI registry, brain injury data are also captured through three other data sources: 1) death certificates; 2) hospital inpatient data; and 3) hospital outpatient data. The inpatient and outpatient hospital data are managed by the Iowa Hospital Association, which provides the hospital data, without personal identifiers, to the Iowa Department of Public Health. (The hospitals send reports to the Agency for Healthcare Research and Quality, which developed the Healthcare Cost and Utilization Project and its product, the National Inpatient Sample.)
Abstract:
Transportation planners typically use census data or small sample surveys to help estimate work trips in metropolitan areas. Census data are cheap to use but are only collected every 10 years and may not provide the answers that a planner is seeking. On the other hand, small sample survey data are fresh but can be very expensive to collect. This project involved using database and geographic information systems (GIS) technology to relate several administrative data sources that are not usually employed by transportation planners, including data collected by state agencies for unemployment insurance purposes and for driver's licensing. Together, these data sources could allow better estimates of the following information for a metropolitan area or planning region:
· locations of employers (work sites);
· locations of employees;
· travel flows between employees' homes and their work locations.
The required new employment database was created for a large, multi-county region in central Iowa. When evaluated against the estimates of a metropolitan planning organization, the new database allowed a one to four percent improvement in estimates over the traditional approach. While this may not sound highly significant, the approach of using improved employment data to synthesize home-based work (HBW) trip tables was particularly beneficial in improving estimated traffic on high-capacity routes. These are precisely the routes that transportation planners are most interested in modeling accurately. Therefore, the concept of using improved employment data for transportation planning was considered valuable and worthy of follow-up research.
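The record-linkage idea above can be sketched as a join between two administrative tables followed by an aggregation into home-to-work flows. The tables, zone names and counts are hypothetical toy data, not the project's actual databases.

```python
# Toy sketch: join an employer (work-site) table to an employee table,
# then count home-based-work flows per (home zone, work zone) pair.
from collections import Counter

employers = {  # employer_id -> work zone (hypothetical)
    "E1": "downtown",
    "E2": "suburb",
}
employees = [  # (employer_id, home zone), e.g. from licensing records
    ("E1", "north"), ("E1", "north"), ("E1", "south"),
    ("E2", "north"),
]

def hbw_flows(employers, employees):
    """Count home-based-work trips per (home zone, work zone) pair."""
    return Counter((home, employers[emp]) for emp, home in employees)

print(hbw_flows(employers, employees))
```

The resulting flow table is exactly the kind of HBW trip table the project synthesized, just at toy scale.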
Abstract:
Background Nowadays, combining different sources of information to improve the available biological knowledge is a challenge in bioinformatics. Kernel-based methods are among the most powerful for integrating heterogeneous data types. Kernel-based data integration approaches consist of two basic steps: first, the right kernel is chosen for each data set; second, the kernels from the different data sources are combined to give a complete representation of the available data for a given statistical task. Results We analyze the integration of data from several sources of information using kernel PCA, from the point of view of reducing dimensionality. Moreover, we improve the interpretability of kernel PCA by adding to the plot the representation of the input variables that belong to any dataset. In particular, for each input variable or linear combination of input variables, we can represent the direction of maximum growth locally, which allows us to identify those samples with higher/lower values of the variables analyzed. Conclusions The integration of different datasets and the simultaneous representation of samples and variables together give us a better understanding of biological knowledge.
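The two-step scheme can be sketched numerically: one kernel per source, combined (here by an unweighted sum, one common choice among several), followed by kernel PCA on the combined, centered kernel matrix. The data are random toys and the linear kernels are a simplifying assumption, not the authors' setup.

```python
# Sketch of kernel-based integration: (1) a kernel per data source,
# (2) combine the kernels, then run kernel PCA on the result.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 4))  # source 1: 6 samples, 4 features
X2 = rng.normal(size=(6, 3))  # source 2: same 6 samples, 3 features

# Step 1: one kernel per source (linear kernels for simplicity).
K1, K2 = X1 @ X1.T, X2 @ X2.T
# Step 2: combine, then center the kernel matrix for kernel PCA.
K = K1 + K2
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
Kc = H @ K @ H

# Kernel PCA = eigendecomposition of the centered kernel,
# largest eigenvalues first.
eigvals, eigvecs = np.linalg.eigh(Kc)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Sample projections onto the first two principal axes.
scores = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0))
print(scores.shape)  # (6, 2)
```

The 2-D scores give a single joint embedding of samples described by both sources, which is the dimensionality-reduction view taken in the paper.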
Abstract:
CONTEXT: Subclinical hypothyroidism has been associated with increased risk of coronary heart disease (CHD), particularly with thyrotropin levels of 10.0 mIU/L or greater. The measurement of thyroid antibodies helps predict the progression to overt hypothyroidism, but it is unclear whether thyroid autoimmunity independently affects CHD risk. OBJECTIVE: The objective of the study was to compare the CHD risk of subclinical hypothyroidism with and without thyroid peroxidase antibodies (TPOAbs). DATA SOURCES AND STUDY SELECTION: A MEDLINE and EMBASE search from 1950 to 2011 was conducted for prospective cohorts, reporting baseline thyroid function, antibodies, and CHD outcomes. DATA EXTRACTION: Individual data of 38 274 participants from six cohorts for CHD mortality followed up for 460 333 person-years and 33 394 participants from four cohorts for CHD events. DATA SYNTHESIS: Among 38 274 adults (median age 55 y, 63% women), 1691 (4.4%) had subclinical hypothyroidism, of whom 775 (45.8%) had positive TPOAbs. During follow-up, 1436 participants died of CHD and 3285 had CHD events. Compared with euthyroid individuals, age- and gender-adjusted risks of CHD mortality in subclinical hypothyroidism were similar among individuals with and without TPOAbs [hazard ratio (HR) 1.15, 95% confidence interval (CI) 0.87-1.53 vs HR 1.26, CI 1.01-1.58, P for interaction = .62], as were risks of CHD events (HR 1.16, CI 0.87-1.56 vs HR 1.26, CI 1.02-1.56, P for interaction = .65). Risks of CHD mortality and events increased with higher thyrotropin, but within each stratum, risks did not differ by TPOAb status. CONCLUSIONS: CHD risk associated with subclinical hypothyroidism did not differ by TPOAb status, suggesting that biomarkers of thyroid autoimmunity do not add independent prognostic information for CHD outcomes.