978 results for Prediction algorithms


Relevance: 60.00%

Abstract:

One of the fundamental machine learning tasks is that of predictive classification. Given that organisations collect an ever increasing amount of data, predictive classification methods must be able to effectively and efficiently handle large amounts of data. However, present requirements push existing algorithms to, and sometimes beyond, their limits, since many classification prediction algorithms were designed when currently common data set sizes were beyond imagination. This has led to a significant amount of research into ways of making classification learning algorithms more effective and efficient. Although substantial progress has been made, a number of key questions have not been answered. This dissertation investigates two of these key questions. The first is whether different types of algorithms to those currently employed are required when using large data sets. This is answered by analysis of the way in which the bias plus variance decomposition of predictive classification error changes as training set size is increased. Experiments find that larger training sets require different types of algorithms to those currently used. Some insight into the characteristics of suitable algorithms is provided, and this may give some direction for the development of future classification prediction algorithms that are specifically designed for use with large data sets. The second question investigated is the role of sampling in machine learning with large data sets. Sampling has long been used as a means of avoiding the need to scale up algorithms to suit the size of the data set, by scaling down the size of the data set to suit the algorithm. However, the costs of performing sampling have not been widely explored. Two popular sampling methods are compared with learning from all available data in terms of predictive accuracy, model complexity, and execution time. The comparison shows that sub-sampling generally produces models with accuracy close to, and sometimes greater than, that obtainable from learning with all available data. This result suggests that it may be possible to develop algorithms that take advantage of the sub-sampling methodology to reduce the time required to infer a model while sacrificing little if any accuracy. Methods of improving effective and efficient learning via sampling are also investigated, and new sampling methodologies are proposed. These methodologies include using a varying proportion of instances to determine the next inference step and using a statistical calculation at each inference step to determine a sufficient sample size. Experiments show that using a statistical calculation of sample size can not only substantially reduce execution time but can do so with only a small loss, and occasional gain, in accuracy. One of the common uses of sampling is in the construction of learning curves. Learning curves are often used in an attempt to determine the optimal training set size that will maximally reduce execution time while not being detrimental to accuracy. An analysis of the performance of methods for detecting convergence of learning curves is performed, with the focus of the analysis on methods that calculate the gradient of the tangent to the curve. Given that such methods can be susceptible to local accuracy plateaus, an investigation into the frequency of local plateaus is also performed.
It is shown that local accuracy plateaus are a common occurrence, and that ensuring a small loss of accuracy often results in greater computational cost than learning from all available data. These results cast doubt over the applicability of gradient of tangent methods for detecting convergence, and of the viability of learning curves for reducing execution time in general.
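
As a concrete illustration of the progressive-sampling and convergence-detection ideas discussed above, the following sketch trains a classifier on nested subsamples of increasing size and stops when the gradient of the tangent to the learning curve falls below a threshold. The classifier, the size schedule and the threshold are illustrative assumptions, not the methods evaluated in the dissertation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def learning_curve_point(X_train, y_train, X_test, y_test, n, seed=0):
    """Accuracy of a classifier trained on the first n instances of the training set."""
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train[:n], y_train[:n])
    return accuracy_score(y_test, model.predict(X_test))

def progressive_sampling(X, y, sizes, grad_tol=1e-6, seed=0):
    """Grow the training sample along 'sizes' and stop when the gradient of the
    tangent to the learning curve (slope between the last two points) falls
    below grad_tol. Note: susceptible to local accuracy plateaus."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    accs = []
    for n in sizes:
        accs.append(learning_curve_point(X_train, y_train, X_test, y_test, n, seed))
        if len(accs) >= 2:
            grad = (accs[-1] - accs[-2]) / (sizes[len(accs) - 1] - sizes[len(accs) - 2])
            if abs(grad) < grad_tol:
                break
    return sizes[:len(accs)], accs
```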

Relevance: 60.00%

Abstract:

In this dissertation, several candidate genes for Wilms tumour (WT), a tumour of the kidney, were identified and characterised. Since this early-childhood malignancy results from incorrectly proceeding metanephrogenesis, the gene expression patterns of various human Wilms tumour and normal kidney tissues (adult as well as fetal kidney) were compared using the differential display technique, and the genes identified as differentially expressed were cloned and characterised. TM7SF1 is a novel gene whose transcription is switched on in the course of metanephrogenesis. Based on structure predictions, the putative protein it encodes can presumably be assigned to the family of G protein-coupled receptors. Its inferable function as a signalling molecule of kidney development, together with its localisation in a WT locus (1q42-q43), makes TM7SF1 a promising candidate gene for WT. In addition, the prerequisites for functional assays allowing further characterisation of TM7SF1 were established (identification and cloning of the murine homologue, stably overexpressing WT cell lines, antibodies against the amino terminus of the putative protein). With TCF2, a further gene was identified whose product plays a role in processes of metanephrogenesis. The significant down-regulation of TCF2 expression in the large majority of the WTs examined, the regulation by the WT1 gene product demonstrated within this work, and its genomic localisation in an interval for the familial form of WT (FWT1 in 17q12-q21) show the potential of TCF2 as a candidate gene for FWT. Furthermore, GLI3 was identified as a gene strongly expressed in several WTs. Its product is a component of the sonic hedgehog signal transduction pathway, which is relevant in developmental biology and involved in various tumour diseases. With FE7A3 and CDT151, two differentially expressed cDNAs were identified that represent parts of novel genes and could be mapped to WT loci. Based on homology comparisons within the identified open reading frames, a possible role of the putative gene products in WT pathogenesis could be inferred, as a cell adhesion molecule (FE7A3) and as a proliferation-associated transcription factor (CDT151), respectively. In addition to the comparative gene expression studies, a second approach analysed the transcriptional regulation of the only Wilms tumour gene cloned to date (WT1). Using comparative reporter gene analyses in WT1-expressing and non-expressing cell lines, new regions relevant for the transcriptional regulation of WT1 were identified. Furthermore, the functional antagonism between the transcription factors SP1 and SP3 described for other promoters was examined for WT1 expression and correlated, in gel retardation analyses, with the WT1 expression status of the above-mentioned cell lines.

Relevance: 60.00%

Abstract:

Treatment with growth hormone (GH) has become standard practice, both as replacement therapy in GH-deficient children and as pharmacotherapy in a variety of disorders involving short stature. However, even today, the reported adult heights achieved often remain below the normal range. In addition, the treatment is expensive and may be associated with long-term risks. Thus, a discussion of the factors relevant for achieving an optimal individual outcome in terms of growth, costs, and risks is required. In the present review, the heterogeneous approaches to treatment with GH are discussed, considering the parameters available for evaluating the short- and long-term outcomes at different stages of treatment. This discourse introduces the potential of the newly emerging prediction algorithms in comparison to other, more conventional approaches for planning and evaluating the response to GH. In rare disorders such as those with short stature, treatment decisions cannot easily be deduced from personal experience. An interactive approach that uses the experience derived from large cohorts for the evaluation of the individual patient and the required decision-making may facilitate the use of GH. Such an approach should also help avoid unnecessary long-term treatment in unresponsive individuals.

Relevance: 60.00%

Abstract:

Electroencephalograms (EEG) are often contaminated with high-amplitude artifacts that limit the usability of the data. Methods that reduce these artifacts are often restricted to certain types of artifacts, require manual interaction, or need large training data sets. In this paper we introduce a novel method that is able to eliminate many different types of artifacts without manual intervention. The algorithm first decomposes the signal into different sub-band signals in order to isolate different types of artifacts in specific frequency bands. After the sub-band decomposition, the signal is decomposed with principal component analysis (PCA) and an adaptive threshold is applied to eliminate components with high variance, which correspond to the dominant artifact activity. Our results show that the algorithm is able to significantly reduce artifacts while preserving the EEG activity. The parameters of the algorithm do not have to be identified for every patient individually, making the method a good candidate for preprocessing in automatic seizure detection and prediction algorithms.
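
A minimal sketch of this type of pipeline, assuming a wavelet decomposition for the sub-band step, channel-wise PCA per sub-band, and a median-plus-IQR variance threshold; the specific decomposition, threshold rule and constants are assumptions for illustration, not the parameters of the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
import pywt

def remove_artifacts(eeg, wavelet="db4", level=4, k=3.0):
    """eeg: array of shape (n_channels, n_samples).
    Decompose each channel into wavelet sub-bands, run PCA across channels per
    sub-band, zero out components whose variance exceeds an adaptive threshold
    (median + k * IQR of the component variances), then reconstruct."""
    coeffs = [pywt.wavedec(ch, wavelet, level=level) for ch in eeg]
    cleaned = [list(c) for c in coeffs]
    n_bands = level + 1
    for b in range(n_bands):
        band = np.vstack([coeffs[ch][b] for ch in range(eeg.shape[0])])
        pca = PCA()
        scores = pca.fit_transform(band.T)            # rows: samples, cols: components
        var = scores.var(axis=0)
        q1, q3 = np.percentile(var, [25, 75])
        thresh = np.median(var) + k * (q3 - q1)       # adaptive variance threshold
        scores[:, var > thresh] = 0.0                 # suppress dominant artifact components
        band_clean = pca.inverse_transform(scores).T
        for ch in range(eeg.shape[0]):
            cleaned[ch][b] = band_clean[ch]
    return np.vstack([pywt.waverec(c, wavelet)[:eeg.shape[1]] for c in cleaned])
```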

Relevance: 60.00%

Abstract:

Maximizing data quality may be especially difficult in trauma-related clinical research. Strategies are needed to improve data quality and assess the impact of data quality on clinical predictive models. This study had two objectives. The first was to compare missing data between two multi-center trauma transfusion studies: a retrospective study (RS) using medical chart data with minimal data quality review and the PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study with standardized quality assurance. The second objective was to assess the impact of missing data on clinical prediction algorithms by evaluating blood transfusion prediction models using PROMMTT data. RS (2005-06) and PROMMTT (2009-10) investigated trauma patients receiving ≥ 1 unit of red blood cells (RBC) from ten Level I trauma centers. Missing data were compared for 33 variables collected in both studies using mixed effects logistic regression (including random intercepts for study site). Massive transfusion (MT) patients received ≥ 10 RBC units within 24 h of admission. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation based on the multivariate normal distribution. A sensitivity analysis for missing data was conducted to estimate the upper and lower bounds of correct classification using assumptions about missing data under best and worst case scenarios. Most variables (17/33 = 52%) had <1% missing data in RS and PROMMTT. Of the remaining variables, 50% demonstrated less missingness in PROMMTT, 25% had less missingness in RS, and 25% were similar between studies. Missing percentages for MT prediction variables in PROMMTT ranged from 2.2% (heart rate) to 45% (respiratory rate). For variables missing >1%, study site was associated with missingness (all p≤0.021). Survival time predicted missingness for 50% of RS and 60% of PROMMTT variables. Complete case proportions for the MT models ranged from 41% to 88%. Complete case analysis and multiple imputation demonstrated similar correct classification results. Sensitivity analysis upper-lower bound ranges for the three MT models were 59-63%, 36-46%, and 46-58%. Prospective collection of ten-fold more variables with data quality assurance reduced overall missing data. Study site and patient survival were associated with missingness, suggesting that data were not missing completely at random, and that complete case analysis may lead to biased results. Evaluating clinical prediction model accuracy may be misleading in the presence of missing data, especially with many predictor variables. The proposed sensitivity analysis estimating correct classification under upper (best case scenario) and lower (worst case scenario) bounds may be more informative than multiple imputation, which provided results similar to complete case analysis.
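
The best-case/worst-case sensitivity analysis described above can be illustrated with a small sketch: records with missing predictor values are counted as correctly classified under the best case and as misclassified under the worst case, bracketing the correct classification rate. The data frame layout, the column name and the `predict` callable are hypothetical, not the study's actual models.

```python
import pandas as pd

def correct_classification_bounds(df, predict, outcome="massive_transfusion"):
    """df: one row per patient; NaN marks a missing value.
    predict: callable mapping a complete row to a 0/1 prediction.
    Returns (worst_case, complete_case, best_case) correct-classification rates."""
    complete = df.dropna()
    n_correct = sum(int(predict(row) == row[outcome])
                    for _, row in complete.iterrows())
    n_missing = len(df) - len(complete)
    worst = n_correct / len(df)                  # every incomplete record misclassified
    best = (n_correct + n_missing) / len(df)     # every incomplete record classified correctly
    complete_case = n_correct / len(complete)    # conventional complete case analysis
    return worst, complete_case, best
```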

Relevance: 60.00%

Abstract:

Diabetes mellitus is the set of alterations caused by a defect in the amount of insulin secreted or by a suboptimal use of that insulin. It causes complications in the short, medium and long term that affect the quality of life and reduce the life expectancy of people with diabetes. Diabetes mellitus is currently one of the most important health problems: its prevalence has tripled in the past 20 years, and estimates indicate that it will affect almost 300 million people by 2025. Due to this increased prevalence, as well as to the morbidity and mortality associated with micro- and macrovascular complications, diabetes has become a burden on health systems, their financial resources and their professionals, making the disease a major individual and public health problem. There is currently no cure for the disease, so the therapeutic goal of diabetes treatment focuses on normalizing blood glucose: minimizing hyper- and hypoglycemic events and avoiding, or at least delaying, the appearance and development of vascular complications, which are the main cause of morbidity and mortality among people with diabetes. A suitable, individualized treatment for diabetes involves many factors that need to be considered for each patient: age, physical activity, eating habits, presence of complications related or unrelated to diabetes, cultural factors, etc. However, in the short term, the two most influential variables that the patient can manage in order to control his or her glycemic levels are the administered insulin doses and the diet. Both show a delay between their time of application and the onset of their action, associated with their absorption. Therefore, the ability to predict the evolution of the glycemic profile in the near future can help the patient make appropriate decisions to maintain good control of the disease and avoid risky situations. This is the goal of glucose prediction in diabetes: to anticipate the evolution of the glycemic profile in the near future so as to assist the patient in adapting his or her lifestyle and corrective actions, bringing blood glucose levels closer to those of a healthy person and avoiding the symptoms and complications of poor glucose control. The recent emergence of continuous glucose monitoring systems has provided new alternatives in this field. The availability of an exhaustive record of the variations of the glycemic profile, with a sampling period of between one and five minutes, has enabled the design of new models which seek to predict blood glucose using only previous glucose measurements, or at least significantly reducing the input information required by the algorithms. By requiring less intervention from the patient, new possibilities are opened for the application of glucose predictors, making their use feasible in real time, for example as decision support systems, as detectors of risky situations, or integrated into automatic control algorithms. In this doctoral thesis, different blood glucose prediction algorithms are proposed for patients with diabetes, based on the information recorded by a continuous glucose monitoring system and incorporating information on the administered insulin and the carbohydrate intake.
The proposed algorithms have been evaluated in simulation and using patient data recorded in different clinical studies. To this end, a comprehensive evaluation methodology has been developed that characterizes the performance of the prediction models from every point of view: accuracy, delay, noise and ability to detect risky situations. The necessary simulation tools have been developed, and the patient databases have been analysed and prepared. One of the proposed algorithms has additionally been tested to verify the validity of real-time prediction in a clinical scenario. The tools needed to carry out the defined experimental protocol, in which the patient consults the prediction on demand and retains control over his or her metabolic variables, were also developed. This experiment has made it possible to assess the impact of using glucose prediction on glycemic control.
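
As an illustration of the kind of predictor this thesis is concerned with, the sketch below fits a simple autoregressive model to recent continuous glucose monitoring samples and iterates it forward over a 30-minute horizon. The model order, the sampling period and the absence of insulin and meal inputs are simplifying assumptions, not the algorithms proposed in the thesis.

```python
import numpy as np

def fit_ar(glucose, order=6):
    """Least-squares fit of an AR(order) model to a CGM series (mg/dL, 5-min samples)."""
    X = np.column_stack([glucose[i:len(glucose) - order + i] for i in range(order)])
    y = glucose[order:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_ahead(glucose, coef, steps=6):
    """Iterate the AR model 'steps' samples ahead (6 x 5 min = 30-minute horizon)."""
    window = list(glucose[-len(coef):])
    for _ in range(steps):
        window.append(float(np.dot(coef, window[-len(coef):])))
    return window[len(coef):]
```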

Relevance: 60.00%

Abstract:

Promiscuous human leukocyte antigen (HLA) binding peptides are ideal targets for vaccine development. Existing computational models for prediction of promiscuous peptides used hidden Markov models and artificial neural networks as prediction algorithms. We report a system based on support vector machines that outperforms previously published methods. Preliminary testing showed that it can predict peptides binding to HLA-A2 and -A3 super-type molecules with excellent accuracy, even for molecules where no binding data are currently available.
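
A minimal sketch of a support vector machine peptide binding classifier of this general kind, assuming a one-hot amino-acid encoding of 9-mer peptides and binary binding labels; the encoding, kernel and parameters are illustrative and do not reproduce the published system.

```python
import numpy as np
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptide):
    """One-hot encode a 9-mer peptide into a flat 9 x 20 binary vector."""
    vec = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        vec[pos, AMINO_ACIDS.index(aa)] = 1.0
    return vec.ravel()

def train_binder_model(peptides, labels):
    """peptides: list of 9-mer strings; labels: 1 = binds the HLA supertype, 0 = does not."""
    X = np.vstack([encode(p) for p in peptides])
    model = SVC(kernel="rbf", C=1.0, probability=True)   # RBF-kernel SVM classifier
    model.fit(X, labels)
    return model
```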

Relevance: 60.00%

Abstract:

Very large spatially-referenced datasets, for example those derived from satellite-based sensors which sample across the globe or from large monitoring networks of individual sensors, are becoming increasingly common and more widely available for use in environmental decision making. In large or dense sensor networks, huge quantities of data can be collected over short time periods. In many applications the generation of maps, or predictions at specific locations, from the data in (near) real time is crucial. Geostatistical operations such as interpolation are vital in this map-generation process and, in emergency situations, the resulting predictions need to be available almost instantly so that decision makers can make informed decisions and define risk and evacuation zones. It is also helpful when analysing data in less time-critical applications, for example when interacting directly with the data for exploratory analysis, that the algorithms are responsive within a reasonable time frame. Performing geostatistical analysis on such large spatial datasets can present a number of problems, particularly when maximum likelihood based inference is used. Although the storage requirements only scale linearly with the number of observations in the dataset, the computational complexity in terms of memory and speed scales quadratically and cubically, respectively. Most modern commodity hardware has at least two processor cores, if not more, and other mechanisms for parallel computation, such as Grid-based systems, are also becoming increasingly widely available. However, there currently seems to be little interest in exploiting this extra processing power within the context of geostatistics. In this paper we review the existing parallel approaches for geostatistics. By recognising that different natural parallelisms exist and can be exploited depending on whether the dataset is sparsely or densely sampled with respect to the range of variation, we introduce two contrasting novel implementations of parallel algorithms based on approximating the data likelihood, extending the methods of Vecchia [1988] and Tresp [2000]. Using parallel maximum likelihood variogram estimation and parallel prediction algorithms we show that computational time can be significantly reduced. We demonstrate this with both sparsely and densely sampled data on a variety of architectures, ranging from the common dual-core processor found in many modern desktop computers to large multi-node supercomputers. To highlight the strengths and weaknesses of the different methods we employ synthetic data sets and go on to show how the methods allow maximum likelihood based inference on the exhaustive Walker Lake data set.
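
A minimal sketch of the data-parallel likelihood approximation idea: the dataset is split into spatial blocks, each block's Gaussian log-likelihood is evaluated on a separate core, and the block terms are summed while ignoring cross-block covariance (broadly in the spirit of the approximations cited above). The exponential covariance model and the crude coordinate-based split are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import cdist

def block_loglik(args):
    """Gaussian log-likelihood of one spatial block under an exponential covariance."""
    coords, values, sill, rng, nugget = args
    K = sill * np.exp(-cdist(coords, coords) / rng) + nugget * np.eye(len(values))
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, values)
    return -0.5 * (logdet + values @ alpha + len(values) * np.log(2 * np.pi))

def approx_loglik(coords, values, sill, rng, nugget, n_blocks=8, n_procs=4):
    """Approximate the full log-likelihood as a sum over independent blocks,
    evaluated in parallel; cross-block covariance is ignored."""
    idx = np.array_split(np.argsort(coords[:, 0]), n_blocks)   # crude spatial split
    tasks = [(coords[i], values[i], sill, rng, nugget) for i in idx]
    with Pool(n_procs) as pool:
        return sum(pool.map(block_loglik, tasks))
```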

Relevance: 60.00%

Abstract:

Improvements in genomic technology, both in the increased speed and reduced cost of sequencing, have expanded the appreciation of the abundance of human genetic variation. However, the sheer amount of variation, as well as the varying type and genomic content of variation, poses a challenge in understanding the clinical consequence of a single mutation. This work uses several methodologies to interpret the observed variation in the human genome, and presents novel strategies for the prediction of allele pathogenicity.

Using the zebrafish model system as an in vivo assay of allele function, we identified a novel driver of Bardet-Biedl Syndrome (BBS) in CEP76. A combination of targeted sequencing of 785 cilia-associated genes in a cohort of BBS patients and subsequent in vivo functional assays recapitulating the human phenotype gave strong evidence for the role of CEP76 mutations in the pathology of an affected family. This portion of the work demonstrated the necessity of functional testing in validating disease-associated mutations, and added to the catalogue of known BBS disease genes.

Further study into the role of copy-number variations (CNVs) in a cohort of BBS patients showed the significant contribution of CNVs to disease pathology. Using high-density array comparative genomic hybridization (aCGH) we were able to identify pathogenic CNVs as small as several hundred bp. Dissection of the constituent genes and in vivo experiments investigating epistatic interactions between affected genes allowed for an appreciation of several paradigms by which CNVs can contribute to disease. This study revealed that the contribution of CNVs to disease in BBS patients is much higher than previously expected, and demonstrated the necessity of considering CNV contribution in future (and retrospective) investigations of human genetic disease.

Finally, we used a combination of comparative genomics and in vivo complementation assays to identify second-site compensatory modification of pathogenic alleles. These pathogenic alleles, which are found compensated in other species (termed compensated pathogenic deviations [CPDs]), represent a significant fraction (from 3 to 10%) of human disease-associated alleles. In silico pathogenicity prediction algorithms, a valuable method of allele prioritization, often misrepresent these alleles as benign, leading to the omission of possibly informative variants in studies of human genetic disease. We created a mathematical model that was able to predict CPDs and putative compensatory sites, and functionally showed in vivo that second-site mutation can mitigate the pathogenicity of disease alleles. Additionally, we made publicly available an in silico module for the prediction of CPDs and modifier sites.
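
The comparative-genomics signal behind compensated pathogenic deviations can be illustrated with a small sketch: a human disease-associated residue is flagged as a candidate CPD when it appears as the wild-type residue in orthologues from other species. The alignment representation and function name are hypothetical; this is not the published prediction model.

```python
def candidate_cpd(alignment, position, disease_aa):
    """alignment: dict mapping species name -> aligned protein sequence
    (human included), all of equal length. Returns the species in which the
    human disease-associated residue is the wild-type residue."""
    carriers = []
    for species, seq in alignment.items():
        if species == "human":
            continue
        if seq[position] == disease_aa:
            carriers.append(species)
    return carriers  # non-empty -> candidate compensated pathogenic deviation

# Hypothetical example (one alignment column differs between species):
# aln = {"human": "KVR", "mouse": "KVH", "zebrafish": "KVH"}
# candidate_cpd(aln, 2, "H") -> ["mouse", "zebrafish"]
```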

These studies have advanced the ability to interpret the pathogenicity of multiple types of human variation, as well as made available tools for others to do so as well.

Relevance: 60.00%

Abstract:

Protein-protein interaction (PPI) networks constructed by high-throughput technologies provide opportunities for predicting protein functions. Many approaches and algorithms have been applied to PPI networks to predict the functions of unannotated proteins over recent decades. However, most existing algorithms and approaches do not consider unannotated proteins and their corresponding interactions in the prediction process. On the other hand, algorithms which do make use of unannotated proteins have limited prediction performance. Moreover, current algorithms usually make one-off predictions. In this paper, we propose an iterative approach that utilizes unannotated proteins and their interactions in prediction. We conducted experiments to evaluate the performance and robustness of the proposed iterative approach. The iterative approach improved prediction performance by up to 50%-80% when there was a high proportion of unannotated neighbouring proteins in the network. The iterative approach also showed robustness across various types of protein interaction networks. Importantly, our iterative approach introduces the idea of iteratively incorporating the interaction information of unannotated proteins into protein function prediction, and it can be applied to existing prediction algorithms to improve their prediction performance.
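
A minimal sketch of an iterative neighbour-based predictor in the spirit described: unannotated proteins receive provisional labels from their annotated neighbours, and those provisional labels are fed back into subsequent iterations. The majority-vote rule and the data structures are illustrative assumptions rather than the paper's algorithm.

```python
from collections import Counter

def iterative_function_prediction(graph, annotations, n_iter=5):
    """graph: dict protein -> set of interacting proteins.
    annotations: dict protein -> set of known function labels (may be empty).
    Returns predicted labels for the initially unannotated proteins."""
    labels = {p: set(a) for p, a in annotations.items()}
    unannotated = [p for p, a in annotations.items() if not a]
    for _ in range(n_iter):
        updates = {}
        for p in unannotated:
            votes = Counter()
            for q in graph.get(p, ()):           # includes provisionally labelled neighbours
                for f in labels.get(q, ()):
                    votes[f] += 1
            if votes:
                top = max(votes.values())
                updates[p] = {f for f, c in votes.items() if c == top}
        labels.update(updates)                   # feed predictions into the next iteration
    return {p: labels[p] for p in unannotated}
```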

Relevance: 60.00%

Abstract:

Clinically HER2+ (cHER2+) breast cancer (BC), as exclusively determined by immunohistochemistry of HER2 protein overexpression and/or fluorescence in situ hybridization of HER2 gene amplification, has largely been considered a single disease entity in terms of clinical outcome and of susceptibility to the anti-HER2 monoclonal antibody trastuzumab (Herceptin). However, although the adjuvant/neoadjuvant use of trastuzumab has been shown to significantly reduce recurrence risk when added to standard chemotherapy in women with early-stage cHER2+ BC, not all cases derive similar benefit from trastuzumab, because a significant number of cHER2+ BC patients develop disease recurrence. Unfortunately, the identification of a robust clinical predictor of trastuzumab benefit, including HER2 itself, has proven challenging in the adjuvant/neoadjuvant setting. Thus, we suggest that a new generation of research needs to refine the prognostic taxonomy of cHER2+ BC and develop easy-to-use, clinic-based prediction algorithms to distinguish between good and poor responders to trastuzumab-based therapy ab initio. This study offered two hypotheses: 1.) HER2 overexpression can unexpectedly take place in a molecular background owned by basal-like BC (a commonly HER2-negative BC subtype which possesses many epithelial-mesenchymal transition (EMT) characteristics and exhibits robust cancer stem cell [CSC]-like features), thus generating a so-called basal/cHER2+ BC subtype; 2.) the basal/cHER2+ phenotype confers poor prognosis and delineates a subgroup of intrinsically aggressive cHER2+ BC with primary resistance to trastuzumab...

Relevance: 40.00%

Abstract:

We present an efficient graph-based algorithm for quantifying the similarity of household-level energy use profiles, using a notion of similarity that allows for small time shifts when comparing profiles. Experimental results on a real smart meter data set demonstrate that, in cases of practical interest, our technique is far faster than the existing method for computing the same similarity measure. Having a fast algorithm for measuring profile similarity improves the efficiency of tasks such as clustering of customers and cross-validation of forecasting methods using historical data. Furthermore, we apply a generalisation of our algorithm to produce substantially better household-level energy use forecasts from historical smart meter data.
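
The notion of shift-tolerant similarity can be illustrated with a simple sketch (this is not the paper's graph-based algorithm): a reading in one profile is considered matched if the other profile contains a similar reading within a small window of time shifts, and similarity is the matched fraction. The window size and tolerance are assumptions for illustration.

```python
import numpy as np

def shift_tolerant_similarity(a, b, max_shift=2, tol=0.05):
    """a, b: equal-length arrays of half-hourly energy readings (kWh).
    A reading in 'a' is 'matched' if some reading of 'b' within +/- max_shift
    time steps lies within 'tol' kWh of it. Returns the fraction of matched readings."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    matched = 0
    for i, x in enumerate(a):
        lo, hi = max(0, i - max_shift), min(len(b), i + max_shift + 1)
        if np.any(np.abs(b[lo:hi] - x) <= tol):
            matched += 1
    return matched / len(a)
```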