984 results for Training sets


Relevance: 60.00%

Abstract:

BACKGROUND: Early detection and treatment of colorectal adenomatous polyps (AP) and colorectal cancer (CRC) are associated with decreased CRC mortality. However, accurate, non-invasive and compliant tests to screen for AP and early stages of CRC are not yet available. A blood-based screening test is highly attractive due to its limited invasiveness and high acceptance rate among patients. AIM: To determine whether gene expression signatures in peripheral blood mononuclear cells (PBMC) can detect the presence of AP and early-stage CRC. METHODS: A total of 85 PBMC samples derived from colonoscopy-verified subjects without lesions (controls) (n = 41), with AP (n = 21) or with CRC (n = 23) were used as training sets. A 42-gene panel for CRC and AP discrimination, including genes identified by Digital Gene Expression-tag profiling of PBMC and genes previously characterised and reported in the literature, was validated on the training set by qPCR. Logistic regression analysis followed by bootstrap validation determined CRC- and AP-specific classifiers, which discriminate patients with CRC and AP from controls. RESULTS: The CRC and AP classifiers detected CRC with a sensitivity of 78% and AP with a sensitivity of 46%, respectively. Both classifiers had a specificity of 92%, with very low false-positive rates when applied to subjects with inflammatory bowel disease (n = 23) or tumours other than CRC (n = 14). CONCLUSION: This pilot study demonstrates the potential of developing a minimally invasive, accurate test to screen patients at average risk for colorectal cancer, based on gene expression analysis of peripheral blood mononuclear cells obtained from a simple blood sample.
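
As an illustration of the modelling step described in this abstract, the sketch below fits a logistic regression classifier to a gene-expression matrix and estimates out-of-bag performance by bootstrap resampling. The data shapes mirror the study (85 samples, 42 genes), but all values, labels and the AUC metric are placeholders, not the paper's actual panel or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# X: samples x genes (here 85 PBMC samples, 42-gene panel); y: 1 = CRC, 0 = control
X = rng.normal(size=(85, 42))            # placeholder expression values
y = (rng.random(85) < 0.4).astype(int)   # placeholder labels

aucs = []
for _ in range(200):                                # bootstrap validation
    idx = rng.integers(0, len(y), len(y))           # resample with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)      # out-of-bag samples
    if len(np.unique(y[idx])) < 2 or len(np.unique(y[oob])) < 2:
        continue                                    # need both classes present
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    aucs.append(roc_auc_score(y[oob], clf.predict_proba(X[oob])[:, 1]))

print(f"bootstrap out-of-bag AUC: {np.mean(aucs):.2f} +/- {np.std(aucs):.2f}")
```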

Relevance: 60.00%

Abstract:

This thesis considers a set of methods enabling statistical learning algorithms to better handle the sequential nature of financial portfolio management problems. We begin by considering the general problem of composing learning algorithms that must handle sequential tasks, in particular that of efficiently updating training sets within a sequential-validation framework. We enumerate the desiderata that composition primitives should satisfy and highlight the difficulty of achieving them rigorously and efficiently. We then present a set of algorithms that meet these objectives and present a case study of a complex financial decision-making system using these techniques. Next, we describe a general method for transforming a non-Markovian sequential decision problem into a supervised learning problem using a search algorithm based on the K best paths. We address a portfolio-management application in which we train a learning algorithm to directly optimize a Sharpe ratio (or another non-additive criterion incorporating risk aversion). We illustrate the approach with an in-depth experimental study, proposing a neural-network architecture specialized for portfolio management and comparing it to several alternatives. Finally, we introduce a functional representation of time series that allows forecasts to be made over a variable horizon while using an information set revealed progressively. The approach is based on Gaussian processes, which provide a full covariance matrix between all points for which a forecast is requested. This information is put to good use by an algorithm that actively trades price spreads between commodity futures contracts. The proposed approach produces a significant out-of-sample risk-adjusted return, net of transaction costs, on a portfolio of 30 assets.
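
To make the sequential-validation setting concrete, here is a minimal walk-forward sketch: at each step a decision rule sees only the data revealed so far, and performance is summarized by a Sharpe ratio. The toy momentum rule and simulated prices are illustrative assumptions, not the thesis's trading system.

```python
import numpy as np

def sharpe_ratio(returns, eps=1e-12):
    """Annualized Sharpe ratio of daily returns (risk-free rate taken as 0)."""
    return np.sqrt(252) * returns.mean() / (returns.std() + eps)

rng = np.random.default_rng(1)
prices = 100.0 + np.cumsum(rng.normal(0.05, 1.0, 1000))   # simulated price path

window, realized = 250, []
for t in range(window, len(prices) - 1):
    history = prices[t - window:t]                  # training set: revealed data only
    signal = 1.0 if prices[t] > history.mean() else -1.0   # toy momentum rule
    realized.append(signal * (prices[t + 1] - prices[t]) / prices[t])

print(f"out-of-sample Sharpe ratio: {sharpe_ratio(np.array(realized)):.2f}")
```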

Relevance: 60.00%

Abstract:

Background: This study describes a bioinformatics approach designed to identify Plasmodium vivax proteins potentially involved in reticulocyte invasion. Specifically, different protein training sets were built and tuned based on different biological parameters, such as experimental evidence of secretion and/or involvement in invasion-related processes. A profile-based sequence method supported by hidden Markov models (HMMs) was then used to build classifiers to search for biologically related proteins. The transcriptional profile of the P. vivax intra-erythrocyte developmental cycle was then screened using these classifiers. Results: A bioinformatics methodology for identifying potentially secreted P. vivax proteins was designed using sequence redundancy reduction and probabilistic profiles. This methodology led to the identification of a set of 45 proteins that are potentially secreted during the P. vivax intra-erythrocyte developmental cycle and could be involved in cell invasion. Thirteen of the 45 proteins have already been described as vaccine candidates; there is experimental evidence of protein expression for 7 of the 32 remaining ones, while no previous studies of expression, function or immunology have been carried out for the other 25. Conclusions: The results support the idea that probabilistic techniques like profile HMMs improve similarity searches. Moreover, adjustments such as sequence redundancy reduction using PISCES or CD-HIT allowed data clustering based on rational, reproducible measurements. This kind of approach to selecting proteins with specific functions is highly valuable for supporting large-scale analyses that could aid the identification of genes encoding potential new target antigens for vaccine development and drug design. The present study has led to targeting 32 proteins for further testing of their ability to induce protective immune responses against P. vivax malaria.
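
Profile methods of the kind used here are normally built with dedicated HMM tools; as a simplified stand-in, the sketch below constructs a position-specific scoring matrix (PSSM) from a few aligned training sequences and scores candidates against it. The sequences are invented toy examples, far shorter than real protein profiles.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
A = {aa: i for i, aa in enumerate(ALPHABET)}

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Log-odds PSSM from equal-length aligned sequences (uniform background)."""
    length = len(aligned_seqs[0])
    counts = np.full((length, len(ALPHABET)), pseudocount)
    for seq in aligned_seqs:
        for pos, aa in enumerate(seq):
            counts[pos, A[aa]] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.log(probs / (1.0 / len(ALPHABET)))

def score(pssm, seq):
    """Sum of per-position log-odds scores for a candidate sequence."""
    return sum(pssm[pos, A[aa]] for pos, aa in enumerate(seq))

training = ["MKLV", "MKIV", "MRLV"]      # toy aligned motif from a training set
pssm = build_pssm(training)
for candidate in ["MKLV", "GGGG"]:       # high score = profile-like sequence
    print(candidate, f"{score(pssm, candidate):.2f}")
```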

Relevance: 60.00%

Abstract:

This thesis is framed within the early detection of masses, one of the clearest signs of breast cancer, in mammographic images. First, an extensive analysis of the different methods in the literature was carried out, concluding that these methods depend on three parameters: the size and shape of the mass, and the density of the breast. The objective of the thesis is therefore to analyse, design and implement a detection method that is robust and independent of these three parameters. To this end, a deformable template of the mass was built from the analysis of real masses; this model is then searched for in the images following a probabilistic scheme, yielding a set of suspicious regions. Using 2DPCA analysis, an algorithm was built that is capable of discerning whether these regions really contain a mass or not. Breast density is a parameter that enters the algorithm naturally.
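
For readers unfamiliar with 2DPCA, the classification technique named above, here is a brief sketch under illustrative assumptions: the image covariance matrix is computed directly from the 2D candidate regions (no vectorization), and its leading eigenvectors project each region to a compact feature matrix for a downstream mass/non-mass classifier. The regions below are random placeholders.

```python
import numpy as np

def fit_2dpca(images, n_components):
    """Projection matrix from the image covariance of a stack of 2D images."""
    mean = images.mean(axis=0)
    G = sum((img - mean).T @ (img - mean) for img in images) / len(images)
    eigvals, eigvecs = np.linalg.eigh(G)             # ascending eigenvalue order
    return eigvecs[:, ::-1][:, :n_components], mean  # keep leading eigenvectors

rng = np.random.default_rng(2)
regions = rng.random((100, 32, 32))       # 100 candidate ROIs, 32x32 pixels
W, mean = fit_2dpca(regions, n_components=5)

features = (regions[0] - mean) @ W        # 32x5 feature matrix for one ROI
print(features.shape)                     # flatten and feed to a classifier
```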

Relevance: 60.00%

Abstract:

Can infants below age 1 year learn words in one context and understand them in another? To investigate this question, two groups of parents trained infants from age 9 months on 8 categories of common objects. A control group received no training. At 12 months, infants in the experimental groups, but not in the control group, showed comprehension of the words in a new context. It appears that infants under 1 year old can learn words in a decontextualized, as distinct from a context-bound, fashion. Perceptual variability within the to-be-learned categories, and the perceptual similarity between training sets and the novel test items, did not appear to affect this learning.

Relevance: 60.00%

Abstract:

Whilst radial basis function (RBF) equalizers have been employed to combat the linear and nonlinear distortions in modern communication systems, most of them do not take into account the equalizer's generalization capability. In this paper, it is first proposed that the model's generalization capability can be improved by treating the modelling problem as a multi-objective optimization (MOO) problem, with each objective based on one of several training sets. Then, as a modelling application, a new RBF equalizer learning scheme is introduced based on directional evolutionary MOO (EMOO). Directional EMOO improves the computational efficiency of conventional EMOO, which has been widely applied in solving MOO problems, by explicitly making use of directional information. Computer simulation demonstrates that the new scheme can be used to derive RBF equalizers with good performance not only in explaining the training samples but also in predicting unseen samples.
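
As context for the model family discussed above, this is a minimal RBF equalizer sketch: Gaussian basis functions on randomly chosen centres, with output weights fitted by least squares on a simulated dispersive channel. The channel model, centre count and kernel width are illustrative assumptions, not the paper's design.

```python
import numpy as np

def rbf_design(X, centres, width):
    """Design matrix of Gaussian basis responses for inputs X."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(3)
symbols = rng.choice([-1.0, 1.0], 500)                       # transmitted BPSK
received = 0.8 * symbols + 0.4 * np.roll(symbols, 1) \
           + 0.1 * rng.normal(size=500)                      # dispersive channel

# Equalizer input: pairs of received samples; target: transmitted symbol
X = np.stack([received[1:], received[:-1]], axis=1)
y = symbols[1:]

centres = X[rng.choice(len(X), 16, replace=False)]           # 16 random centres
Phi = rbf_design(X, centres, width=0.5)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                  # output weights

decisions = np.sign(Phi @ w)
print(f"training symbol error rate: {(decisions != y).mean():.3f}")
```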

Relevance: 60.00%

Abstract:

In this paper, a new equalizer learning scheme is introduced based on directional evolutionary multi-objective optimization (EMOO). Whilst nonlinear channel equalizers such as radial basis function (RBF) equalizers have been widely studied to combat the linear and nonlinear distortions in modern communication systems, most of them do not take into account the equalizers' generalization capabilities. In this paper, equalizers are designed with the aim of improving their generalization capabilities. It is proposed that this objective can be achieved by treating the equalizer design problem as a multi-objective optimization (MOO) problem, with each objective based on one of several training sets, and then deriving equalizers capable of recovering the signals for all the training sets. Conventional EMOO, which is widely applied to MOO problems, suffers from disadvantages such as slow convergence. Directional EMOO improves the computational efficiency of conventional EMOO by explicitly making use of directional information. The new equalizer learning scheme based on directional EMOO is applied to RBF equalizer design. Computer simulation demonstrates that the new scheme can be used to derive RBF equalizers with good generalization capabilities, i.e., good performance in predicting unseen samples.
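
The essential multi-objective ingredient here, one error objective per training set, can be illustrated with a plain Pareto-dominance filter; directional EMOO itself adds evolutionary operators and directional search that are beyond this sketch. The candidate costs below are random placeholders (lower is better).

```python
import numpy as np

def non_dominated(costs):
    """Boolean mask of Pareto-optimal rows for a minimization problem."""
    n = len(costs)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            # j dominates i: no worse on every objective, better on at least one
            if i != j and np.all(costs[j] <= costs[i]) and np.any(costs[j] < costs[i]):
                keep[i] = False
                break
    return keep

rng = np.random.default_rng(4)
# Each row: errors of one candidate equalizer on three training sets
costs = rng.random((20, 3))
front = costs[non_dominated(costs)]
print(f"{len(front)} of 20 candidates are Pareto-optimal")
```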

Relevance: 60.00%

Abstract:

Background. Within a therapeutic gene-by-environment (G×E) framework, we recently demonstrated that variation in the serotonin transporter promoter polymorphism (5HTTLPR) and marker rs6330 in the nerve growth factor gene (NGF) is associated with poorer outcomes following cognitive behaviour therapy (CBT) for child anxiety disorders. The aim of this study was to explore one potential means of extending the translational reach of G×E data in a way that may be clinically informative. We describe a ‘risk-index’ approach combining genetic, demographic and clinical data and test its ability to predict diagnostic outcome following CBT in anxious children. Method. DNA and clinical data were collected from 384 children with a primary anxiety disorder undergoing CBT. We tested our risk model in five cross-validation training sets. Results. In predicting treatment outcome, six variables had a mean beta value of at least 0.5: 5HTTLPR, NGF rs6330, gender, primary anxiety severity, comorbid mood disorder and comorbid externalising disorder. A risk index (range 0-8) constructed from these variables had moderate predictive ability (AUC = 0.62-0.69) in this study. Children scoring high on this index (5-8) were approximately three times as likely to retain their primary anxiety disorder at follow-up as children scoring 2 or less. Conclusion. Significant genetic, demographic and clinical predictors of outcome following CBT for anxiety-disordered children were identified. Combining these predictors in a risk index could help identify which children are less likely to be diagnosis-free following CBT alone and thus may require longer or enhanced treatment. The ‘risk-index’ approach represents one means of harnessing the translational potential of G×E data.
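
A minimal sketch of the risk-index construction, assuming each retained predictor is dichotomized and contributes a point; graded items (e.g., severity scored 0-2) would extend the range to 0-8 as in the abstract. The variable names follow the abstract, but the cut-offs and toy data are invented.

```python
import pandas as pd

# One row per child; 1 = risk condition present (invented illustrative data)
children = pd.DataFrame({
    "5HTTLPR_risk": [1, 0, 1],
    "NGF_rs6330_risk": [1, 1, 0],
    "female": [0, 1, 1],
    "high_severity": [1, 1, 0],
    "comorbid_mood": [1, 0, 0],
    "comorbid_externalising": [1, 0, 0],
})

children["risk_index"] = children.sum(axis=1)          # sum of indicators
children["high_risk"] = children["risk_index"] >= 5    # the abstract's high band
print(children[["risk_index", "high_risk"]])
```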

Relevance: 60.00%

Abstract:

The role and function of a given protein depend on its structure. In recent years, however, numerous studies have highlighted the importance of unstructured, or disordered, regions in governing a protein's function. Disordered proteins have been found to play important roles in pivotal cellular functions, such as DNA binding and signalling cascades. Studying proteins with extended disordered regions is often problematic, as they can be challenging to express, purify and crystallise. This means that interpretable experimental data on protein disorder are hard to generate. As a result, predictive computational tools have been developed with the aim of predicting the level and location of disorder within a protein. Currently, over 60 prediction servers exist, utilizing different methods for classifying disorder and different training sets. Here we review several well-performing, publicly available prediction methods, comparing their application and discussing how disorder prediction servers can be used to aid the experimental solution of protein structures. The use of disorder prediction methods allows us to adopt a more targeted approach to experimental studies by accurately identifying the boundaries of ordered protein domains so that they may be investigated separately, thereby increasing the likelihood of their successful experimental solution.
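
As a sketch of how such predictions feed experimental design, the snippet below takes a per-residue disorder score profile (synthetic here; in practice the output of a prediction server), thresholds it, and reports ordered segments long enough to be candidate construct boundaries. The threshold and minimum length are illustrative assumptions.

```python
import numpy as np

def ordered_segments(disorder_scores, threshold=0.5, min_len=30):
    """Yield (start, end) of ordered runs at least min_len residues long."""
    ordered = np.asarray(disorder_scores) < threshold
    start = None
    for i, flag in enumerate(list(ordered) + [False]):   # sentinel closes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                yield (start, i)
            start = None

# Synthetic profile: disordered tails flanking one well-ordered domain
scores = np.concatenate([np.full(40, 0.8), np.full(120, 0.2), np.full(40, 0.9)])
print(list(ordered_segments(scores)))   # -> [(40, 160)]: one ordered domain
```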

Relevance: 60.00%

Abstract:

We present a catalogue of galaxy photometric redshifts and k-corrections for the Sloan Digital Sky Survey Data Release 7 (SDSS-DR7), available on the World Wide Web. The photometric redshifts were estimated with an artificial neural network using the five ugriz bands, concentration indices and Petrosian radii in the g and r bands. We explored our redshift estimates with different training sets, concluding that the best choice for improving redshift accuracy comprises the main galaxy sample (MGS), the luminous red galaxies and the galaxies of active galactic nuclei, covering the redshift range 0 < z < 0.3. For the MGS, the photometric redshift estimates agree with the spectroscopic values within rms = 0.0227. The distribution of photometric redshifts derived in the range 0 < z(phot) < 0.6 agrees well with the model predictions. k-corrections were derived by calibrating the k-correct_v4.2 code results for the MGS with the reference-frame (z = 0.1) (g - r) colours. We adopt a linear dependence of k-corrections on redshift and (g - r) colour that provides suitable distributions of luminosity and colour for galaxies up to redshift z(phot) = 0.6, comparable to results in the literature. Our k-correction estimation procedure is thus a powerful, computationally cheap algorithm capable of reproducing suitable results that can be used for testing galaxy properties at intermediate redshifts using the large SDSS database.
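
A hedged sketch of the photometric-redshift setup: a small feed-forward network regressing redshift on nine photometric features (five ugriz magnitudes plus, as in the catalogue, concentration indices and radii). The data, architecture and synthetic redshift relation are all placeholders, not the paper's trained network.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 9))      # ugriz + 2 concentration indices + 2 radii
# Placeholder spectroscopic redshifts with a weak dependence on the features
z_spec = np.clip(0.15 + 0.05 * X[:, 0] + 0.02 * rng.normal(size=2000), 0.0, 0.3)

X_tr, X_te, z_tr, z_te = train_test_split(X, z_spec, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
net.fit(X_tr, z_tr)

rms = np.sqrt(np.mean((net.predict(X_te) - z_te) ** 2))
print(f"photo-z rms: {rms:.4f}")    # the paper reports rms = 0.0227 for the MGS
```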

Relevance: 60.00%

Abstract:

In order to extend previous SAR and QSAR studies, 3D-QSAR analysis has been performed using CoMFA and CoMSIA approaches applied to a set of 39 alpha-(N)-heterocyclic carboxaldehyde thiosemicarbazones, with inhibitory activity values (IC(50)) evaluated against ribonucleotide reductase (RNR) of H.Ep.-2 cells (human epidermoid carcinoma) taken from the selected literature. Both rigid and field alignment methods, taking the unsubstituted 2-formylpyridine thiosemicarbazone in its syn conformation as template, have been used to generate multiple predictive CoMFA and CoMSIA models derived from training sets and validated with the corresponding test sets. Acceptable predictive correlation coefficients (Q(cv)(2) from 0.360 to 0.609 for CoMFA and from 0.394 to 0.580 for CoMSIA models) with high fitted correlation coefficients (r(2) from 0.881 to 0.981 for CoMFA and from 0.938 to 0.993 for CoMSIA models) and low standard errors (s from 0.135 to 0.383 for CoMFA and from 0.098 to 0.240 for CoMSIA models) were obtained. More precise CoMFA and CoMSIA models were derived by considering the subset of thiosemicarbazones (TSC) substituted only at the 5-position of the pyridine ring (n = 22). Reasonable predictive correlation coefficients (Q(cv)(2) from 0.486 to 0.683 for CoMFA and from 0.565 to 0.791 for CoMSIA models) with high fitted correlation coefficients (r(2) from 0.896 to 0.997 for CoMFA and from 0.991 to 0.998 for CoMSIA models) and very low standard errors (s from 0.040 to 0.179 for CoMFA and from 0.029 to 0.068 for CoMSIA models) were obtained. The stability of the CoMFA and CoMSIA models was further assessed by bootstrapping analysis. For the two sets, the generated CoMSIA models showed, in general, better statistics than the corresponding CoMFA models. The analysis of the CoMFA and CoMSIA contour maps suggests that a hydrogen-bond acceptor near the nitrogen of the pyridine ring can enhance inhibitory activity. This observation agrees with literature data suggesting that the pyridine nitrogen lone pair can complex with the iron ion, leading to species that inhibit RNR. The derived CoMFA and CoMSIA models contribute to understanding the structural features of this class of TSC as antitumor agents in terms of steric, electrostatic, hydrophobic, hydrogen-bond donor and hydrogen-bond acceptor fields, as well as to the rational design of inhibitors of this key enzyme.
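
The cross-validated Q(cv)(2) statistic quoted above is one minus PRESS over the total sum of squares under leave-one-out; the sketch below computes it with a generic PLS regression (the regression engine typically behind CoMFA/CoMSIA), on random placeholder descriptors rather than actual field values.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(6)
X = rng.normal(size=(39, 50))                   # 39 TSCs x 50 field descriptors
y = X[:, :3].sum(axis=1) + 0.3 * rng.normal(size=39)   # placeholder activities

press, ss = 0.0, float(((y - y.mean()) ** 2).sum())
for train, test in LeaveOneOut().split(X):
    model = PLSRegression(n_components=3).fit(X[train], y[train])
    press += ((model.predict(X[test]).ravel() - y[test]) ** 2).item()

print(f"Q^2(LOO) = {1 - press / ss:.3f}")       # 1 - PRESS/SS
```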

Relevance: 60.00%

Abstract:

Web data extraction systems are the kernel of information mediators between users and heterogeneous Web data resources. How to extract structured data from semi-structured documents has been an active research problem. Supervised and unsupervised methods have been devised to learn extraction rules from training sets. However, preparing training sets (and especially annotating them for supervised methods) is very time-consuming. We propose a framework for Web data extraction that logs users' access histories and exploits them to assist automatic training-set generation. We cluster accessed Web documents according to their structural details, define criteria to measure the importance of sub-structures, and then generate extraction rules. We also propose a method to adjust the rules according to historical data. Our experiments confirm the viability of our proposal.
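
A small sketch of the structural-clustering idea, under illustrative assumptions: each document is summarized by its bag of HTML tag paths, and a simple overlap measure groups structurally similar pages, from which extraction rules could then be induced. The documents and the similarity measure are invented examples, not the framework's actual algorithm.

```python
from collections import Counter
from html.parser import HTMLParser

class TagPaths(HTMLParser):
    """Collect the multiset of root-to-node tag paths in a document."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], Counter()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def profile(html):
    p = TagPaths()
    p.feed(html)
    return p.paths

def similarity(a, b):
    """Dice overlap of two tag-path multisets."""
    shared = sum((a & b).values())
    return 2 * shared / (sum(a.values()) + sum(b.values()))

doc1 = "<html><body><div><p>x</p><p>y</p></div></body></html>"
doc2 = "<html><body><div><p>a</p><p>b</p></div></body></html>"
doc3 = "<html><body><table><tr><td>1</td></tr></table></body></html>"
# Structurally identical pages score higher than structurally different ones
print(similarity(profile(doc1), profile(doc2)) > similarity(profile(doc1), profile(doc3)))
```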

Relevance: 60.00%

Abstract:

A system that can automatically detect nodules within lung images may assist expert radiologists in interpreting abnormal patterns as nodules in 2D CT lung images. A system is presented that can automatically identify nodules of various sizes within lung images. A pattern-classification approach is employed to develop the proposed system. A random forest ensemble classifier is formed, consisting of many weak learners, each of which grows a decision tree; the forest selects the decision with the most votes. The developed system consists of two random forest classifiers connected in series. A subset of CT lung images from the LIDC database, consisting of 5721 images, is employed to train and test the system; 411 of these images contain nodules identified by expert radiologists. Training sets consisting of nodule, non-nodule and false-detection patterns are constructed, and a collection of test images is also built. The first classifier is developed to detect all nodules; the second is developed to eliminate the false detections produced by the first. According to the experimental results, a true-positive rate of 100% and a false-positive rate of 1.4 per lung image are achieved.
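
The two-stage cascade can be sketched as follows: a first random forest flags candidates, and a second forest, trained on true nodules versus false detections, filters the first stage's positives. Features and labels below are random placeholders for the image patterns used in the paper; in practice the stage-2 training patterns come from a separately constructed false-detection set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 20))                               # patch features
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1).astype(int)   # 1 = nodule

# Stage 1: detect all candidate nodules
stage1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
flagged = stage1.predict(X).astype(bool)

# Stage 2: among stage-1 positives, separate true nodules from false detections
stage2 = RandomForestClassifier(n_estimators=100, random_state=1)
stage2.fit(X[flagged], y[flagged])

final = np.zeros_like(y)
final[flagged] = stage2.predict(X[flagged])
print(f"stage 1 flagged {flagged.sum()}, stage 2 kept {final.sum()}")
```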

Relevance: 60.00%

Abstract:

One of the fundamental machine learning tasks is predictive classification. Given that organisations collect an ever increasing amount of data, predictive classification methods must be able to handle large amounts of data effectively and efficiently. However, present requirements push existing algorithms to, and sometimes beyond, their limits, since many classification prediction algorithms were designed when currently common data set sizes were beyond imagination. This has led to a significant amount of research into ways of making classification learning algorithms more effective and efficient. Although substantial progress has been made, a number of key questions remain unanswered. This dissertation investigates two of them. The first is whether different types of algorithms to those currently employed are required when using large data sets. This is answered by analysing how the bias-plus-variance decomposition of predictive classification error changes as training set size increases. Experiments find that larger training sets require different types of algorithms to those currently used. Some insight into the characteristics of suitable algorithms is provided, which may give direction to the development of future classification prediction algorithms specifically designed for use with large data sets. The second question investigated is the role of sampling in machine learning with large data sets. Sampling has long been used as a means of avoiding the need to scale up algorithms to suit the size of the data set, by scaling down the size of the data set to suit the algorithm. However, the costs of performing sampling have not been widely explored. Two popular sampling methods are compared with learning from all available data in terms of predictive accuracy, model complexity and execution time. The comparison shows that sub-sampling generally produces models with accuracy close to, and sometimes greater than, that obtainable from learning with all available data. This result suggests that it may be possible to develop algorithms that take advantage of the sub-sampling methodology to reduce the time required to infer a model while sacrificing little if any accuracy. Methods of improving effective and efficient learning via sampling are also investigated, and new sampling methodologies are proposed. These include using a varying proportion of instances to determine the next inference step, and using a statistical calculation at each inference step to determine a sufficient sample size. Experiments show that using a statistical calculation of sample size can substantially reduce execution time with only a small loss, and occasional gain, in accuracy. One of the common uses of sampling is in the construction of learning curves. Learning curves are often used to attempt to determine the optimal training size that will maximally reduce execution time without being detrimental to accuracy. An analysis of the performance of methods for detecting convergence of learning curves is performed, focusing on methods that calculate the gradient of the tangent to the curve. Given that such methods can be susceptible to local accuracy plateaus, an investigation into the frequency of local plateaus is also performed. It is shown that local accuracy plateaus are a common occurrence, and that ensuring a small loss of accuracy often results in greater computational cost than learning from all available data. These results cast doubt on the applicability of gradient-of-tangent methods for detecting convergence, and on the viability of learning curves for reducing execution time in general.
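
To make the learning-curve convergence test concrete, a brief sketch under illustrative assumptions: accuracy is measured at geometrically growing training sizes, and convergence is declared when the gradient of the local tangent (the slope between successive points) drops below a tolerance, which is exactly the kind of test the plateau results above caution against.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(20000, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)            # placeholder task
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

sizes, curve = [250, 500, 1000, 2000, 4000, 8000, len(y_tr)], []
for n in sizes:
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
    curve.append(clf.score(X_te, y_te))
    if len(curve) >= 2:
        slope = (curve[-1] - curve[-2]) / (n - sizes[len(curve) - 2])
        if abs(slope) < 1e-6:                       # tangent gradient ~ 0
            print(f"declared converged at n = {n} (beware local plateaus)")
            break

print("learning curve:", [round(a, 3) for a in curve])
```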

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Spatial activity recognition in everyday environments is particularly challenging due to noise introduced during video tracking. We address this noise issue with a biologically inspired chemotactic model that is capable of handling noisy data. The model is based on bacterial chemotaxis, a process that allows bacteria to survive by changing their motile behaviour in response to environmental dynamics. Using chemotactic principles, we propose the chemotactic model and evaluate its classification performance in a smart house environment. The model exhibits high classification accuracy (99%) on a diverse 10-class activity dataset and outperforms the discrete hidden Markov model (HMM). High accuracy (>89%) is also maintained across small training sets and when varying degrees of artificial noise are incorporated into the testing sequences. Importantly, unlike other bottom-up spatial activity recognition models, the chemotactic model is capable of recognizing simple interwoven activities.
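
For intuition about the biological analogy, here is a toy run-and-tumble simulation of the chemotactic behaviour the model draws on: the agent keeps its heading while a noisy "attractant" signal improves and tumbles to a random heading when it worsens, so it climbs gradients despite noise. This illustrates the principle only, not the paper's recognition algorithm.

```python
import numpy as np

rng = np.random.default_rng(9)
goal = np.array([10.0, 10.0])            # attractant source
pos = np.zeros(2)
heading = rng.uniform(0, 2 * np.pi)
prev_signal = -np.inf

for _ in range(500):
    # Noisy attractant signal: stronger (less negative) closer to the source
    signal = -np.linalg.norm(goal - pos) + rng.normal(0, 0.5)
    if signal < prev_signal:             # signal worsened: tumble to new heading
        heading = rng.uniform(0, 2 * np.pi)
    prev_signal = signal
    pos += 0.1 * np.array([np.cos(heading), np.sin(heading)])   # run

print(f"final distance to source: {np.linalg.norm(goal - pos):.2f}")
```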