781 results for Machine Learning Algorithm


Relevance:

100.00% 100.00%

Publisher:

Abstract:

We present a general approach to forming structure-activity relationships (SARs). This approach is based on representing chemical structure by atoms and their bond connectivities in combination with the inductive logic programming (ILP) algorithm PROGOL. Existing SAR methods describe chemical structure by using attributes which are general properties of an object. It is not possible to map chemical structure directly to attribute-based descriptions, as such descriptions have no internal organization. A more natural and general way to describe chemical structure is to use a relational description, where the internal construction of the description maps that of the object described. Our atom and bond connectivities representation is a relational description. ILP algorithms can form SARs with relational descriptions. We have tested the relational approach by investigating the SARs of 230 aromatic and heteroaromatic nitro compounds. These compounds had been split previously into two subsets, 188 compounds that were amenable to regression and 42 that were not. For the 188 compounds, a SAR was found that was as accurate as the best statistical or neural network-generated SARs. The PROGOL SAR has the advantages that it did not need the use of any indicator variables handcrafted by an expert, and the generated rules were easily comprehensible. For the 42 compounds, PROGOL formed a SAR that was significantly (P < 0.025) more accurate than linear regression, quadratic regression, and back-propagation. This SAR is based on an automatically generated structural alert for mutagenicity.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-06

Relevance:

100.00% 100.00%

Publisher:

Abstract:

We present the results of applying automated machine learning techniques to the problem of matching different object catalogues in astrophysics. In this study, we take two partially matched catalogues where one of the two catalogues has a large positional uncertainty. The two catalogues we used here were taken from the H I Parkes All Sky Survey (HIPASS) and SuperCOSMOS optical survey. Previous work had matched 44 per cent (1887 objects) of HIPASS to the SuperCOSMOS catalogue. A supervised learning algorithm was then applied to construct a model of the matched portion of our catalogue. Validation of the model shows that we achieved a good classification performance (99.12 per cent correct). Applying this model to the unmatched portion of the catalogue found 1209 new matches. This increases the catalogue size from 1887 matched objects to 3096. The combination of these procedures yields a catalogue that is 72 per cent matched.
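The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: the total catalogue size is not stated in the abstract, so it is estimated here from the rounded "44 per cent" figure, and the variable names are hypothetical.

```python
# Figures taken from the abstract.
previously_matched = 1887   # objects matched by previous work
new_matches = 1209          # matches added by the learned model
previous_fraction = 0.44    # "44 per cent" (a rounded value)

# Assumption: the catalogue total is back-computed from the rounded percentage,
# so it is an estimate and not a number reported by the authors.
estimated_total = round(previously_matched / previous_fraction)

total_matched = previously_matched + new_matches
matched_fraction = total_matched / estimated_total

print(total_matched)                    # matched objects after applying the model
print(round(matched_fraction * 100))    # matched percentage, to the nearest per cent
```

Under this estimate the sum and the final percentage agree with the 3096 objects and 72 per cent reported in the abstract.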

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. Though visual methods developed in information visualization have been helpful, for improved understanding of a complex large high-dimensional dataset, there is a need for an effective projection of such a dataset onto a lower-dimension (2D or 3D) manifold. This paper introduces a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain and visual techniques developed in the information visualization domain. The framework follows Shneiderman’s mantra to provide an effective user interface. The advantage of such an interface is that the user is directly involved in the data mining process. We integrate principled projection methods, such as Generative Topographic Mapping (GTM) and Hierarchical GTM (HGTM), with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, billboarding, and user interaction facilities, to provide an integrated visual data mining framework. Results on a real life high-dimensional dataset from the chemoinformatics domain are also reported and discussed. Projection results of GTM are analytically compared with the projection results from other traditional projection methods, and it is also shown that the HGTM algorithm provides additional value for large datasets. The computational complexity of these algorithms is discussed to demonstrate their suitability for the visual data mining framework.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Background and aims: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.

Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: ‘semi-structured’ and ‘unstructured’. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.

Results: The best result of 99.4% accuracy – which included only one semi-structured report predicted as unstructured – was produced by the layout classifier with the k-nearest neighbours algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports, precision ranged from 0.97 to 0.94 and recall from 0.92 to 0.83, while for unstructured reports precision ranged from 0.91 to 0.64 and recall from 0.68 to 0.41. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured ones.

Conclusions: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Evolutionary algorithms alone cannot solve optimization problems very efficiently, since they make many random (not very rational) decisions. Combinations of evolutionary algorithms and other techniques have proven to be an efficient optimization methodology. In this talk, I will explain the basic ideas of our three algorithms along this line: (1) the orthogonal genetic algorithm, which treats crossover/mutation as an experimental design problem; (2) the multiobjective evolutionary algorithm based on decomposition (MOEA/D), which uses decomposition techniques from traditional mathematical programming in multiobjective evolutionary optimization; and (3) the regular model-based multiobjective estimation of distribution algorithm (RM-MEDA), which uses the regularity property and machine learning methods to improve multiobjective evolutionary algorithms.
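To make the idea of decomposition concrete, the sketch below uses the weighted Tchebycheff scalarization, one decomposition technique commonly paired with MOEA/D; the abstract does not say which technique the talk covers, so this choice is an assumption. Each weight vector defines one single-objective subproblem. The candidate objective vectors, weight vectors and ideal point are hypothetical toy values.

```python
def tchebycheff(objectives, weights, ideal):
    """Weighted Tchebycheff value: max_i weights[i] * |objectives[i] - ideal[i]|."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))


# Hypothetical two-objective minimisation problem with ideal point (0, 0).
ideal_point = (0.0, 0.0)
weight_vectors = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]  # one subproblem per vector
candidates = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]      # objective vectors f(x)

# For each subproblem, pick the candidate with the smallest scalarized value.
best_per_subproblem = [
    min(candidates, key=lambda f: tchebycheff(f, w, ideal_point))
    for w in weight_vectors
]

for w, best in zip(weight_vectors, best_per_subproblem):
    print(w, "->", best)
```

Different weight vectors select different trade-off points, which is how a set of scalar subproblems can jointly approximate a Pareto front.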

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this thesis, a machine learning approach was used to develop a predictive model for residual methanol concentration in industrial formalin produced at the Akzo Nobel factory in Kristinehamn, Sweden. The MATLAB™ computational environment, supplemented with the Statistics and Machine Learning Toolbox™ from MathWorks, was used to test various machine learning algorithms on the formalin production data from Akzo Nobel. The Gaussian Process Regression algorithm was found to provide the best results and was used to create the predictive model. The model was compiled into a stand-alone application with a graphical user interface using MATLAB Compiler™.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The aim of this thesis project is to automatically localize HCC tumors in the human liver and subsequently predict whether the tumor will undergo microvascular infiltration (MVI), the initial stage of metastasis development. The input data for the work were partially supplied by Sant'Orsola Hospital and partially downloaded from online medical databases. Two U-Net models were implemented for the automatic segmentation of the liver and of the HCC malignancies within it. The segmentation models were evaluated with the Intersection-over-Union (IOU) and Dice Coefficient (DC) metrics. The outcomes obtained for the automatic liver segmentation are quite good (IOU = 0.82; DC = 0.35); the outcomes obtained for the automatic tumor segmentation (IOU = 0.35; DC = 0.46) are, instead, affected by some limitations: the algorithm is almost always able to detect the location of the tumor, but it tends to underestimate its dimensions. The purpose of the segmentation is to obtain the CT images of the HCC tumors, which are necessary for feature extraction. The 14 Haralick features calculated from the 3D-GLCM, the 120 radiomic features and the patients' clinical information are collected to build a dataset of 153 features. The goal is then to build a model able to discriminate, based on the given features, between the tumors that will undergo MVI and those that will not. This task can be seen as a classification problem: each tumor needs to be classified as either “MVI positive” or “MVI negative”. Feature selection techniques are implemented to identify the most descriptive features for the problem at hand, and a set of classification models is then trained and compared. The models with the best performance (around 80-84% ± 8-15%) are the XGBoost Classifier, the SGD Classifier and the Logistic Regression models (without penalization and with Lasso, Ridge or Elastic Net penalization).
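The two segmentation metrics named in this abstract can be illustrated with a minimal sketch. It assumes binary masks flattened to equal-length sequences of 0/1 values; the toy masks are hypothetical and are not data from the thesis.

```python
def intersection_over_union(pred, truth):
    """IoU = |pred AND truth| / |pred OR truth| for flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 1.0  # two empty masks agree perfectly


def dice_coefficient(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 2 * inter / total if total else 1.0


# Hypothetical example: the prediction finds the tumor but underestimates its extent.
truth = [0, 1, 1, 1, 1, 0]
pred = [0, 0, 1, 1, 0, 0]

iou = intersection_over_union(pred, truth)
dice = dice_coefficient(pred, truth)
print(iou, dice)
```

For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never smaller than IoU on the same pair; values averaged over a dataset need not satisfy the equality exactly.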

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In the framework of industrial problems, Constrained Optimization is known to have very good overall modeling capability and performance, and it stands as one of the most powerful, explored, and exploited tools for addressing prescriptive tasks. The number of applications is huge, ranging from logistics to transportation, packing, production, telecommunication, scheduling, and much more. The main reason behind this success lies in the remarkable effort made in recent decades by the OR community to develop realistic models and devise exact or approximate methods for solving the largest variety of constrained or combinatorial optimization problems (COPs), together with the spread of computational power and easily accessible OR software and resources. On the other hand, technological advancements have led to a wealth of data never seen before and increasingly push towards methods able to extract useful knowledge from it; among the data-driven methods, Machine Learning techniques appear to be among the most promising, thanks to their successes in domains like Image Recognition, Natural Language Processing and game playing, but also to the amount of research involved. The purpose of the present research is to study how Machine Learning and Constrained Optimization can be used together to achieve systems able to leverage the strengths of both methods: this would open the way to exploiting decades of research on resolution techniques for COPs and to constructing models able to adapt and learn from available data. In the first part of this work, we survey the existing techniques and classify them according to the type, method, or scope of the integration; subsequently, we introduce Moving Target, a novel and general algorithm devised to inject knowledge into learning models through constraints. In the last part of the thesis, two applications stemming from real-world projects and carried out in collaboration with Optit are presented.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Whole Exome Sequencing (WES) is rapidly becoming the first-tier test in clinics, thanks both to its declining costs and to the development of new platforms that help clinicians in the analysis and interpretation of SNVs and InDels. However, we still know very little about how CNV detection could increase the diagnostic yield of WES. A plethora of exome CNV callers have been published over the years, all showing good performance on specific CNV classes and sizes, suggesting that a combination of multiple tools is needed to obtain good overall detection performance. Here we present TrainX, an ML-based method for calling heterozygous CNVs in WES data using EXCAVATOR2 Normalized Read Counts. We select non-pseudo-autosomal chromosome X alignments from males and females to construct our dataset and train our model, make predictions on autosomal target regions, and use an HMM to call CNVs. We compared TrainX against a set of CNV tools that differ in detection method (GATK4 gCNV, ExomeDepth, DECoN, CNVkit and EXCAVATOR2) and found that our algorithm outperformed them in terms of stability, as we identified both deletions and duplications with good scores (F1-scores of 0.87 and 0.82, respectively) and for sizes down to the minimum resolution of 2 target regions. We also evaluated the robustness of the method using a set of WES and SNP array data (n=251), part of the Italian cohort of the Epi25 Collaborative, and were able to retrieve all clinical CNVs previously identified by the SNP array. TrainX showed good accuracy in detecting heterozygous CNVs of different sizes, making it a promising tool for use in a diagnostic setting.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The Three-Dimensional Single-Bin-Size Bin Packing Problem is one of the most studied problems in the Cutting & Packing category. From a strictly mathematical point of view, it consists of packing a finite set of strongly heterogeneous “small” boxes, called items, into a finite set of identical “large” rectangles, called bins, minimizing the unused volume and requiring that the items are packed without overlapping. The great interest in the problem is mainly due to the number of real-world applications in which it arises, such as pallet and container loading, cutting objects out of a piece of material, and packaging design. Depending on the application, additional objective functions and practical constraints may be needed. After a brief discussion of the real-world applications of the problem and an exhaustive literature review, the design of a two-stage algorithm to solve the problem is presented. The algorithm must be able to provide the spatial coordinates of the vertices of the placed boxes and the optimal input sequence of the boxes, while satisfying geometric, stability and fragility constraints and keeping the computational time low. Due to the NP-hard complexity of this type of combinatorial problem, a fusion of metaheuristic and machine learning techniques is adopted. In particular, a hybrid genetic algorithm coupled with a feedforward neural network is used. In the first stage, a rich dataset is created starting from a set of real input instances provided by an industrial company, and the feedforward neural network is trained on it. After training, given a new input instance, the hybrid genetic algorithm runs using the neural network output as its input parameter vector and provides the optimal solution as output. The effectiveness of the proposed work is confirmed via several experimental tests.
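The basic geometric requirements stated in this abstract (items inside the bin, no overlap, unused volume as the objective) can be sketched as follows. This is a minimal illustration under simplifying assumptions: axis-aligned boxes, a single bin, and no stability or fragility constraints. The `Box` class and helper names are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    # Minimum corner (x, y, z) and edge lengths (dx, dy, dz) of an axis-aligned box.
    x: float
    y: float
    z: float
    dx: float
    dy: float
    dz: float

    @property
    def volume(self):
        return self.dx * self.dy * self.dz


def overlaps(a, b):
    """True if the interiors of a and b intersect (touching faces do not count)."""
    return (a.x < b.x + b.dx and b.x < a.x + a.dx
            and a.y < b.y + b.dy and b.y < a.y + a.dy
            and a.z < b.z + b.dz and b.z < a.z + a.dz)


def inside_bin(box, bin_size):
    """True if the box lies entirely within a bin whose corner is at the origin."""
    bx, by, bz = bin_size
    return (box.x >= 0 and box.y >= 0 and box.z >= 0
            and box.x + box.dx <= bx
            and box.y + box.dy <= by
            and box.z + box.dz <= bz)


def is_feasible(items, bin_size):
    """Geometric feasibility only: every item inside the bin, no pair overlapping."""
    return (all(inside_bin(i, bin_size) for i in items)
            and not any(overlaps(a, b) for a, b in combinations(items, 2)))


def unused_volume(items, bin_size):
    bx, by, bz = bin_size
    return bx * by * bz - sum(i.volume for i in items)


bin_size = (10, 10, 10)
a = Box(0, 0, 0, 5, 5, 5)
b = Box(5, 0, 0, 5, 5, 5)   # shares a face with a, which is allowed
c = Box(4, 0, 0, 2, 2, 2)   # cuts into both a and b

print(is_feasible([a, b], bin_size), unused_volume([a, b], bin_size))
print(is_feasible([a, c], bin_size))
```

A search procedure such as a genetic algorithm would call checks like these on each candidate placement; how the thesis encodes placements and the additional constraints is not described in the abstract.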

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In recent decades, two prominent trends have influenced the data modeling field, namely network analysis and machine learning. This thesis explores the practical applications of these techniques within the domain of drug research, unveiling their multifaceted potential for advancing our comprehension of complex biological systems. The research undertaken during this PhD program is situated at the intersection of network theory, computational methods, and drug research. Across six projects presented herein, there is a gradual increase in model complexity. These projects traverse a diverse range of topics, with a specific emphasis on drug repurposing and safety in the context of neurological diseases. The aim of these projects is to leverage existing biomedical knowledge to develop innovative approaches that bolster drug research. The investigations have produced practical solutions, not only providing insights into the intricacies of biological systems, but also allowing the creation of valuable tools for their analysis. In short, the achievements are:

• A novel computational algorithm to identify adverse events specific to fixed-dose drug combinations.
• A web application that tracks the clinical drug research response to SARS-CoV-2.
• A Python package for differential gene expression analysis and the identification of key regulatory "switch genes".
• The identification of pivotal events causing drug-induced impulse control disorders linked to specific medications.
• An automated pipeline for discovering potential drug repurposing opportunities.
• The creation of a comprehensive knowledge graph and development of a graph machine learning model for predictions.

Collectively, these projects illustrate diverse applications of data science and network-based methodologies, highlighting the profound impact they can have in supporting drug research activities.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Background: There is wide variation in the recurrence risk of non-small-cell lung cancer (NSCLC) within the same Tumor Node Metastasis (TNM) stage, suggesting that other parameters are involved in determining this probability. Radiomics allows the extraction of quantitative information from images that can be used for clinical purposes. The primary objective of this study is to develop a radiomic prognostic model that predicts 3-year disease-free survival (DFS) of resected Early Stage (ES) NSCLC patients.

Material and Methods: 56 pre-surgery non-contrast Computed Tomography (CT) scans were retrieved from the PACS of our institution and anonymized. They were then automatically segmented with an open-access deep learning pipeline and reviewed by an experienced radiologist to obtain 3D masks of the NSCLC. Images and masks underwent resampling, normalization and discretization. From the masks, hundreds of Radiomic Features (RF) were extracted using PyRadiomics. The RF were then reduced to select the most representative features. The remaining RF were used in combination with clinical parameters to build a DFS prediction model using Leave-one-out cross-validation (LOOCV) with Random Forest.

Results and Conclusion: Poor agreement between the radiologist and the automatic segmentation algorithm (DICE score of 0.37) was found. Therefore, another experienced radiologist manually segmented the lesions, and only stable and reproducible RF were kept. 50 RF demonstrated a high correlation with DFS, but only one was confirmed when clinicopathological covariates were added: Busyness, a Neighbouring Gray Tone Difference Matrix feature (HR 9.610). 16 clinical variables (which comprised TNM) were used to build the LOOCV model, which demonstrated a higher Area Under the Curve (AUC) when RF were included in the analysis (0.67 vs 0.60), but the difference was not statistically significant (p = 0.5147).
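Leave-one-out cross-validation, the evaluation scheme named in this abstract, holds out each patient in turn, fits the model on the remaining patients, and scores the held-out prediction. The sketch below shows only that loop. To stay dependency-free it swaps the study's Random Forest for a 1-nearest-neighbour stand-in, and the feature values and labels are hypothetical toy data.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier (NOT the Random Forest used in the study):
    return the label of the closest training sample by squared Euclidean distance."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_x]
    return train_y[distances.index(min(distances))]


def loocv_accuracy(features, labels, fit_predict):
    """Hold out each sample once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        prediction = fit_predict(train_x, train_y, features[i])
        correct += int(prediction == labels[i])
    return correct / len(features)


# Hypothetical one-feature dataset with two well-separated classes.
features = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]

accuracy = loocv_accuracy(features, labels, nearest_neighbour_predict)
print(accuracy)
```

With a small cohort such as the 56 scans described above, LOOCV uses almost all of the data for training in every fold, at the cost of fitting the model once per patient.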

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In recent times, significant research effort has been focused on how deformable linear objects (DLOs) can be manipulated for real-world applications such as the assembly of wiring harnesses for the automotive and aerospace sectors. This remains an open topic because of the difficulty of accurately modelling the behaviour of these objects and of simulating tasks involving their manipulation across a variety of scenarios. These problems have led to the development of data-driven approaches in which machine learning techniques are exploited to obtain reliable solutions. However, this approach makes the solution difficult to extend, since the learning must be repeated almost from scratch as the scenario changes. It follows that some model-based methodology must be introduced to generalize the results and reduce the training effort accordingly. The objective of this thesis is to develop a solution for DLO manipulation to assemble a wiring harness for the automotive sector, based on the adaptation of a base trajectory set by means of reinforcement learning methods. The idea is to create trajectory planning software capable of solving the proposed task while reducing, where possible, the learning time, which takes place in real time, and at the same time offering suitable performance and reliability. The solution has been implemented on a collaborative 7-DOF Panda robot at the Laboratory of Automation and Robotics of the University of Bologna. Experimental results are reported showing that the robot is capable of optimizing the manipulation of the DLOs by gaining experience as the task is repeated, while achieving a high success rate from the very beginning of the learning phase.