909 resultados para Data accuracy
Resumo:
Deep learning methods are extremely promising machine learning tools to analyze neuroimaging data. However, their potential use in clinical settings is limited because of the existing challenges of applying these methods to neuroimaging data. In this study, first a data leakage type caused by slice-level data split that is introduced during training and validation of a 2D CNN is surveyed and a quantitative assessment of the model’s performance overestimation is presented. Second, an interpretable, leakage-fee deep learning software written in a python language with a wide range of options has been developed to conduct both classification and regression analysis. The software was applied to the study of mild cognitive impairment (MCI) in patients with small vessel disease (SVD) using multi-parametric MRI data where the cognitive performance of 58 patients measured by five neuropsychological tests is predicted using a multi-input CNN model taking brain image and demographic data. Each of the cognitive test scores was predicted using different MRI-derived features. As MCI due to SVD has been hypothesized to be the effect of white matter damage, DTI-derived features MD and FA produced the best prediction outcome of the TMT-A score which is consistent with the existing literature. In a second study, an interpretable deep learning system aimed at 1) classifying Alzheimer disease and healthy subjects 2) examining the neural correlates of the disease that causes a cognitive decline in AD patients using CNN visualization tools and 3) highlighting the potential of interpretability techniques to capture a biased deep learning model is developed. Structural magnetic resonance imaging (MRI) data of 200 subjects was used by the proposed CNN model which was trained using a transfer learning-based approach producing a balanced accuracy of 71.6%. Brain regions in the frontal and parietal lobe showing the cerebral cortex atrophy were highlighted by the visualization tools.
Resumo:
With the advent of new technologies it is increasingly easier to find data of different nature from even more accurate sensors that measure the most disparate physical quantities and with different methodologies. The collection of data thus becomes progressively important and takes the form of archiving, cataloging and online and offline consultation of information. Over time, the amount of data collected can become so relevant that it contains information that cannot be easily explored manually or with basic statistical techniques. The use of Big Data therefore becomes the object of more advanced investigation techniques, such as Machine Learning and Deep Learning. In this work some applications in the world of precision zootechnics and heat stress accused by dairy cows are described. Experimental Italian and German stables were involved for the training and testing of the Random Forest algorithm, obtaining a prediction of milk production depending on the microclimatic conditions of the previous days with satisfactory accuracy. Furthermore, in order to identify an objective method for identifying production drops, compared to the Wood model, typically used as an analytical model of the lactation curve, a Robust Statistics technique was used. Its application on some sample lactations and the results obtained allow us to be confident about the use of this method in the future.
Resumo:
This thesis investigates the legal, ethical, technical, and psychological issues of general data processing and artificial intelligence practices and the explainability of AI systems. It consists of two main parts. In the initial section, we provide a comprehensive overview of the big data processing ecosystem and the main challenges we face today. We then evaluate the GDPR’s data privacy framework in the European Union. The Trustworthy AI Framework proposed by the EU’s High-Level Expert Group on AI (AI HLEG) is examined in detail. The ethical principles for the foundation and realization of Trustworthy AI are analyzed along with the assessment list prepared by the AI HLEG. Then, we list the main big data challenges the European researchers and institutions identified and provide a literature review on the technical and organizational measures to address these challenges. A quantitative analysis is conducted on the identified big data challenges and the measures to address them, which leads to practical recommendations for better data processing and AI practices in the EU. In the subsequent part, we concentrate on the explainability of AI systems. We clarify the terminology and list the goals aimed at the explainability of AI systems. We identify the reasons for the explainability-accuracy trade-off and how we can address it. We conduct a comparative cognitive analysis between human reasoning and machine-generated explanations with the aim of understanding how explainable AI can contribute to human reasoning. We then focus on the technical and legal responses to remedy the explainability problem. In this part, GDPR’s right to explanation framework and safeguards are analyzed in-depth with their contribution to the realization of Trustworthy AI. Then, we analyze the explanation techniques applicable at different stages of machine learning and propose several recommendations in chronological order to develop GDPR-compliant and Trustworthy XAI systems.
Resumo:
Aim of the present study was to develop a statistical approach to define the best cut-off Copy number alterations (CNAs) calling from genomic data provided by high throughput experiments, able to predict a specific clinical end-point (early relapse, 18 months) in the context of Multiple Myeloma (MM). 743 newly diagnosed MM patients with SNPs array-derived genomic and clinical data were included in the study. CNAs were called both by a conventional (classic, CL) and an outcome-oriented (OO) method, and Progression Free Survival (PFS) hazard ratios of CNAs called by the two approaches were compared. The OO approach successfully identified patients at higher risk of relapse and the univariate survival analysis showed stronger prognostic effects for OO-defined high-risk alterations, as compared to that defined by CL approach, statistically significant for 12 CNAs. Overall, 155/743 patients relapsed within 18 months from the therapy start. A small number of OO-defined CNAs were significantly recurrent in early-relapsed patients (ER-CNAs) - amp1q, amp2p, del2p, del12p, del17p, del19p -. Two groups of patients were identified either carrying or not ≥1 ER-CNAs (249 vs. 494, respectively), the first one with significantly shorter PFS and overall survivals (OS) (PFS HR 2.15, p<0001; OS HR 2.37, p<0.0001). The risk of relapse defined by the presence of ≥1 ER-CNAs was independent from those conferred both by R-IIS 3 (HR=1.51; p=0.01) and by low quality (< stable disease) clinical response (HR=2.59 p=0.004). Notably, the type of induction therapy was not descriptive, suggesting that ER is strongly related to patients’ baseline genomic architecture. In conclusion, the OO- approach employed allowed to define CNAs-specific dynamic clonality cut-offs, improving the CNAs calls’ accuracy to identify MM patients with the highest probability to ER. As being outcome-dependent, the OO-approach is dynamic and might be adjusted according to the selected outcome variable of interest.
Resumo:
The present Dissertation shows how recent statistical analysis tools and open datasets can be exploited to improve modelling accuracy in two distinct yet interconnected domains of flood hazard (FH) assessment. In the first Part, unsupervised artificial neural networks are employed as regional models for sub-daily rainfall extremes. The models aim to learn a robust relation to estimate locally the parameters of Gumbel distributions of extreme rainfall depths for any sub-daily duration (1-24h). The predictions depend on twenty morphoclimatic descriptors. A large study area in north-central Italy is adopted, where 2238 annual maximum series are available. Validation is performed over an independent set of 100 gauges. Our results show that multivariate ANNs may remarkably improve the estimation of percentiles relative to the benchmark approach from the literature, where Gumbel parameters depend on mean annual precipitation. Finally, we show that the very nature of the proposed ANN models makes them suitable for interpolating predicted sub-daily rainfall quantiles across space and time-aggregation intervals. In the second Part, decision trees are used to combine a selected blend of input geomorphic descriptors for predicting FH. Relative to existing DEM-based approaches, this method is innovative, as it relies on the combination of three characteristics: (1) simple multivariate models, (2) a set of exclusively DEM-based descriptors as input, and (3) an existing FH map as reference information. First, the methods are applied to northern Italy, represented with the MERIT DEM (∼90m resolution), and second, to the whole of Italy, represented with the EU-DEM (25m resolution). The results show that multivariate approaches may (a) significantly enhance flood-prone areas delineation relative to a selected univariate one, (b) provide accurate predictions of expected inundation depths, (c) produce encouraging results in extrapolation, (d) complete the information of imperfect reference maps, and (e) conveniently convert binary maps into continuous representation of FH.
Resumo:
Modern society is now facing significant difficulties in attempting to preserve its architectural heritage. Numerous challenges arise consequently when it comes to documentation, preservation and restoration. Fortunately, new perspectives on architectural heritage are emerging owing to the rapid development of digitalization. Therefore, this presents new challenges for architects, restorers and specialists. Additionally, this has changed the way they approach the study of existing heritage, changing from conventional 2D drawings in response to the increasing requirement for 3D representations. Recently, Building Information Modelling for historic buildings (HBIM) has escalated as an emerging trend to interconnect geometrical and informational data. Currently, the latest 3D geomatics techniques based on 3D laser scanners with enhanced photogrammetry along with the continuous improvement in the BIM industry allow for an enhanced 3D digital reconstruction of historical and existing buildings. This research study aimed to develop an integrated workflow for the 3D digital reconstruction of heritage buildings starting from a point cloud. The Pieve of San Michele in Acerboli’s Church in Santarcangelo Di Romagna (6th century) served as the test bed. The point cloud was utilized as an essential referential to model the BIM geometry using Autodesk Revit® 2022. To validate the accuracy of the model, Deviation Analysis Method was employed using CloudCompare software to determine the degree of deviation between the HBIM model and the point cloud. The acquired findings showed a very promising outcome in the average distance between the HBIM model and the point cloud. The conducted approach in this study demonstrated the viability of producing a precise BIM geometry from point clouds.
Resumo:
There is great interindividual variability in the response to GH therapy. Ascertaining genetic factors can improve the accuracy of growth response predictions. Suppressor of cytokine signaling (SOCS)-2 is an intracellular negative regulator of GH receptor (GHR) signaling. The objective of the study was to assess the influence of a SOCS2 polymorphism (rs3782415) and its interactive effect with GHR exon 3 and -202 A/C IGFBP3 (rs2854744) polymorphisms on adult height of patients treated with recombinant human GH (rhGH). Genotypes were correlated with adult height data of 65 Turner syndrome (TS) and 47 GH deficiency (GHD) patients treated with rhGH, by multiple linear regressions. Generalized multifactor dimensionality reduction was used to evaluate gene-gene interactions. Baseline clinical data were indistinguishable among patients with different genotypes. Adult height SD scores of patients with at least one SOCS2 single-nucleotide polymorphism rs3782415-C were 0.7 higher than those homozygous for the T allele (P < .001). SOCS2 (P = .003), GHR-exon 3 (P= .016) and -202 A/C IGFBP3 (P = .013) polymorphisms, together with clinical factors accounted for 58% of the variability in adult height and 82% of the total height SD score gain. Patients harboring any two negative genotypes in these three different loci (homozygosity for SOCS2 T allele; the GHR exon 3 full-length allele and/or the -202C-IGFBP3 allele) were more likely to achieve an adult height at the lower quartile (odds ratio of 13.3; 95% confidence interval of 3.2-54.2, P = .0001). The SOCS2 polymorphism (rs3782415) has an influence on the adult height of children with TS and GHD after long-term rhGH therapy. Polymorphisms located in GHR, IGFBP3, and SOCS2 loci have an influence on the growth outcomes of TS and GHD patients treated with rhGH. The use of these genetic markers could identify among rhGH-treated patients those who are genetically predisposed to have less favorable outcomes.
Resumo:
The purpose of this study was to assess the efficacy and reproducibility of the cytologic diagnosis of salivary gland tumors (SGTs) using fine-needle aspiration cytology (FNAC). The study aimed to determine diagnostic accuracy, sensitivity, and specificity and to evaluate the extent of interobserver agreement. We retrospectively evaluated SGTs from the files of the Division of Pathology at the Clinics Hospital of São Paulo and Piracicaba Dental School between 2000 and 2006. We performed cytohistologic correlation in 182 SGTs. The sensitivity, specificity, positive predictive value, negative predictive value, and diagnostic accuracy were 94%, 100%, 100%, 100%, and 99%, respectively. The interobserver cytologic reproducibility showed significant statistical concordance (P < .0001). FNAC is an effective tool for performing a reliable preoperative diagnosis in SGTs and shows high diagnostic accuracy and consistent interobserver reproducibility. Further FNAC studies analyzing large samples of malignant SGTs and reactive salivary lesions are needed to confirm their accuracy.
Resumo:
High-throughput screening of physical, genetic and chemical-genetic interactions brings important perspectives in the Systems Biology field, as the analysis of these interactions provides new insights into protein/gene function, cellular metabolic variations and the validation of therapeutic targets and drug design. However, such analysis depends on a pipeline connecting different tools that can automatically integrate data from diverse sources and result in a more comprehensive dataset that can be properly interpreted. We describe here the Integrated Interactome System (IIS), an integrative platform with a web-based interface for the annotation, analysis and visualization of the interaction profiles of proteins/genes, metabolites and drugs of interest. IIS works in four connected modules: (i) Submission module, which receives raw data derived from Sanger sequencing (e.g. two-hybrid system); (ii) Search module, which enables the user to search for the processed reads to be assembled into contigs/singlets, or for lists of proteins/genes, metabolites and drugs of interest, and add them to the project; (iii) Annotation module, which assigns annotations from several databases for the contigs/singlets or lists of proteins/genes, generating tables with automatic annotation that can be manually curated; and (iv) Interactome module, which maps the contigs/singlets or the uploaded lists to entries in our integrated database, building networks that gather novel identified interactions, protein and metabolite expression/concentration levels, subcellular localization and computed topological metrics, GO biological processes and KEGG pathways enrichment. This module generates a XGMML file that can be imported into Cytoscape or be visualized directly on the web. We have developed IIS by the integration of diverse databases following the need of appropriate tools for a systematic analysis of physical, genetic and chemical-genetic interactions. IIS was validated with yeast two-hybrid, proteomics and metabolomics datasets, but it is also extendable to other datasets. IIS is freely available online at: http://www.lge.ibi.unicamp.br/lnbio/IIS/.
Resumo:
Negative-ion mode electrospray ionization, ESI(-), with Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) was coupled to a Partial Least Squares (PLS) regression and variable selection methods to estimate the total acid number (TAN) of Brazilian crude oil samples. Generally, ESI(-)-FT-ICR mass spectra present a power of resolution of ca. 500,000 and a mass accuracy less than 1 ppm, producing a data matrix containing over 5700 variables per sample. These variables correspond to heteroatom-containing species detected as deprotonated molecules, [M - H](-) ions, which are identified primarily as naphthenic acids, phenols and carbazole analog species. The TAN values for all samples ranged from 0.06 to 3.61 mg of KOH g(-1). To facilitate the spectral interpretation, three methods of variable selection were studied: variable importance in the projection (VIP), interval partial least squares (iPLS) and elimination of uninformative variables (UVE). The UVE method seems to be more appropriate for selecting important variables, reducing the dimension of the variables to 183 and producing a root mean square error of prediction of 0.32 mg of KOH g(-1). By reducing the size of the data, it was possible to relate the selected variables with their corresponding molecular formulas, thus identifying the main chemical species responsible for the TAN values.
Resumo:
The article seeks to investigate patterns of performance and relationships between grip strength, gait speed and self-rated health, and investigate the relationships between them, considering the variables of gender, age and family income. This was conducted in a probabilistic sample of community-dwelling elderly aged 65 and over, members of a population study on frailty. A total of 689 elderly people without cognitive deficit suggestive of dementia underwent tests of gait speed and grip strength. Comparisons between groups were based on low, medium and high speed and strength. Self-related health was assessed using a 5-point scale. The males and the younger elderly individuals scored significantly higher on grip strength and gait speed than the female and oldest did; the richest scored higher than the poorest on grip strength and gait speed; females and men aged over 80 had weaker grip strength and lower gait speed; slow gait speed and low income arose as risk factors for a worse health evaluation. Lower muscular strength affects the self-rated assessment of health because it results in a reduction in functional capacity, especially in the presence of poverty and a lack of compensatory factors.
Resumo:
Obstructive sleep apnea syndrome has a high prevalence among adults. Cephalometric variables can be a valuable method for evaluating patients with this syndrome. To correlate cephalometric data with the apnea-hypopnea sleep index. We performed a retrospective and cross-sectional study that analyzed the cephalometric data of patients followed in the Sleep Disorders Outpatient Clinic of the Discipline of Otorhinolaryngology of a university hospital, from June 2007 to May 2012. Ninety-six patients were included, 45 men, and 51 women, with a mean age of 50.3 years. A total of 11 patients had snoring, 20 had mild apnea, 26 had moderate apnea, and 39 had severe apnea. The distance from the hyoid bone to the mandibular plane was the only variable that showed a statistically significant correlation with the apnea-hypopnea index. Cephalometric variables are useful tools for the understanding of obstructive sleep apnea syndrome. The distance from the hyoid bone to the mandibular plane showed a statistically significant correlation with the apnea-hypopnea index.
Resumo:
In acquired immunodeficiency syndrome (AIDS) studies it is quite common to observe viral load measurements collected irregularly over time. Moreover, these measurements can be subjected to some upper and/or lower detection limits depending on the quantification assays. A complication arises when these continuous repeated measures have a heavy-tailed behavior. For such data structures, we propose a robust structure for a censored linear model based on the multivariate Student's t-distribution. To compensate for the autocorrelation existing among irregularly observed measures, a damped exponential correlation structure is employed. An efficient expectation maximization type algorithm is developed for computing the maximum likelihood estimates, obtaining as a by-product the standard errors of the fixed effects and the log-likelihood function. The proposed algorithm uses closed-form expressions at the E-step that rely on formulas for the mean and variance of a truncated multivariate Student's t-distribution. The methodology is illustrated through an application to an Human Immunodeficiency Virus-AIDS (HIV-AIDS) study and several simulation studies.
Resumo:
To assess the completeness and reliability of the Information System on Live Births (Sinasc) data. A cross-sectional analysis of the reliability and completeness of Sinasc's data was performed using a sample of Live Birth Certificate (LBC) from 2009, related to births from Campinas, Southeast Brazil. For data analysis, hospitals were grouped according to category of service (Unified National Health System, private or both), 600 LBCs were randomly selected and the data were collected in LBC-copies through mothers and newborns' hospital records and by telephone interviews. The completeness of LBCs was evaluated, calculating the percentage of blank fields, and the LBCs agreement comparing the originals with the copies was evaluated by Kappa and intraclass correlation coefficients. The percentage of completeness of LBCs ranged from 99.8%-100%. For the most items, the agreement was excellent. However, the agreement was acceptable for marital status, maternal education and newborn infants' race/color, low for prenatal visits and presence of birth defects, and very low for the number of deceased children. The results showed that the municipality Sinasc is reliable for most of the studied variables. Investments in training of the professionals are suggested in an attempt to improve system capacity to support planning and implementation of health activities for the benefit of maternal and child population.
Resumo:
Often in biomedical research, we deal with continuous (clustered) proportion responses ranging between zero and one quantifying the disease status of the cluster units. Interestingly, the study population might also consist of relatively disease-free as well as highly diseased subjects, contributing to proportion values in the interval [0, 1]. Regression on a variety of parametric densities with support lying in (0, 1), such as beta regression, can assess important covariate effects. However, they are deemed inappropriate due to the presence of zeros and/or ones. To evade this, we introduce a class of general proportion density, and further augment the probabilities of zero and one to this general proportion density, controlling for the clustering. Our approach is Bayesian and presents a computationally convenient framework amenable to available freeware. Bayesian case-deletion influence diagnostics based on q-divergence measures are automatic from the Markov chain Monte Carlo output. The methodology is illustrated using both simulation studies and application to a real dataset from a clinical periodontology study.