10 results for Imbalanced datasets
at Cochin University of Science
Abstract:
Decision trees are powerful tools for classification in data mining tasks that involve different types of attributes. Numeric data sets are usually converted to categorical types first and then classified using information gain. Information gain is a popular and useful concept that indicates whether splitting on a given attribute yields any benefit in terms of information content. However, this process is computationally intensive for large data sets. Moreover, popular decision tree algorithms such as ID3 cannot handle numeric data sets. This paper proposes statistical variance as an alternative to information gain, together with the statistical mean, to split attributes in completely numerical data sets. Using the ANOVA test on the standard UCI benchmark datasets, the new algorithm is shown to be competitive with its information gain counterpart C4.5 and with many existing decision tree algorithms. The specific advantages of the proposed algorithm are that it avoids the computational overhead of information gain computation for large data sets with many attributes, and that it avoids converting huge numeric data sets to categorical data, which is also a time-consuming task. In summary, huge numeric datasets can be submitted directly to this algorithm without any attribute mappings or information gain computations. It also blends two closely related fields, statistics and data mining.
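The abstract does not spell out how variance and mean are combined, so the following is only a minimal sketch of one possible reading: at a node, pick the numeric attribute with the largest population variance and use its mean as the split threshold. The function names `choose_split` and `partition` and the toy rows are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance


def choose_split(rows):
    """Return (attribute index, threshold) for one tree node.

    Assumption (not confirmed by the abstract): the attribute with the
    largest population variance is selected, and its mean is the threshold.
    """
    n_attrs = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_attrs)]
    best = max(range(n_attrs), key=lambda j: pvariance(columns[j]))
    return best, mean(columns[best])


def partition(rows, attr, threshold):
    """Split rows into those at or below the threshold and those above it."""
    left = [r for r in rows if r[attr] <= threshold]
    right = [r for r in rows if r[attr] > threshold]
    return left, right


# Toy numeric data: attribute 1 varies far more than attribute 0.
rows = [(1.0, 10.0), (1.2, 20.0), (0.9, 30.0), (1.1, 40.0)]
attr, threshold = choose_split(rows)
left, right = partition(rows, attr, threshold)
```

No discretization step and no entropy computation appear in this sketch, which is the saving the abstract describes. A full tree would apply the same two functions recursively to `left` and `right`.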
Abstract:
The present study helped to understand the trend in rainfall patterns at smaller spatial scales and the large regional differences in the variability of rainfall. The effect of land use and orography on the diurnal variability is also understood. However, a better understanding of the long-term variation in rainfall is possible by using a longer dataset, which may provide insight into the rainfall variation over the country during the past century. The basic mechanism behind the interannual rainfall variability could be understood through numerical studies using coupled ocean-atmosphere models. The regional difference in the active-break conditions points to the significance of regional studies rather than considering India as a single unit. The underlying dynamics of diurnal variability need to be studied using a high-resolution model, as the present study could not simulate the local onshore circulation. Also, the land use modification in this study selected a region that is surrounded by crop land. This implies a high possibility of the remaining region being converted to agricultural land. Therefore the study is more useful than one considering idealized conditions, but the adverse effect of irrigated crop is greater than that of non-irrigated crop. Such studies would therefore help to understand the climate changes that occurred in the recent period. A large accumulation of rainfall between 300 and 600 m height in the Western Ghats has been found, but the reason behind this needs to be studied, which is possible by utilizing datasets that better represent the orography and land use over the region in a high-resolution model. Similarly, a detailed analysis is needed to clearly identify the causative relations of the identified predictors with the predictand and the physical reasons behind them. New approaches that include nonlinear relationships and dynamical variables from model simulations can be included in the existing statistical models to improve their skill.
Also, the statistical models for monsoon forecasts have to be continually updated.
Abstract:
Computational biology is the research area that contributes to the analysis of biological data through the development of algorithms that address significant research problems. The data from molecular biology includes DNA, RNA, protein and gene expression data. Gene expression data provides the expression level of genes under different conditions. Gene expression is the process of transcribing the DNA sequence of a gene into mRNA sequences, which are later translated into proteins. The number of copies of mRNA produced is called the expression level of a gene. Gene expression data is organized in the form of a matrix. Rows in the matrix represent genes and columns represent experimental conditions. Experimental conditions can be different tissue types or time points. Entries in the gene expression matrix are real values. Through the analysis of gene expression data it is possible to determine the behavioral patterns of genes, such as the similarity of their behavior, the nature of their interaction, their respective contribution to the same pathways and so on. Genes participating in the same biological process exhibit similar expression patterns. These patterns have immense relevance and application in bioinformatics and clinical research. They are used in the medical domain to aid more accurate diagnosis, prognosis, treatment planning, drug discovery and protein network analysis. Data mining techniques are essential for identifying various patterns in gene expression data. Clustering is an important data mining technique for the analysis of gene expression data. Biclustering was introduced to overcome the problems associated with clustering. Biclustering refers to the simultaneous clustering of both rows and columns of a data matrix.
Clustering is a global model, whereas biclustering is a local model. Discovering local expression patterns is essential for identifying many genetic pathways that are not apparent otherwise. It is therefore necessary to move beyond the clustering paradigm towards approaches that are capable of discovering local patterns in gene expression data. A bicluster is a submatrix of the gene expression data matrix. The rows and columns in the submatrix need not be contiguous in the gene expression data matrix. Biclusters are not disjoint. Computing biclusters is costly because all combinations of columns and rows have to be considered in order to find all the biclusters. The search space for the biclustering problem is 2^(m+n), where m and n are the number of genes and conditions respectively. Usually m+n is more than 3000. The biclustering problem is NP-hard. Biclustering is a powerful analytical tool for the biologist. The research reported in this thesis addresses the problem of biclustering. Ten algorithms are developed for the identification of coherent biclusters from gene expression data. All these algorithms use a measure called mean squared residue (MSR) to search for biclusters. The objective is to identify the biclusters of maximum size with a mean squared residue lower than a given threshold.
All these algorithms begin the search from tightly coregulated submatrices called seeds. These seeds are generated by the K-Means clustering algorithm. The algorithms developed can be classified as constraint based, greedy and metaheuristic. Constraint based algorithms use one or more of the various constraints, namely the MSR threshold and the MSR difference threshold. The greedy approach makes a locally optimal choice at each stage with the objective of finding the global optimum. In the metaheuristic approaches, Particle Swarm Optimization (PSO) and variants of the Greedy Randomized Adaptive Search Procedure (GRASP) are used for the identification of biclusters. These algorithms are implemented on the Yeast and Lymphoma datasets. All these algorithms identify biologically relevant and statistically significant biclusters, which are validated using the Gene Ontology database. All these algorithms are compared with some other biclustering algorithms. The algorithms developed in this work overcome some of the problems associated with existing algorithms. Some of the algorithms developed in this work identify, from both the Yeast and Lymphoma data sets, biclusters with very high row variance, higher than that obtained by any other algorithm using mean squared residue. Such biclusters, which show a significant change in expression level, are highly relevant biologically.
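The mean squared residue that these algorithms rely on can be illustrated with a short sketch. It assumes the usual definition of the measure, in which each entry's residue is the entry minus its row mean minus its column mean plus the overall mean of the submatrix, and the score is the average squared residue. The function name and the toy matrices are illustrative only; the thesis's own implementation is not shown in the abstract.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows and cols.

    rows and cols are index lists and need not be contiguous.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_r)
        for j in range(n_c)
    )
    return total / (n_r * n_c)


# Rows that differ only by a constant shift form a perfectly coherent bicluster.
coherent = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
msr_coherent = mean_squared_residue(coherent, [0, 1, 2], [0, 1, 2])

# A checkerboard pattern has no additive coherence.
checker = [[1.0, 0.0], [0.0, 1.0]]
msr_checker = mean_squared_residue(checker, [0, 1], [0, 1])
```

A score near zero indicates a coherent bicluster, so a search procedure would accept a candidate submatrix only while this value stays below the chosen threshold.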
Abstract:
This thesis is entitled "Neuronal degeneration in streptozotocin induced diabetic rats: effect of Aegle marmelose and pyridoxine in pancreatic β cell proliferation and neuronal survival". Diabetes mellitus, a chronic metabolic disorder, results in neurological dysfunctions and structural changes in the CNS. Antioxidant therapy is a challenging but necessary dimension in the management of diabetes and the neurodegenerative changes associated with it. Our results showed regional variation and imbalance in the expression pattern of dopaminergic receptor subtypes in diabetes and their role in imbalanced insulin signaling and glucose regulation. Disrupted dopaminergic signaling and increased hyperglycemic stress in diabetes contributed to neuronal loss. Neuronal loss in diabetic rats was mediated through the expression pattern of GLUT-3, CREB, IGF-1, Akt-1, NF-κB, the second messengers cAMP, cGMP and IP3, and activation of the apoptotic factors TNF-α and caspase-8. Disrupted dopaminergic receptor expression and signaling in the pancreas contributed to defective insulin secretion in diabetes. Activation of the apoptotic factors TNF-α and caspase-8, defective functioning of neuronal survival factors and disrupted second messenger signaling modulated neuronal viability in diabetes. Hyperglycemic stress activated the expression of TNF-α, caspase-8 and BAX, and the differential expression of the antioxidant enzymes SOD and GPx in the liver led to apoptosis. Treatment of diabetic rats with insulin, Aegle marmelose and pyridoxine significantly reversed the altered dopaminergic neurotransmission, GLUT3, GLUT2, IGF-1 and second messenger signaling. The antihyperglycemic and antioxidant activity of Aegle marmelose and pyridoxine enhanced pancreatic β cell proliferation and increased insulin synthesis and secretion in diabetic rats. Our results thus indicate the neuroprotective and regenerating ability of Aegle marmelose and pyridoxine, which in turn have a novel therapeutic role in the management of diabetes.
Abstract:
In this paper, moving flock patterns are mined from spatio-temporal datasets by incorporating a clustering algorithm. A flock is defined as a set of data that move together for a certain continuous amount of time. Finding moving flock patterns using clustering algorithms is a potential method for finding frequent patterns of movement in large trajectory datasets. In this approach, SPatial clusteRing algoRithm thrOugh sWarm intelligence (SPARROW) is the clustering algorithm used. The advantage of the SPARROW algorithm is that it can effectively discover clusters of widely varying sizes and shapes in large databases. Variations of the proposed method are discussed, and the experimental results show that the problems of scalability and duplicate pattern formation are addressed. This method also reduces the number of patterns produced.
Abstract:
In this paper, we propose a handwritten character recognition system for the Malayalam language. The feature extraction phase consists of gradient and curvature calculation and dimensionality reduction using Principal Component Analysis. Directional information from the arc tangent of the gradient is used as the gradient feature. The strength of the gradient in the curvature direction is used as the curvature feature. The proposed system uses a combination of gradient and curvature features in reduced dimension as the feature vector. For classification, the discriminative power of the Support Vector Machine (SVM) is evaluated. The results reveal that the SVM with a Radial Basis Function (RBF) kernel yields the best performance, with accuracies of 96.28% and 97.96% on two different datasets. This is the highest accuracy ever reported on these datasets.
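As a rough illustration of the gradient feature described above, the sketch below computes a gradient direction per interior pixel as the arc tangent of the vertical over the horizontal intensity difference. The use of central differences is an assumption made here for brevity; the abstract does not say which gradient operator the authors used, and the function name `gradient_directions` is hypothetical.

```python
import math


def gradient_directions(image):
    """Gradient direction (radians) at each interior pixel of a 2D image.

    Assumption: gradients are estimated with central differences; the
    paper may use a different operator.
    """
    height, width = len(image), len(image[0])
    directions = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2
            gy = (image[y + 1][x] - image[y - 1][x]) / 2
            row.append(math.atan2(gy, gx))
        directions.append(row)
    return directions


# Intensity increasing left to right: the gradient points along +x.
horizontal_ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
# Intensity increasing top to bottom: the gradient points along +y.
vertical_ramp = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]

dir_h = gradient_directions(horizontal_ramp)
dir_v = gradient_directions(vertical_ramp)
```

In a full pipeline these directions would typically be quantized into a histogram per image zone before Principal Component Analysis and the SVM are applied; that step is omitted here.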
Abstract:
A spectral angle based feature extraction method, Spectral Clustering Independent Component Analysis (SC-ICA), is proposed in this work to improve brain tissue classification from Magnetic Resonance Images (MRI). SC-ICA gives equal priority to global and local features, and thereby tries to resolve the inefficiency of conventional approaches in abnormal tissue extraction. First, the input multispectral MRI is divided into different clusters by spectral distance based clustering. Then, Independent Component Analysis (ICA) is applied to the clustered data, in conjunction with Support Vector Machines (SVM), for brain tissue analysis. Normal and abnormal datasets, consisting of real and synthetic T1-weighted, T2-weighted and proton density/fluid-attenuated inversion recovery images, were used to evaluate the performance of the new method. Comparative analysis with ICA based SVM and other conventional classifiers established the stability and efficiency of SC-ICA based classification, especially in the reproduction of small abnormalities. Analysis of clinical abnormal cases demonstrated this through the highest Tanimoto index/accuracy values, 0.75/98.8%, compared with the ICA based SVM results, 0.17/96.1%, for reproduced lesions. The experimental results support the proposed method as a promising approach in clinical and pathological studies of brain diseases.
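For readers unfamiliar with the Tanimoto index quoted above, the sketch below computes it for two binary segmentation masks under the common definition: the size of the intersection divided by the size of the union. The abstract does not state the exact formula the authors used, so this definition is an assumption, and the masks are made-up toy data.

```python
def tanimoto_index(predicted, reference):
    """Intersection over union of two equally sized binary masks."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    union = sum(1 for p, r in zip(predicted, reference) if p or r)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


# Flattened toy masks: 1 marks a lesion pixel.
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
score = tanimoto_index(predicted, reference)
```

Because true negatives do not enter the ratio, the index can differ sharply between methods even when pixel accuracy is similar, which is consistent with the 0.75 versus 0.17 contrast reported alongside accuracies of 98.8% and 96.1%.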
Abstract:
mbikulam Tiger Reserve of Western Ghats using geospatial technology. The major objectives of the study are land use/land cover (LULC) mapping and phytodiversity analysis. Satellite data was used to map the land use/land cover using supervised classification techniques in ERDAS IMAGINE. The change over a period of 32 years was assessed using multi-temporal satellite datasets from Landsat MSS (1973), Landsat TM (1990), and IRS P6 LISS III (2005). A geospatial approach was used for the land cover analysis. Digital elevation models, satellite imagery and SOI topo sheets were the data sets used in the analysis. Vegetation sampling plots distributed over the different forest types were enumerated and studied for phytodiversity analysis.
Abstract:
This work presents an efficient method for volume rendering of glioma tumors from segmented 2D MRI datasets with user interactive control, replacing the manual segmentation required in state-of-the-art methods. The most common primary brain tumors are gliomas, evolving from the cerebral supportive cells. For clinical follow-up, the evaluation of the pre-operative tumor volume is essential. Tumor portions were automatically segmented from 2D MR images using morphological filtering techniques. These segmented tumor slices were propagated and modeled with the software package. The 3D modeled tumor consists of gray level values of the original image with the exact tumor boundary. Axial slices of FLAIR and T2 weighted images were used for extracting tumors. Volumetric assessment of tumor volume with manual segmentation of its outlines is a time-consuming process and is prone to error. These defects are overcome in this method. We verified the performance of the method on several sets of MRI scans. The 3D modeling was also done using segmented 2D slices with the help of a medical software package called 3D DOCTOR for verification purposes. The results were validated against the ground truth models by the radiologist.
Abstract:
This paper reports a novel region-based shape descriptor based on orthogonal Legendre moments. The preprocessing steps for invariance improvement of the proposed Improved Legendre Moment Descriptor (ILMD) are discussed. The performance of the ILMD is compared to the MPEG-7 approved region shape descriptor, the angular radial transformation descriptor (ARTD), and the widely used Zernike moment descriptor (ZMD). Set B of the MPEG-7 CE-1 contour database and all the datasets of the MPEG-7 CE-2 region database were used for experimental validation. The average normalized modified retrieval rate (ANMRR) and the precision-recall pair were employed for benchmarking the performance of the candidate descriptors. The ILMD has lower ANMRR values than the ARTD for most of the datasets, and the ARTD has a lower value than the ZMD. This indicates that the overall performance of the ILMD is better than that of the ARTD and ZMD. This result is confirmed by the precision-recall test, where the ILMD was found to have better precision rates for most of the datasets tested. Besides retrieval accuracy, the ILMD is more compact than the ARTD and ZMD. The proposed descriptor is useful as a generic shape descriptor for content-based image retrieval (CBIR) applications.