6 results for Large Data Sets
at Cochin University of Science
Abstract:
Decision trees are powerful tools for classification in data mining tasks that involve different types of attributes. Numeric data sets are usually first converted to categorical types and then classified using information gain. Information gain is a popular and useful measure that indicates whether splitting on a given attribute yields any benefit in terms of information content. However, this process is computationally intensive for large data sets. In addition, popular decision tree algorithms such as ID3 cannot handle numeric data sets. This paper proposes statistical variance as an alternative to information gain, together with the statistical mean as the split point, for attributes in completely numerical data sets. The new algorithm is shown to be competitive with its information gain counterpart C4.5 and with many existing decision tree algorithms on the standard UCI benchmark data sets, using the ANOVA test. The specific advantages of the proposed algorithm are that it avoids the computational overhead of information gain computation for large data sets with many attributes, and that it avoids converting huge numeric data sets to categorical data, which is also a time-consuming task. In summary, huge numeric data sets can be submitted directly to this algorithm without any attribute mappings or information gain computations. The work also blends two closely related fields, statistics and data mining.
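The abstract above does not spell out the exact split criterion, so the following is only a minimal sketch of one way a mean-and-variance split could work. It assumes that each numeric attribute is split at its mean and that the attribute giving the largest reduction in the variance of numerically encoded class labels is preferred. The function names `variance_reduction` and `best_attribute` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance


def variance_reduction(values, labels):
    """Split one numeric attribute at its mean.

    Returns (threshold, reduction), where reduction is the drop in
    population variance of the numeric labels after the split.
    Assumption: class labels are encoded as numbers (e.g. 0.0 / 1.0).
    """
    threshold = mean(values)
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        # The split puts everything on one side, so nothing is gained.
        return threshold, 0.0
    n = len(labels)
    weighted = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    return threshold, pvariance(labels) - weighted


def best_attribute(rows, labels):
    """Pick the column whose mean split reduces label variance the most.

    rows: list of equal-length tuples of numbers.
    Returns (column_index, threshold, reduction).
    """
    best = None
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        threshold, reduction = variance_reduction(values, labels)
        if best is None or reduction > best[2]:
            best = (col, threshold, reduction)
    return best
```

No sorting of candidate thresholds and no entropy computation is needed here, which is the kind of saving the abstract attributes to the approach; whether the paper uses this particular criterion is an assumption.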
Abstract:
The SST-convection relation over the tropical ocean and its impact on the South Asian monsoon form the first part of this thesis. Understanding the complicated relation between SST and convection is important for better prediction of the variability of the Indian monsoon on subseasonal, seasonal, interannual, and longer time scales. Improved global data sets from satellite scatterometer observations of SST, precipitation and refined reanalysis of global wind fields have made it possible to do a comprehensive study of the SST-convection relation. The interaction of the monsoon and the Indian Ocean is discussed. A coupled feedback process between SST and the Active-Break cycle of the Asian summer monsoon is a central theme of the thesis. The relation between SST and convection is very important in the numerical modeling of tropical rainfall. It is well known that models generally simulate rainfall well in areas of tropical convergence zones but do not simulate it satisfactorily in the monsoon areas. This study therefore critically examines the different mechanisms that generate deep convection over these two distinct regions. The study reported in chapter 3 shows that the SST-convection relation over the warm pool regions of the Indian and west Pacific oceans (monsoon areas) is such that convection increases with SST in the SST range 26-29 °C, and for SST higher than 29-30 °C convection decreases as SST increases (the so-called Waliser type). Convection is found to be induced in areas with SST gradients in the warm pool areas of the Indian and west Pacific oceans. Once deep convection is initiated to the south of the warmest region of the warm pool, the deep tropospheric heating by the latent heat released in the convective clouds produces strong low level wind fields (Low Level Jet, LLJ) on the equatorward side of the warm pool, and both the convection and the wind are found to grow through a positive feedback process.
Thus SST through its gradient acts only as an initiator of convection. The central region of the warm pool has very small SST gradients and large values of convection are associated with the cyclonic vorticity of the LLJ in the atmospheric boundary layer. The conditionally unstable atmosphere in the tropics is favorable for the production of deep convective clouds.
Abstract:
Computational biology is the research area that contributes to the analysis of biological data through the development of algorithms that address significant research problems. The data from molecular biology includes DNA, RNA, protein and gene expression data. Gene expression data provides the expression level of genes under different conditions. Gene expression is the process of transcribing the DNA sequence of a gene into mRNA sequences, which in turn are later translated into proteins. The number of copies of mRNA produced is called the expression level of a gene. Gene expression data is organized in the form of a matrix. Rows in the matrix represent genes and columns represent experimental conditions. Experimental conditions can be different tissue types or time points. Entries in the gene expression matrix are real values. Through the analysis of gene expression data it is possible to determine the behavioral patterns of genes, such as the similarity of their behavior, the nature of their interaction, their respective contribution to the same pathways and so on. Genes participating in the same biological process exhibit similar expression patterns. These patterns have immense relevance and application in bioinformatics and clinical research. In the medical domain they aid more accurate diagnosis, prognosis, treatment planning, drug discovery and protein network analysis. Data mining techniques are essential for identifying patterns in gene expression data. Clustering is an important data mining technique for the analysis of gene expression data. Biclustering was introduced to overcome the problems associated with clustering. Biclustering refers to the simultaneous clustering of both rows and columns of a data matrix.
Clustering is a global model, whereas biclustering is a local model. Discovering local expression patterns is essential for identifying many genetic pathways that are not otherwise apparent. It is therefore necessary to move beyond the clustering paradigm towards approaches capable of discovering local patterns in gene expression data. A bicluster is a submatrix of the gene expression data matrix. The rows and columns in the submatrix need not be contiguous in the gene expression data matrix. Biclusters are not disjoint. Computing biclusters is costly because all combinations of columns and rows must be considered in order to find all the biclusters. The search space for the biclustering problem is of size $2^{m+n}$, where $m$ and $n$ are the number of genes and conditions respectively. Usually $m+n$ is more than 3000. The biclustering problem is NP-hard. Biclustering is a powerful analytical tool for the biologist. The research reported in this thesis addresses the problem of biclustering. Ten algorithms are developed for the identification of coherent biclusters from gene expression data. All these algorithms use a measure called mean squared residue to search for biclusters. The objective is to identify biclusters of maximum size with a mean squared residue lower than a given threshold.
All these algorithms begin the search from tightly coregulated submatrices called seeds. The seeds are generated by the K-Means clustering algorithm. The algorithms developed can be classified as constraint-based, greedy and metaheuristic. The constraint-based algorithms use one or more constraints, namely the MSR threshold and the MSR difference threshold. The greedy approach makes a locally optimal choice at each stage with the objective of finding the global optimum. In the metaheuristic approaches, Particle Swarm Optimization (PSO) and variants of the Greedy Randomized Adaptive Search Procedure (GRASP) are used for the identification of biclusters. These algorithms are applied to the Yeast and Lymphoma data sets. All these algorithms identify biologically relevant and statistically significant biclusters, which are validated using the Gene Ontology database. The algorithms are compared with some other biclustering algorithms. The algorithms developed in this work overcome some of the problems associated with existing algorithms. Some of them identify biclusters with very high row variance, higher than that obtained by any other algorithm using mean squared residue, in both the Yeast and Lymphoma data sets. Such biclusters, which show significant change in expression level, are highly relevant biologically.
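The mean squared residue mentioned in this abstract can be illustrated with a short sketch. It assumes the commonly used definition, in which the residue of an entry is the entry minus its row mean, minus its column mean, plus the overall mean of the submatrix, and the MSR is the average of the squared residues. The abstract does not state the formula, so this definition and the function name `mean_squared_residue` are assumptions made for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by `rows` and `cols`.

    matrix: list of lists of numbers (genes x conditions).
    rows, cols: index lists; they need not be contiguous.
    Assumed definition: average over the submatrix of
    (a_ij - row_mean_i - col_mean_j + overall_mean) ** 2.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)
```

Under this definition, a submatrix whose rows differ only by a constant shift has a residue of zero, so a candidate bicluster would be accepted when the returned value falls below the chosen MSR threshold.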
Abstract:
Data mining is one of the most active research areas today, as it has a wide variety of applications in everyday life. It is about finding interesting hidden patterns in a large historical database. For example, from a sales database, data mining can reveal an interesting pattern such as “people who buy magazines tend to buy newspapers also”. From a sales point of view, the advantage is that these items can be placed together in the shop to increase sales. In this research work, data mining is applied to a domain called placement chance prediction, since making a wise career decision is crucial for everyone. In India, technical manpower analysis is carried out by an organization named the National Technical Manpower Information System (NTMIS), established in 1983-84 by India's Ministry of Education & Culture. The NTMIS comprises a lead centre in the IAMR, New Delhi, and 21 nodal centres located in different parts of the country. The Kerala State Nodal Centre is located at Cochin University of Science and Technology. The nodal centre collects placement information by regularly sending postal questionnaires to graduates. From the raw data available in the nodal centre, a historical database was prepared. Each record in this database includes entrance rank range, reservation, sector, sex, and a particular engineering branch. For each such combination of attributes in the historical database of student records, the corresponding placement chance is computed and stored in the database. From this data, various popular data mining models are built and tested. These models can be used to predict the most suitable branch for a new student with one of the above combinations of criteria.
A detailed performance comparison of the various data mining models is also carried out. This research work proposes the use of a combination of data mining models, namely a hybrid stacking ensemble, for better predictions. It also proposes a strategy to predict the overall absorption rate for various branches, as well as the time it takes for all the students of a particular branch to get placed. Finally, this research work puts forward a new data mining algorithm for numeric data sets, namely C 4.5 * stat, which is shown to have competitive accuracy on the standard UCI benchmark data sets. It also proposes an optimization strategy called parameter tuning to improve the standard C 4.5 algorithm. In summary, this research work covers all four dimensions of a typical data mining research work, namely application to a domain, development of classifier models, optimization and ensemble methods.
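The step in which a placement chance is computed for each combination of attributes might look roughly like the sketch below. It assumes that the placement chance is simply the fraction of students with a given attribute combination who were placed. The field names (`rank_range`, `branch`, `placed`) and the function name `placement_chances` are hypothetical and do not come from the thesis.

```python
from collections import defaultdict


def placement_chances(records, keys):
    """Fraction of placed students for each attribute combination.

    records: list of dicts; each must contain the fields named in `keys`
             plus a boolean "placed" field (hypothetical schema).
    keys: attribute names that define a combination.
    Returns a dict mapping the tuple of attribute values to a chance in [0, 1].
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [placed, total]
    for record in records:
        combination = tuple(record[key] for key in keys)
        if record["placed"]:
            counts[combination][0] += 1
        counts[combination][1] += 1
    return {combo: placed / total for combo, (placed, total) in counts.items()}
```

The resulting table of combinations and chances could then serve as the training data for the classifier models the abstract describes.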
Abstract:
Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. The term data mining refers to the process that performs exploratory analysis on the data and builds a model from it. To infer patterns from data, data mining involves different approaches such as association rule mining, classification or clustering. Among the many data mining techniques, clustering plays a major role, since it helps to group related data for assessing properties and drawing conclusions. Most clustering algorithms act on a data set with a uniform format, since the similarity or dissimilarity between data points is a significant factor in finding the clusters. If a data set consists of mixed attributes, i.e. a combination of numerical and categorical variables, a preferred approach is to convert the different formats into a uniform format. This research study explores various techniques for converting mixed data sets to a numerical equivalent, so that statistical and similar algorithms can be applied. The results of clustering mixed category data after conversion to a numeric data type are demonstrated using a crime data set. The thesis also proposes an extension to the well known algorithm for handling mixed data types, to deal with data sets having only categorical data. The proposed conversion has been validated on a breast cancer data set. Another issue with the clustering process is the visualization of the output. Geometric techniques such as scatter plots or projection plots are available, but none of them displays the result by projecting the whole database; instead they show attribute-pair-wise analysis.
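The abstract does not say which conversion techniques the thesis uses. As one common possibility, the sketch below converts mixed records into purely numeric vectors by min-max scaling the numeric attributes and one-hot encoding the categorical ones. Both choices, and the function name `encode_mixed`, are assumptions made for illustration only.

```python
def encode_mixed(records, numeric_keys, categorical_keys):
    """Convert mixed-type records into lists of floats.

    Numeric attributes are min-max scaled to [0, 1]; each categorical
    attribute is expanded into one 0/1 column per distinct value
    (one-hot encoding, with categories in sorted order).
    """
    ranges = {
        key: (min(r[key] for r in records), max(r[key] for r in records))
        for key in numeric_keys
    }
    categories = {
        key: sorted({r[key] for r in records}) for key in categorical_keys
    }
    vectors = []
    for record in records:
        vector = []
        for key in numeric_keys:
            low, high = ranges[key]
            # A constant column carries no information; map it to 0.0.
            vector.append((record[key] - low) / (high - low) if high > low else 0.0)
        for key in categorical_keys:
            vector.extend(1.0 if record[key] == c else 0.0 for c in categories[key])
        vectors.append(vector)
    return vectors
```

Once every record is a numeric vector of the same length, a distance-based clustering algorithm can treat all attributes uniformly, which is the motivation the abstract gives for the conversion.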
Abstract:
The study aimed to understand the microphytoplankton community composition and its variations in a highly complex and dynamic marine ecosystem, the northern Arabian Sea. The data generated provide first-of-its-kind knowledge of the major primary producers of the region. The microphytoplankton community structure appears to respond significantly to variations in hydrographic conditions during the winter monsoon period. Interannual variations were observed within the microphytoplankton community, associated with the variability in temperature patterns and the intensity of convective mixing. The changing bloom pattern and dominant species among the phytoplankton community open new frontiers for more intensive study of biological responses to physical processes. The production of large amounts of organic matter as a result of intense blooming of Noctiluca, as well as diatom aggregations, augments the particulate organic substances in this ecosystem. This influences the carbon dynamics of the northern Arabian Sea. Detailed investigations based on time series as well as trophodynamic studies are necessary to elucidate the carbon flux and associated impacts of winter-spring blooms in the NEAS. The Arabian Sea is considered one of the hotspots for carbon dynamics, and these pioneering records on the major primary producers support carbon-based export production studies and provide a platform for future research. Moreover, upcoming research based on satellite remote sensing of productivity patterns can use these in situ observations and taxonomic data sets of phytoplankton to validate the development and implementation of bloom-specific algorithms. Furthermore, the Saurashtra coast is considered a major fishing zone of the Indian EEZ. Studies on the phytoplankton in these regions provide valuable raw data for fishery prediction models and for identifying fishing zones.
With the baseline data obtained, further trophodynamic studies can be initiated in the complex, productive North Eastern Arabian Sea (NEAS) ecosystem, which still remains unexplored.