3 results for NUMERICAL DATA

at Cochin University of Science


Relevance:

60.00%

Abstract:

Decision trees are powerful tools for classification in data mining tasks that involve different types of attributes. Numeric data sets are usually converted to categorical types first and then classified using information gain. Information gain is a popular and useful concept that indicates whether splitting on a given attribute yields any benefit in terms of information content. However, this process is computationally intensive for large data sets, and popular decision tree algorithms such as ID3 cannot handle numeric data sets. This paper proposes statistical variance as an alternative to information gain, together with the statistical mean to split attributes, in completely numerical data sets. The new algorithm has been shown, using the ANOVA test on the standard UCI benchmark datasets, to be competitive with its information gain counterpart C4.5 and with many existing decision tree algorithms. Its specific advantages are that it avoids the computational overhead of information gain computation for large data sets with many attributes, and that it avoids converting huge numeric data sets to categorical data, which is also a time-consuming task. In summary, huge numeric data sets can be submitted directly to this algorithm without any attribute mappings or information gain computations. It also blends two closely related fields, statistics and data mining.
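The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way a mean-and-variance split could work, not the paper's method. It assumes that each numeric attribute is split at its mean and that the attribute whose split gives the lowest size-weighted variance of numeric class codes is preferred. The function names `weighted_variance` and `best_mean_split` are hypothetical, and treating class codes as numbers is only meaningful for binary or ordinal classes.

```python
from statistics import mean, pvariance


def weighted_variance(groups):
    """Size-weighted population variance of the label groups (assumed criterion)."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * pvariance(g) for g in groups)


def best_mean_split(rows, labels):
    """Pick the attribute whose mean-threshold split minimises label variance.

    rows   -- list of equal-length lists of numeric attribute values
    labels -- list of numeric class codes (assumption: binary or ordinal)
    Returns (attribute_index, threshold, score), or None if no split is possible.
    """
    best = None
    for j in range(len(rows[0])):
        # Assumption: the split point is simply the attribute's mean.
        threshold = mean(r[j] for r in rows)
        left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[j] > threshold]
        if not left or not right:
            continue  # this attribute does not separate the rows
        score = weighted_variance([left, right])
        if best is None or score < best[2]:
            best = (j, threshold, score)
    return best


# Toy data: attribute 0 separates the two classes, attribute 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]]
labels = [0, 0, 1, 1]
best = best_mean_split(rows, labels)
```

No entropy or logarithm is evaluated and no attribute is discretised, which is the kind of saving the abstract describes; applying the function recursively to the two partitions would grow a tree.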

Relevance:

30.00%

Abstract:

This study attempted to quantify the variations of the surface marine atmospheric boundary layer (MABL) parameters associated with tropical cyclone Gonu, which formed over the Arabian Sea during 30 May–7 June 2007 (just after the monsoon onset). These characteristics were evaluated in terms of surface wind, drag coefficient, wind stress, horizontal divergence, and frictional velocity using 0.5° × 0.5° resolution Quick Scatterometer (QuikSCAT) wind products. The variation of these surface boundary layer parameters was studied for three defined cyclone life stages: prior to the formation, during, and after the cyclone passage. Drastic variations of the MABL parameters were observed during the passage of the cyclone. The wind strength increased from 12 to 22 m s⁻¹ in association with the different stages of Gonu. Frictional velocity increased from 0.1–0.6 m s⁻¹ during the formative stage of the system to 0.3–1.4 m s⁻¹ during the mature stage. The drag coefficient varied from 1.5 × 10⁻³ to 2.5 × 10⁻³ during the occurrence of Gonu. Wind stress values varied from 0.4 to 1.1 N m⁻². Wind stress curl values varied from 10 × 10⁻⁷ to 45 × 10⁻⁷ N m⁻³. Generally, convergent winds prevailed, with the numerical value of divergence varying from 0 to −4 × 10⁻⁵ s⁻¹. Maximum variations of the wind parameters were found in the wall cloud region of the cyclone. The parameters returned to normally observed values within 1–3 days after the cyclone passage.
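The abstract does not say how wind stress and frictional velocity were derived from the scatterometer winds. As an illustration only, the sketch below uses the standard bulk aerodynamic relations, τ = ρ·C_d·U² and u* = √(τ/ρ) = √C_d·U. The air density of 1.2 kg m⁻³ and the function names are assumptions, not values or names taken from the study.

```python
import math

RHO_AIR = 1.2  # kg m^-3, assumed near-surface air density (not from the study)


def wind_stress(u10, cd, rho=RHO_AIR):
    """Bulk wind stress in N m^-2 from wind speed (m s^-1) and drag coefficient."""
    return rho * cd * u10 ** 2


def friction_velocity(u10, cd):
    """Friction velocity in m s^-1, u* = sqrt(tau / rho) = sqrt(cd) * u10."""
    return math.sqrt(cd) * u10


# Upper-end values quoted in the abstract: 22 m s^-1 wind, drag coefficient 2.5e-3.
u_star = friction_velocity(22.0, 2.5e-3)
tau = wind_stress(22.0, 2.5e-3)
```

With these inputs the friction velocity comes out near 1.1 m s⁻¹, inside the 0.3–1.4 m s⁻¹ range reported for the mature stage. The stress computed this way depends on the assumed density and on which wind speed and drag coefficient are paired, so it need not reproduce the reported stress range.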

Relevance:

30.00%

Abstract:

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The term data mining refers to the process that performs exploratory analysis on the data and builds a model of it. To infer patterns from data, data mining involves different approaches such as association rule mining, classification techniques, or clustering techniques. Among the many data mining techniques, clustering plays a major role, since it helps to group related data for assessing properties and drawing conclusions. Most clustering algorithms act on a dataset with a uniform format, since the similarity or dissimilarity between data points is a significant factor in finding the clusters. If a dataset consists of mixed attributes, i.e. a combination of numerical and categorical variables, a preferred approach is to convert the different formats into a uniform format. The research study explores various techniques for converting mixed data sets to a numerical equivalent, so that statistical and similar algorithms can be applied to them. The results of clustering mixed-category data after conversion to a numeric data type have been demonstrated using a crime data set. The thesis also proposes an extension to the well-known algorithm for handling mixed data types, to deal with data sets having only categorical data. The proposed conversion has been validated on a breast cancer data set. Another issue with the clustering process is the visualization of the output. Different geometric techniques such as scatter plots or projection plots are available, but none of them displays the result by projecting the whole database; instead, they demonstrate attribute-pair-wise analysis.
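The abstract does not name the conversion techniques the thesis evaluates. As one common example of turning mixed records into a uniform numeric format, the sketch below applies min-max scaling to numeric attributes and one-hot encoding to categorical ones. The function `to_numeric`, the field names, and the sample records are hypothetical and are not taken from the thesis or its crime data set.

```python
def to_numeric(records, numeric_keys, categorical_keys):
    """Convert mixed records to purely numeric rows.

    Numeric fields are min-max scaled to [0, 1]; each categorical field is
    expanded into one 0/1 column per observed level (one-hot encoding).
    """
    # Observed levels per categorical field, sorted for a stable column order.
    levels = {k: sorted({r[k] for r in records}) for k in categorical_keys}
    # Minimum and maximum per numeric field, used for scaling.
    bounds = {
        k: (min(r[k] for r in records), max(r[k] for r in records))
        for k in numeric_keys
    }
    rows = []
    for r in records:
        row = []
        for k in numeric_keys:
            lo, hi = bounds[k]
            row.append((r[k] - lo) / (hi - lo) if hi > lo else 0.0)
        for k in categorical_keys:
            row.extend(1.0 if r[k] == level else 0.0 for level in levels[k])
        rows.append(row)
    return rows


# Hypothetical mixed records: one numeric and one categorical attribute.
records = [
    {"age": 20, "type": "theft"},
    {"age": 40, "type": "fraud"},
    {"age": 30, "type": "theft"},
]
numeric_rows = to_numeric(records, ["age"], ["type"])
```

Each output row here has three columns (scaled age, then indicators for "fraud" and "theft"), so a distance-based clustering algorithm can treat all attributes uniformly. One-hot encoding assumes the categories are unordered, and other conversions may suit ordinal attributes better.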