836 resultados para CATEGORICAL-DATA ANALYSIS
Resumo:
This work belongs to the field of computational high-energy physics (HEP). The key methods used in this thesis work to meet the challenges raised by the Large Hadron Collider (LHC) era experiments are object-orientation with software engineering, Monte Carlo simulation, the computer technology of clusters, and artificial neural networks. The first aspect discussed is the development of hadronic cascade models, used for the accurate simulation of medium-energy hadron-nucleus reactions, up to 10 GeV. These models are typically needed in hadronic calorimeter studies and in the estimation of radiation backgrounds. Various applications outside HEP include the medical field (such as hadron treatment simulations), space science (satellite shielding), and nuclear physics (spallation studies). Validation results are presented for several significant improvements released in Geant4 simulation tool, and the significance of the new models for computing in the Large Hadron Collider era is estimated. In particular, we estimate the ability of the Bertini cascade to simulate Compact Muon Solenoid (CMS) hadron calorimeter HCAL. LHC test beam activity has a tightly coupled cycle of simulation-to-data analysis. Typically, a Geant4 computer experiment is used to understand test beam measurements. Thus an another aspect of this thesis is a description of studies related to developing new CMS H2 test beam data analysis tools and performing data analysis on the basis of CMS Monte Carlo events. These events have been simulated in detail using Geant4 physics models, full CMS detector description, and event reconstruction. Using the ROOT data analysis framework we have developed an offline ANN-based approach to tag b-jets associated with heavy neutral Higgs particles, and we show that this kind of NN methodology can be successfully used to separate the Higgs signal from the background in the CMS experiment.
Resumo:
Accelerator mass spectrometry (AMS) is an ultrasensitive technique for measuring the concentration of a single isotope. The electric and magnetic fields of an electrostatic accelerator system are used to filter out other isotopes from the ion beam. The high velocity means that molecules can be destroyed and removed from the measurement background. As a result, concentrations down to one atom in 10^16 atoms are measurable. This thesis describes the construction of the new AMS system in the Accelerator Laboratory of the University of Helsinki. The system is described in detail along with the relevant ion optics. System performance and some of the 14C measurements done with the system are described. In a second part of the thesis, a novel statistical model for the analysis of AMS data is presented. Bayesian methods are used in order to make the best use of the available information. In the new model, instrumental drift is modelled with a continuous first-order autoregressive process. This enables rigorous normalization to standards measured at different times. The Poisson statistical nature of a 14C measurement is also taken into account properly, so that uncertainty estimates are much more stable. It is shown that, overall, the new model improves both the accuracy and the precision of AMS measurements. In particular, the results can be improved for samples with very low 14C concentrations or measured only a few times.
Resumo:
Aims: Develop and validate tools to estimate residual noise covariance in Planck frequency maps. Quantify signal error effects and compare different techniques to produce low-resolution maps. Methods: We derive analytical estimates of covariance of the residual noise contained in low-resolution maps produced using a number of map-making approaches. We test these analytical predictions using Monte Carlo simulations and their impact on angular power spectrum estimation. We use simulations to quantify the level of signal errors incurred in different resolution downgrading schemes considered in this work. Results: We find an excellent agreement between the optimal residual noise covariance matrices and Monte Carlo noise maps. For destriping map-makers, the extent of agreement is dictated by the knee frequency of the correlated noise component and the chosen baseline offset length. The significance of signal striping is shown to be insignificant when properly dealt with. In map resolution downgrading, we find that a carefully selected window function is required to reduce aliasing to the sub-percent level at multipoles, ell > 2Nside, where Nside is the HEALPix resolution parameter. We show that sufficient characterization of the residual noise is unavoidable if one is to draw reliable contraints on large scale anisotropy. Conclusions: We have described how to compute the low-resolution maps, with a controlled sky signal level, and a reliable estimate of covariance of the residual noise. We have also presented a method to smooth the residual noise covariance matrices to describe the noise correlations in smoothed, bandwidth limited maps.
Resumo:
Bangalore is experiencing unprecedented urbanisation in recent times due to concentrated developmental activities with impetus on IT (Information Technology) and BT (Biotechnology) sectors. The concentrated developmental activities has resulted in the increase in population and consequent pressure on infrastructure, natural resources, ultimately giving rise to a plethora of serious challenges such as urban flooding, climate change, etc. One of the perceived impact at local levels is the increase in sensible heat flux from the land surface to the atmosphere, which is also referred as heat island effect. In this communication, we report the changes in land surface temperature (LST) with respect to land cover changes during 1973 to 2007. A novel technique combining the information from sub-pixel class proportions with information from classified image (using signatures of the respective classes collected from the ground) has been used to achieve more reliable classification. The analysis showed positive correlation with the increase in paved surfaces and LST. 466% increase in paved surfaces (buildings, roads, etc.) has lead to the increase in LST by about 2 ºC during the last 2 decades, confirming urban heat island phenomenon. LSTs’ were relatively lower (~ 4 to 7 ºC) at land uses such as vegetation (parks/forests) and water bodies which act as heat sinks.
Resumo:
The rapid growth in the field of data mining has lead to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.
Resumo:
Outlier detection in high dimensional categorical data has been a problem of much interest due to the extensive use of qualitative features for describing the data across various application areas. Though there exist various established methods for dealing with the dimensionality aspect through feature selection on numerical data, the categorical domain is actively being explored. As outlier detection is generally considered as an unsupervised learning problem due to lack of knowledge about the nature of various types of outliers, the related feature selection task also needs to be handled in a similar manner. This motivates the need to develop an unsupervised feature selection algorithm for efficient detection of outliers in categorical data. Addressing this aspect, we propose a novel feature selection algorithm based on the mutual information measure and the entropy computation. The redundancy among the features is characterized using the mutual information measure for identifying a suitable feature subset with less redundancy. The performance of the proposed algorithm in comparison with the information gain based feature selection shows its effectiveness for outlier detection. The efficacy of the proposed algorithm is demonstrated on various high-dimensional benchmark data sets employing two existing outlier detection methods.
Resumo:
In this paper, we consider applying derived knowledge base regarding the sensitivity and specificity of damage(s) to be detected by an SHM system being designed and qualified. These efforts are necessary toward developing capabilities in SHM system to classify reliably various probable damages through sequence of monitoring, i.e., damage precursor identification, detection of damage and monitoring its progression. We consider the particular problem of visual and ultrasonic NDE based SHM system design requirements, where the damage detection sensitivity and specificity data definitions for a class of structural components are established. Methodologies for SHM system specification creation are discussed in details. Examples are shown to illustrate how the physics of damage detection scheme limits particular damage detection sensitivity and specificity and further how these information can be used in algorithms to combine various different NDE schemes in an SHM system to enhance efficiency and effectiveness. Statistical and data driven models to determine the sensitivity and probability of damage detection (POD) has been demonstrated for plate with varying one-sided line crack using optical and ultrasonic based inspection techniques.
Resumo:
Anthropogenic aerosols play a crucial role in our environment, climate, and health. Assessment of spatial and temporal variation in anthropogenic aerosols is essential to determine their impact. Aerosols are of natural and anthropogenic origin and together constitute a composite aerosol system. Information about either component needs elimination of the other from the composite aerosol system. In the present work we estimated the anthropogenic aerosol fraction (AF) over the Indian region following two different approaches and inter-compared the estimates. We espouse multi-satellite data analysis and model simulations (using the CHIMERE Chemical transport model) to derive natural aerosol distribution, which was subsequently used to estimate AF over the Indian subcontinent. These two approaches are significantly different from each other. Natural aerosol satellite-derived information was extracted in terms of optical depth while model simulations yielded mass concentration. Anthropogenic aerosol fraction distribution was studied over two periods in 2008: premonsoon (March-May) and winter (November-February) in regard to the known distinct seasonality in aerosol loading and type over the Indian region. Although both techniques have derived the same property, considerable differences were noted in temporal and spatial distribution. Satellite retrieval of AF showed maximum values during the pre-monsoon and summer months while lowest values were observed in winter. On the other hand, model simulations showed the highest concentration of AF in winter and the lowest during pre-monsoon and summer months. Both techniques provided an annual average AF of comparable magnitude (similar to 0.43 +/- 0.06 from the satellite and similar to 0.48 +/- 0.19 from the model). For winter months the model-estimated AF was similar to 0.62 +/- 0.09, significantly higher than that (0.39 +/- 0.05) estimated from the satellite, while during pre-monsoon months satellite-estimated AF was similar to 0.46 +/- 0.06 and the model simulation estimation similar to 0.53 +/- 0.14. Preliminary results from this work indicate that model-simulated results are nearer to the actual variation as compared to satellite estimation in view of general seasonal variation in aerosol concentrations.
Resumo:
169 p. : il. col.
Resumo:
This document provides a simple introduction to research methods and analysis tools for biologists or environmental scientists, with particular emphasis on fish biology in devleoping countries.
Resumo:
Self-organizing maps (SOM) have been recognized as a powerful tool in data exploratoration, especially for the tasks of clustering on high dimensional data. However, clustering on categorical data is still a challenge for SOM. This paper aims to extend standard SOM to handle feature values of categorical type. A batch SOM algorithm (NCSOM) is presented concerning the dissimilarity measure and update method of map evolution for both numeric and categorical features simultaneously.
Resumo:
DNA microarray, or DNA chip, is a technology that allows us to obtain the expression level of many genes in a single experiment. The fact that numerical expression values can be easily obtained gives us the possibility to use multiple statistical techniques of data analysis. In this project microarray data is obtained from Gene Expression Omnibus, the repository of National Center for Biotechnology Information (NCBI). Then, the noise is removed and data is normalized, also we use hypothesis tests to find the most relevant genes that may be involved in a disease and use machine learning methods like KNN, Random Forest or Kmeans. For performing the analysis we use Bioconductor, packages in R for the analysis of biological data, and we conduct a case study in Alzheimer disease. The complete code can be found in https://github.com/alberto-poncelas/ bioc-alzheimer
Resumo:
ENGLISH: A two-stage sampling design is used to estimate the variances of the numbers of yellowfin in different age groups caught in the eastern Pacific Ocean. For purse seiners, the primary sampling unit (n) is a brine well containing fish from a month-area stratum; the number of fish lengths (m) measured from each well are the secondary units. The fish cannot be selected at random from the wells because of practical limitations. The effects of different sampling methods and other factors on the reliability and precision of statistics derived from the length-frequency data were therefore examined. Modifications are recommended where necessary. Lengths of fish measured during the unloading of six test wells revealed two forms of inherent size stratification: 1) short-term disruptions of existing pattern of sizes, and 2) transition zones between long-term trends in sizes. To some degree, all wells exhibited cyclic changes in mean size and variance during unloading. In half of the wells, it was observed that size selection by the unloaders induced a change in mean size. As a result of stratification, the sequence of sizes removed from all wells was non-random, regardless of whether a well contained fish from a single set or from more than one set. The number of modal sizes in a well was not related to the number of sets. In an additional well composed of fish from several sets, an experiment on vertical mixing indicated that a representative sample of the contents may be restricted to the bottom half of the well. The contents of the test wells were used to generate 25 simulated wells and to compare the results of three sampling methods applied to them. The methods were: (1) random sampling (also used as a standard), (2) protracted sampling, in which the selection process was extended over a large portion of a well, and (3) measuring fish consecutively during removal from the well. Repeated sampling by each method and different combinations indicated that, because the principal source of size variation occurred among primary units, increasing n was the most effective way to reduce the variance estimates of both the age-group sizes and the total number of fish in the landings. Protracted sampling largely circumvented the effects of size stratification, and its performance was essentially comparable to that of random sampling. Sampling by this method is recommended. Consecutive-fish sampling produced more biased estimates with greater variances. Analysis of the 1988 length-frequency samples indicated that, for age groups that appear most frequently in the catch, a minimum sampling frequency of one primary unit in six for each month-area stratum would reduce the coefficients of variation (CV) of their size estimates to approximately 10 percent or less. Additional stratification of samples by set type, rather than month-area alone, further reduced the CV's of scarce age groups, such as the recruits, and potentially improved their accuracy. The CV's of recruitment estimates for completely-fished cohorts during the 198184 period were in the vicinity of 3 to 8 percent. Recruitment estimates and their variances were also relatively insensitive to changes in the individual quarterly catches and variances, respectively, of which they were composed. SPANISH: Se usa un diseño de muestreo de dos etapas para estimar las varianzas de los números de aletas amari11as en distintos grupos de edad capturados en el Océano Pacifico oriental. Para barcos cerqueros, la unidad primaria de muestreo (n) es una bodega de salmuera que contenía peces de un estrato de mes-área; el numero de ta11as de peces (m) medidas de cada bodega es la unidad secundaria. Limitaciones de carácter practico impiden la selección aleatoria de peces de las bodegas. Por 10 tanto, fueron examinados los efectos de distintos métodos de muestreo y otros factores sobre la confiabilidad y precisión de las estadísticas derivadas de los datos de frecuencia de ta11a. Se recomiendan modificaciones donde sean necesarias. Las ta11as de peces medidas durante la descarga de seis bodegas de prueba revelaron dos formas de estratificación inherente por ta11a: 1) perturbaciones a corto plazo en la pauta de ta11as existente, y 2) zonas de transición entre las tendencias a largo plazo en las ta11as. En cierto grado, todas las bodegas mostraron cambios cíclicos en ta11a media y varianza durante la descarga. En la mitad de las bodegas, se observo que selección por ta11a por los descargadores indujo un cambio en la ta11a media. Como resultado de la estratificación, la secuencia de ta11as sacadas de todas las bodegas no fue aleatoria, sin considerar si una bodega contenía peces de un solo lance 0 de mas de uno. El numero de ta11as modales en una bodega no estaba relacionado al numero de lances. En una bodega adicional compuesta de peces de varios lances, un experimento de mezcla vertical indico que una muestra representativa del contenido podría estar limitada a la mitad inferior de la bodega. Se uso el contenido de las bodegas de prueba para generar 25 bodegas simuladas y comparar los resultados de tres métodos de muestreo aplicados a estas. Los métodos fueron: (1) muestreo aleatorio (usado también como norma), (2) muestreo extendido, en el cual el proceso de selección fue extendido sobre una porción grande de una bodega, y (3) medición consecutiva de peces durante la descarga de la bodega. EI muestreo repetido con cada método y distintas combinaciones de n y m indico que, puesto que la fuente principal de variación de ta11a ocurría entre las unidades primarias, aumentar n fue la manera mas eficaz de reducir las estimaciones de la varianza de las ta11as de los grupos de edad y el numero total de peces en los desembarcos. El muestreo extendido evito mayormente los efectos de la estratificación por ta11a, y su desempeño fue esencialmente comparable a aquel del muestreo aleatorio. Se recomienda muestrear con este método. El muestreo de peces consecutivos produjo estimaciones mas sesgadas con mayores varianzas. Un análisis de las muestras de frecuencia de ta11a de 1988 indico que, para los grupos de edad que aparecen con mayor frecuencia en la captura, una frecuencia de muestreo minima de una unidad primaria de cada seis para cada estrato de mes-área reduciría los coeficientes de variación (CV) de las estimaciones de ta11a correspondientes a aproximadamente 10% 0 menos. Una estratificación adicional de las muestras por tipo de lance, y no solamente mes-área, redujo aun mas los CV de los grupos de edad escasos, tales como los reclutas, y mejoró potencialmente su precisión. Los CV de las estimaciones del reclutamiento para las cohortes completamente pescadas durante 1981-1984 fueron alrededor de 3-8%. Las estimaciones del reclutamiento y sus varianzas fueron también relativamente insensibles a cambios en las capturas de trimestres individuales y las varianzas, respectivamente, de las cuales fueron derivadas. (PDF contains 70 pages)