17 results for Task Clustering
in Repositório Científico do Instituto Politécnico de Lisboa - Portugal
Abstract:
Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data from official statistics shows its usefulness.
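The feature-saliency mixture the abstract describes can be sketched compactly. The code below is a minimal illustration, not the authors' implementation: it runs EM on a mixture of independent categorical features, with a saliency weight per feature that mixes a cluster-specific distribution with a common, cluster-independent one (in the spirit of Law, Figueiredo and Jain's feature-saliency model). The synthetic data and all names are my own; the MML-based pruning step is not reproduced.

```python
# Minimal sketch: EM for a categorical mixture with per-feature saliency.
import numpy as np

rng = np.random.default_rng(0)

def em_feature_saliency(X, K, V, n_iter=150):
    """X: (N, D) integer-coded categorical data with values in 0..V-1."""
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                       # mixing weights
    rho = np.full(D, 0.5)                          # feature saliencies
    theta = rng.dirichlet(np.ones(V), (K, D))      # cluster-specific categoricals
    lam = rng.dirichlet(np.ones(V), D)             # common "irrelevant" categoricals
    cols = np.arange(D)
    for _ in range(n_iter):
        # E-step: mix relevant and irrelevant likelihoods per feature.
        rel = rho * theta[:, cols, X]              # (K, N, D)
        mix = rel + (1.0 - rho) * lam[cols, X]
        logr = np.log(pi)[:, None] + np.log(mix).sum(axis=2)
        logr -= logr.max(axis=0)
        r = np.exp(logr); r /= r.sum(axis=0)       # responsibilities, (K, N)
        u = rel / mix                              # P(feature relevant | x, cluster)
        # M-step.
        pi = r.sum(axis=1) / N
        w = r[:, :, None] * u                      # relevance-weighted responsibilities
        rho = w.sum(axis=(0, 1)) / N
        for v in range(V):
            hit = X == v
            theta[:, :, v] = (w * hit).sum(axis=1) + 1e-9
            lam[:, v] = ((r[:, :, None] * (1 - u)).sum(axis=0) * hit).sum(axis=0) + 1e-9
        theta /= theta.sum(axis=2, keepdims=True)
        lam /= lam.sum(axis=1, keepdims=True)
    return r.argmax(axis=0), rho

# Synthetic check: features 0-1 carry the clusters, features 2-3 are noise.
N, V = 400, 3
z = rng.integers(0, 2, N)
X = np.where(z[:, None] == 0, 0, 2) * np.ones((N, 2), dtype=int)
X = np.hstack([np.where(rng.random((N, 2)) < 0.8, X, rng.integers(0, V, (N, 2))),
               rng.integers(0, V, (N, 2))])
labels, rho = em_feature_saliency(X, K=2, V=V)
print("estimated saliencies:", np.round(rho, 2))   # high for features 0-1
```

Saliencies close to 1 mark the features the model treats as cluster-relevant, which is the behaviour the abstract's synthetic experiments test.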
Abstract:
In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels to guide the search for relevant features. Most methods proposed for this goal focus on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data from official statistics shows its usefulness.
Abstract:
This dissertation centres on the monitoring and analysis of an "Air Handling Unit" based on solar-assisted desiccant evaporative cooling technology. First, a set of concepts related to the topic is presented, in order to ease understanding of the thermodynamic phenomena involved. After this initial, consolidating overview, the current state of the art is reviewed, both in terms of technology and of existing installations. The next stage of the dissertation describes the LNEG installation and its operating principles in detail; this stage also includes a brief description of the equipment, components and measurement and control elements present in the system. The methodology rests on two main points: first, monitoring of the system through the "Agilent VEE Pro" software and analysis of the collected data with the help of the LNEG spreadsheet, developed especially for this system; second, the methodology imposed by "Task 38" of the International Energy Agency, within the Solar Heating and Cooling programme, which allows systems to be compared at an international level. Indeed, "Task 38" aims to develop this technology in terms of design, standardization and optimization of installations, to provide easy access to information and to promote the comparability of results. The overall performance of the system is quite positive; its good behaviour is illustrated in the graphical analysis carried out for the heating and cooling modes. User satisfaction is good, and the system demonstrates the capacity to maintain the thermal comfort of the conditioned spaces in accordance with the standards and regulations in force. The use of renewable energies today, such as solar energy, is more than an obligation; it is a duty.
Abstract:
Dust is a complex mixture of particles of organic and inorganic origin and different gases absorbed in aerosol droplets. In a poultry unit, it includes dried faecal matter and urine, skin flakes, ammonia, carbon dioxide, pollens, feed and litter particles, feathers, grain mites, fungal spores, bacteria, viruses and their constituents. Dust particles vary in size, and differentiation between particle size fractions is important in health studies in order to quantify penetration within the respiratory system. A descriptive study was developed to assess exposure to particles in a poultry unit during different operations, namely routine examination and floor turnover. Direct-reading equipment was used (Lighthouse, model 3016 IAQ). Particle measurement was performed for 5 different sizes (PM0.5; PM1.0; PM2.5; PM5.0; PM10). The chemical composition of the poultry litter was also determined by neutron activation analysis. Normally, the litter of poultry pavilions is turned over weekly, and it was during this operation that the highest exposure to particles was observed. In all the tasks considered, PM5.0 and PM10 were the size fractions with the highest concentration values; PM10 showed the highest values and PM0.5 the lowest. The chemical element with the highest concentration was Mg (5.7E6 mg.kg-1), followed by K (1.5E4 mg.kg-1), Ca (4.8E3 mg.kg-1), Na (1.7E3 mg.kg-1), Fe (2.1E2 mg.kg-1) and Zn (4.2E1 mg.kg-1). This high presence of particles in the respirable range (<5-7 μm) means that poultry dust particles can penetrate into the gas-exchange region of the lung. Larger particles (PM10) presented concentrations ranging from 5.3E5 to 3.0E6 mg/m3.
Abstract:
Cluster analysis is a useful tool to detect and monitor disease patterns and, consequently, to contribute to effective population disease management. Portugal has the highest incidence of tuberculosis in the European Union (in 2012, 21.6 cases per 100,000 inhabitants), although it has been decreasing consistently. Two critical PTB (Pulmonary Tuberculosis) areas, the metropolitan Oporto and metropolitan Lisbon regions, were previously identified through spatial and space-time clustering of PTB incidence rates and risk factors. Identifying clusters of temporal trends can further inform policy makers about which municipalities show faster or slower TB control improvement.
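The abstract does not specify how temporal trends are clustered, so the sketch below is only one plausible illustration under my own assumptions: each municipality's yearly incidence series is summarized by its mean level and fitted slope, and those summaries are clustered with k-means. All data are synthetic.

```python
# Illustrative trend clustering of municipal TB incidence series (synthetic).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
years = np.arange(2000, 2013)

# Three synthetic regimes: fast decliners, slow decliners, stagnant.
slopes_true = rng.choice([-1.5, -0.6, 0.0], size=60)
series = 30 + slopes_true[:, None] * (years - years[0]) \
         + rng.normal(0, 1.5, (60, years.size))

slopes = np.polyfit(years, series.T, deg=1)[0]        # one fitted slope per municipality
feats = np.column_stack([series.mean(axis=1), slopes])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
for c in range(3):
    print(f"cluster {c}: mean slope {slopes[labels == c].mean():+.2f} cases/100k/year")
```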
Abstract:
Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on the integration of model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The results illustrate the capacity of the proposed algorithm to attain the true number of clusters while outperforming BIC and ICL in speed, which is especially relevant when dealing with large data sets.
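The single-run selection idea can be illustrated with the component-annihilation EM of Figueiredo and Jain (2002), adapted here to a mixture of independent categorical features. This is a hedged sketch under my own synthetic data: the exact MML expression and the cited EM variant (Silvestre et al., 2008) are not reproduced; only the characteristic penalized weight update, which drives redundant components to zero, is shown.

```python
# Component-annihilation EM (Figueiredo-Jain style) for a categorical mixture.
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 3 true clusters over D=5 categorical features (V=4 values).
N, D, V, K_true = 600, 5, 4, 3
centers = rng.dirichlet(np.full(V, 0.3), (K_true, D))
z = rng.integers(0, K_true, N)
X = np.array([[rng.choice(V, p=centers[z[n], d]) for d in range(D)]
              for n in range(N)])

K = 10                                   # start deliberately overfitted
pi = np.full(K, 1.0 / K)
theta = rng.dirichlet(np.ones(V), (K, D))
half_M = D * (V - 1) / 2.0               # half the free parameters per component

for _ in range(200):
    logp = np.log(pi)[:, None] + np.log(theta[:, np.arange(D), X]).sum(axis=2)
    logp -= logp.max(axis=0)
    r = np.exp(logp); r /= r.sum(axis=0)           # responsibilities, (K, N)
    # MML-penalized weight update: weak components are annihilated.
    w = np.maximum(r.sum(axis=1) - half_M, 0.0)
    pi = w / w.sum()
    keep = pi > 0
    pi, r, theta = pi[keep], r[keep], theta[keep]
    K = pi.size
    for v in range(V):
        theta[:, :, v] = (r[:, :, None] * (X == v)).sum(axis=1) + 1e-9
    theta /= theta.sum(axis=2, keepdims=True)

print(f"surviving components: {K} (true: {K_true})")
```

Because selection happens inside the EM loop, only one run is needed, which is the speed advantage over fitting and scoring a grid of candidate models with BIC or ICL.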
Abstract:
This paper focuses on the analysis of a demand response model in a smart grid context, considering a contingency scenario. A fuzzy clustering technique is applied to the developed demand response model, and an analysis is performed for the contingency scenario. Model considerations and architecture are described. The developed demand response model aims to support consumers' decisions regarding their consumption needs and possible economic benefits.
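The abstract names a fuzzy clustering technique but not a specific one, so the sketch below illustrates standard fuzzy c-means (fuzzifier m = 2) on synthetic daily consumption profiles; the profiles and all names are assumptions, not the paper's model.

```python
# Fuzzy c-means on synthetic 24-hour load profiles.
import numpy as np

rng = np.random.default_rng(3)

# Morning-peak vs evening-peak consumers, 40 of each.
hours = np.arange(24)
base = np.stack([np.exp(-0.5 * ((hours - 8) / 2.0) ** 2),
                 np.exp(-0.5 * ((hours - 20) / 2.0) ** 2)])
X = np.repeat(base, 40, axis=0) + rng.normal(0, 0.05, (80, 24))

def fuzzy_cmeans(X, C, m=2.0, n_iter=100):
    U = rng.dirichlet(np.ones(C), X.shape[0])       # (N, C) soft memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))              # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

U, centers = fuzzy_cmeans(X, C=2)
print("membership of first consumer:", np.round(U[0], 3))
```

Soft memberships suit this setting: a consumer between the two profiles keeps partial weight in both clusters instead of being forced into one.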
Abstract:
Environmental monitoring has an important role in occupational exposure assessment. However, due to several factors, it is performed with insufficient frequency and normally does not provide the information necessary to choose the most adequate safety measures to avoid or control exposure. Identifying all the tasks performed in each workplace and conducting a task-based exposure assessment help to refine the exposure characterization and reduce assessment errors. A task-based assessment can also provide a better evaluation of exposure variability than assessing personal exposures through continuous 8-hour time-weighted average measurements. Health effects related to particle exposure have mainly been investigated with mass-measuring instruments or gravimetric analysis. More recently, however, some studies support that size distribution and particle number concentration may have advantages over particle mass concentration for assessing the health effects of airborne particles. Several exposure assessments were performed in different occupational settings (bakery, grill house, cork industry and horse stable), applying these two resources: task-based exposure assessment and particle number concentration by size. The results were interesting: the task-based approach permitted identification of the tasks with the highest exposure to the smallest particles (0.3 μm) in the different occupational settings. The data obtained allow a more concrete and effective risk assessment and the identification of priorities for safety investments.
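The bookkeeping the abstract argues for amounts to tagging each particle-counter reading with its task and ranking tasks per size fraction. The sketch below shows that structure; the settings come from the abstract, but the task names and all readings are invented for the example.

```python
# Task-based ranking of particle number concentrations per size fraction.
import pandas as pd

readings = pd.DataFrame({
    "setting": ["bakery"] * 4 + ["cork industry"] * 4,
    "task": ["weighing", "weighing", "oven unload", "oven unload",
             "sanding", "sanding", "packing", "packing"],
    "size_um": [0.3, 5.0, 0.3, 5.0, 0.3, 5.0, 0.3, 5.0],
    "particles_per_l": [9.1e4, 1.2e3, 2.3e4, 8.0e2, 1.6e5, 3.4e3, 1.1e4, 4.0e2],
})

# Mean concentration per task and size; sorting on the 0.3 um column ranks
# the tasks that drive exposure to the smallest, most respirable particles.
table = readings.pivot_table(index=["setting", "task"],
                             columns="size_um", values="particles_per_l")
print(table.sort_values(0.3, ascending=False))
```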
Abstract:
Biosignal analysis has become widespread, moving beyond its typical use in clinical settings. Electrocardiography (ECG) plays a central role in patient monitoring, as a diagnostic tool in today's medicine and as an emerging biometric trait. In this paper, we adopt a consensus clustering approach for the unsupervised analysis of ECG-based biometric records. This type of analysis highlights natural groups within the population under investigation, which can be correlated with ground truth information in order to gain more insight about the data. Preliminary results are promising, as meaningful clusters are extracted from the population under analysis. © 2014 EURASIP.
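A common consensus clustering recipe, sketched below under my own assumptions (synthetic stand-ins for ECG feature vectors, arbitrary ensemble sizes): many k-means runs vote into a co-association matrix, and a hierarchical cut of that matrix yields the consensus partition. The abstract does not state which consensus scheme was used.

```python
# Consensus clustering via a co-association matrix (evidence accumulation).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 5)), rng.normal(2, 0.3, (30, 5))])

N = X.shape[0]
coassoc = np.zeros((N, N))
for seed in range(30):                   # ensemble of base clusterings
    k = rng.integers(2, 6)
    labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X)
    coassoc += labels[:, None] == labels[None, :]
coassoc /= 30.0

dist = squareform(1.0 - coassoc, checks=False)   # co-association -> distance
consensus = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print("consensus cluster sizes:", np.bincount(consensus)[1:])
```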
Abstract:
Locomotor task characterization plays an important role in trying to improve the quality of life of a growing elderly population. This paper focuses on this matter by trying to characterize the locomotion of two population groups with different functional fitness levels (high or low) while executing three different tasks: gait, stair ascent and stair descent. Features were extracted from gait data, and feature selection methods were used in order to obtain the set of features that allows differentiation between functional fitness levels. Unsupervised learning was used to validate the sets obtained and, ultimately, indicated that it is possible to distinguish the two population groups. The sets of most discriminative features for each task are identified and thoroughly analysed. Copyright © 2014 SCITEPRESS - Science and Technology Publications. All rights reserved.
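The pipeline the abstract outlines can be sketched as follows, under assumptions of my own: synthetic placeholder features, mutual information as the (unspecified) selection method, and k-means plus the adjusted Rand index as the unsupervised validation step.

```python
# Feature selection followed by unsupervised validation of group separation.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(5)
fitness = np.repeat([0, 1], 50)                      # low / high functional fitness
informative = fitness[:, None] + rng.normal(0, 0.4, (100, 3))  # e.g. stride features
noise = rng.normal(0, 1, (100, 7))
X = np.hstack([informative, noise])

scores = mutual_info_classif(X, fitness, random_state=0)
top = np.argsort(scores)[::-1][:3]                   # keep the 3 best-ranked features

# Validation: cluster on the selected features only and compare with groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, top])
print("selected features:", top,
      "ARI vs fitness groups:", round(adjusted_rand_score(fitness, labels), 2))
```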
Abstract:
In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables which are not directly involved in clustering the data. An approach is proposed in the model-based clustering context to select a number of clusters which both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted to the data by the maximum likelihood methodology, excluding the external variables, which are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion. © 2014 Springer-Verlag Berlin Heidelberg.
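The abstract does not give the criterion in closed form; the sketch below is one plausible reading, and an assumption of mine rather than the paper's formula: score each number of clusters by the mixture log-likelihood plus the integrated (Dirichlet-multinomial, alpha = 1) likelihood of the external categorical variable given the induced partition, so the external variable only enters the selection step.

```python
# Hypothetical criterion: model fit + integrated external-variable likelihood.
import numpy as np
from scipy.special import gammaln
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
external = np.repeat([0, 1], 60)                 # external categorical variable

def integrated_external_loglik(z, u, n_cat):
    # log p(u | z) under a uniform Dirichlet prior on each cluster's categories.
    total = 0.0
    for k in np.unique(z):
        counts = np.bincount(u[z == k], minlength=n_cat)
        total += (gammaln(n_cat) - gammaln(counts.sum() + n_cat)
                  + gammaln(counts + 1).sum())
    return total

for K in range(1, 5):
    gm = GaussianMixture(K, random_state=0).fit(X)   # fitted without the external variable
    z = gm.predict(X)
    score = gm.score(X) * len(X) + integrated_external_loglik(z, external, 2)
    print(f"K={K}: criterion {score:.1f}")
```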
Abstract:
In the present paper, we focus on the performance of clustering algorithms using indices of paired agreement to measure the accordance between clusters and an a priori known structure. We specifically propose a method to correct all the indices considered for agreement by chance; the adjusted indices are meant to provide a realistic measure of clustering performance. The proposed method enables the correction of virtually any index, overcoming previous limitations known in the literature, and provides very precise results. We use simulated datasets under diverse scenarios and discuss the pertinence of our proposal, which is particularly relevant when poorly separated clusters are considered. Finally, we compare the performance of the EM and KMeans algorithms within each of the simulated scenarios and conclude that EM generally yields the best results.
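The generic correction-for-chance has the familiar form adjusted = (index - E[index]) / (max - E[index]); what makes it apply to virtually any index is estimating E[index] empirically, for example by permuting one labelling, rather than deriving it analytically per index. The sketch below shows that wrapper with the plain Rand index; the permutation scheme is an illustrative choice of mine, not necessarily the paper's estimator.

```python
# Chance-correction wrapper applicable to any paired-agreement index.
import numpy as np
from sklearn.metrics import rand_score

rng = np.random.default_rng(7)

def adjusted_for_chance(index_fn, a, b, n_perm=200, max_value=1.0):
    observed = index_fn(a, b)
    expected = np.mean([index_fn(a, rng.permutation(b)) for _ in range(n_perm)])
    return (observed - expected) / (max_value - expected)

truth = np.repeat([0, 1, 2], 30)
noisy = truth.copy()
noisy[rng.choice(90, 15, replace=False)] = rng.integers(0, 3, 15)
print("raw Rand:", round(rand_score(truth, noisy), 3),
      "chance-adjusted:", round(adjusted_for_chance(rand_score, truth, noisy), 3))
```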
Abstract:
Clustering ensemble methods produce a consensus partition of a set of data points by combining the results of a collection of base clustering algorithms. In the evidence accumulation clustering (EAC) paradigm, the clustering ensemble is transformed into a pairwise co-association matrix, thus avoiding the label correspondence problem, which is intrinsic to other clustering ensemble schemes. In this paper, we propose a consensus clustering approach based on the EAC paradigm, which is not limited to crisp partitions and fully exploits the nature of the co-association matrix. Our solution determines probabilistic assignments of data points to clusters by minimizing a Bregman divergence between the observed co-association frequencies and the corresponding co-occurrence probabilities expressed as functions of the unknown assignments. We additionally propose an optimization algorithm to find a solution under any double-convex Bregman divergence. Experiments on both synthetic and real benchmark data show the effectiveness of the proposed approach.
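A minimal sketch of the core idea: find probabilistic assignments Y (rows on the simplex) whose co-occurrence probabilities Y Yᵀ match the observed co-association matrix under the squared-Euclidean Bregman divergence, here via projected gradient descent. The paper's general double-convex case and its dedicated optimizer are not reproduced; the toy co-association matrix and step size are my own assumptions.

```python
# Probabilistic consensus from a co-association matrix (squared-Euclidean case).
import numpy as np

rng = np.random.default_rng(8)

def project_rows_to_simplex(V):
    # Euclidean projection of each row onto the probability simplex (Duchi et al.).
    U = np.sort(V, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, V.shape[1] + 1)
    rho = (U - css / idx > 0).sum(axis=1)
    theta = css[np.arange(V.shape[0]), rho - 1] / rho
    return np.maximum(V - theta[:, None], 0.0)

# Toy co-association matrix: two blocks plus noise.
z = np.repeat([0, 1], 20)
C = (z[:, None] == z[None, :]).astype(float)
C = np.clip(C + rng.normal(0, 0.1, C.shape), 0, 1)
C = (C + C.T) / 2

K, N = 2, C.shape[0]
Y = project_rows_to_simplex(rng.random((N, K)))
for _ in range(500):
    R = C - Y @ Y.T                       # residual co-occurrence
    np.fill_diagonal(R, 0.0)              # self-pairs carry no evidence
    Y = project_rows_to_simplex(Y + 0.01 * 4 * R @ Y)  # descent on the squared loss

print("hard consensus labels:", Y.argmax(axis=1))
```

Because Y holds probabilities rather than hard labels, borderline points keep fractional memberships, which is what "not limited to crisp partitions" refers to.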