Digital Library

23 results for CLUSTERING PROBLEM

in Repositório Científico do Instituto Politécnico de Lisboa - Portugal

Feature selection for clustering categorical data with an embedded modelling approach

Relevance: 30.00%

Abstract:

Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data drawn from official statistics shows its usefulness.

Clustering and selecting categorical features

Relevance: 30.00%

Abstract:

In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. Most methods proposed for this goal are focused on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data drawn from official statistics shows its usefulness.
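To make the modelling assumption concrete, the following is a minimal sketch of plain EM for a finite mixture of categorical (one-trial multinomial) distributions. It does not reproduce the paper's integrated feature selection; the function name `em_categorical_mixture`, the smoothing constant `alpha`, and the random soft initialisation are illustrative assumptions, not details taken from the abstract.

```python
import math
import random


def em_categorical_mixture(data, n_categories, k, n_iter=50, seed=0, alpha=1e-3):
    """Plain EM for a mixture of independent categorical features.

    data         -- list of rows; data[i][j] is the category index of feature j
    n_categories -- number of categories of each feature
    k            -- number of mixture components (fixed in advance here)
    alpha        -- small smoothing constant (an assumption) that keeps
                    every probability strictly positive
    """
    rng = random.Random(seed)
    n, d = len(data), len(n_categories)

    # Start from random soft assignments of points to components.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: mixing weights and per-component category probabilities.
        nk = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        weights = [(x + alpha) / (n + k * alpha) for x in nk]
        theta = []
        for c in range(k):
            per_feature = []
            for j in range(d):
                counts = [alpha] * n_categories[j]
                for i in range(n):
                    counts[data[i][j]] += resp[i][c]
                total = sum(counts)
                per_feature.append([x / total for x in counts])
            theta.append(per_feature)

        # E-step: posterior responsibilities, computed in log space.
        for i in range(n):
            logp = [
                math.log(weights[c])
                + sum(math.log(theta[c][j][data[i][j]]) for j in range(d))
                for c in range(k)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp[i] = [v / total for v in p]

    return weights, theta, resp
```

Calling `em_categorical_mixture(rows, [2, 2, 2], k=2)` on rows of three binary features returns the mixing weights, the per-component category probabilities, and the soft assignment of each row. A feature-selection variant would add a relevance parameter per feature on top of this skeleton.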

Probabilistic consensus clustering using evidence accumulation

Relevance: 30.00%

Abstract:

Clustering ensemble methods produce a consensus partition of a set of data points by combining the results of a collection of base clustering algorithms. In the evidence accumulation clustering (EAC) paradigm, the clustering ensemble is transformed into a pairwise co-association matrix, thus avoiding the label correspondence problem, which is intrinsic to other clustering ensemble schemes. In this paper, we propose a consensus clustering approach based on the EAC paradigm, which is not limited to crisp partitions and fully exploits the nature of the co-association matrix. Our solution determines probabilistic assignments of data points to clusters by minimizing a Bregman divergence between the observed co-association frequencies and the corresponding co-occurrence probabilities expressed as functions of the unknown assignments. We additionally propose an optimization algorithm to find a solution under any double-convex Bregman divergence. Experiments on both synthetic and real benchmark data show the effectiveness of the proposed approach.
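The co-association matrix at the heart of the EAC paradigm can be sketched in a few lines. The example below is illustrative only: the function name `co_association` and the toy ensemble are assumptions, and it covers only the construction of the matrix, not the Bregman-divergence optimisation proposed in the paper.

```python
def co_association(partitions):
    """Fraction of base partitions in which each pair of points shares a cluster.

    partitions -- list of label lists, one per base clustering; labels only
                  need to be consistent within a single partition.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same points")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Three base clusterings of four points. The first two are the same
# grouping with swapped label names, which the matrix treats identically.
ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
C = co_association(ensemble)
```

Because the matrix records only whether two points were grouped together, the arbitrary naming of clusters in each base partition never has to be reconciled, which is how the label correspondence problem is avoided.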

On the Problem of Balancing the DC Capacitor Voltage Divider in Back-to-Back Multilevel Converters

Relevance: 20.00%

Abstract:

This paper presents a new generalized solution for DC bus capacitor voltage balancing in back-to-back m-level diode-clamped multilevel converters connecting AC networks. The solution is based on the DC bus average power flow and exploits the switching configuration redundancies. The proposed balancing solution is particularized for the back-to-back multilevel structure with m=5 levels. This back-to-back converter is studied under bidirectional power flow, connecting an induction machine to the power grid.

Formaldehyde in indoor air: a public health problem?

Relevance: 20.00%

Abstract:

Formaldehyde was the first air pollutant to emerge, as early as the 1970s, as a specifically non-industrial indoor air quality problem. It has remained an indoor air quality issue, and the formaldehyde level in residential indoor air is among the highest of any indoor air contaminant. Formaldehyde concentrations in four different indoor settings (schools, office buildings, new dwellings and occupied dwellings) in Portugal were measured using photoionization detection (PID) equipment (11.7 eV lamps). All the settings presented results higher than the reference value proposed by Portuguese legislation. Furthermore, occupied dwellings showed 3 units with results above the reference. We conclude that formaldehyde is present in the monitored indoor settings at concentrations above the Portuguese reference value for indoor settings, which may indicate health problems for occupants.

The problem of estimating the volatility of zero coupon bond interest rate

Relevance: 20.00%

Abstract:

The financial literature and the financial industry often use zero coupon yield curves as input for testing hypotheses, pricing assets or managing risk, and assume that the data provided are accurate. We analyse the implications of the methodology and of the sample selection criteria used to estimate the zero coupon bond yield term structure on the resulting volatility of spot rates with different maturities. We obtain the volatility term structure using historical volatilities and EGARCH volatilities. As input for these volatilities we consider our own spot rate estimates from GovPX bond data and three popular interest rate data sets: from the Federal Reserve Board, from the US Department of the Treasury (H15), and from Bloomberg. We find strong evidence that the resulting zero coupon bond yield volatility estimates, as well as the correlation coefficients among spot and forward rates, depend significantly on the data set. We observe relevant differences in economic terms when volatilities are used to price derivatives.
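As a rough illustration of what a historical volatility estimate involves, the sketch below annualises the sample standard deviation of period-to-period changes in a spot rate series. The abstract does not state the exact estimator, so the use of absolute (rather than log) changes, the 252-period year, and the name `historical_volatility` are all assumptions.

```python
import math
import statistics


def historical_volatility(rates, periods_per_year=252):
    """Annualised sample standard deviation of changes in a rate series.

    rates            -- spot rates for one maturity, in time order
                        (at least three observations)
    periods_per_year -- observations per year; 252 assumes daily data
    """
    changes = [later - earlier for earlier, later in zip(rates, rates[1:])]
    return statistics.stdev(changes) * math.sqrt(periods_per_year)
```

Applying the same function to series for the same maturity taken from different data sources gives a simple way to see how far the resulting volatility estimates can diverge.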

Fungal contamination of poultries litter: a public health problem

Relevance: 20.00%

Abstract:

Exposure to certain fungi can cause human illness. Fungi cause adverse human health effects through three specific mechanisms: generation of a harmful immune response (e.g., allergy or hypersensitivity pneumonitis); direct infection by the fungal organism; and toxic-irritant effects from mold byproducts, such as mycotoxins. In Portugal there is a growing industry of large facilities that produce whole chickens for domestic consumption, and only a few investigations have reported on fungal contamination of poultry litter. The material used for poultry litter varies but normally consists of pine shavings, eucalyptus sawdust, other types of wood, peanut, coffee, sugar cane, straw, hay, grass, or processed paper. Litter is one of the main contributors to fungal contamination in poultry units. Spreading litter is one of the tasks that normally involve higher exposure of poultry workers to dust, fungi and their metabolites, such as VOCs and mycotoxins. After being used and removed from poultry units, litter is ploughed into agricultural soils, a practice that is potentially dangerous for the soil environment as well as for humans and animals. The goal of this study was to characterize the fungal contamination of litter and to report the incidence of keratinophilic and toxigenic fungi.

Positive Solutions of the Dirichlet Problem for the One-dimensional Minkowski-Curvature Equation

Relevance: 20.00%

Abstract:

We discuss existence and multiplicity of positive solutions of the Dirichlet problem for the quasilinear ordinary differential equation $-\left(u'/\sqrt{1-u'^2}\right)' = f(t,u)$. Depending on the behaviour of f = f(t, s) near s = 0, we prove the existence of either one, or two, or three, or infinitely many positive solutions. In general, the positivity of f is not required. All results are obtained by reduction to an equivalent non-singular problem to which variational or topological methods apply in a classical fashion.

Fungal contamination of poultry litter: a public health problem

Relevance: 20.00%

Abstract:

Although numerous studies have been conducted on microbial contaminants associated with various stages related to poultry and meat products processing, only a few reported on fungal contamination of poultry litter. The goals of this study were to (1) characterize litter fungal contamination and (2) report the incidence of keratinophilic and toxigenic fungi presence. Seven fresh and 14 aged litter samples were collected from 7 poultry farms. In addition, 27 air samples of 25 litters were also collected through impaction method, and after laboratory processing and incubation of collected samples, quantitative colony-forming units (CFU/m3) and qualitative results were obtained. Twelve different fungal species were detected in fresh litter and Penicillium was the most frequent genus found (59.9%), followed by Alternaria (17.8%), Cladosporium (7.1%), and Aspergillus (5.7%). With respect to aged litter, 19 different fungal species were detected, with Penicillium sp. the most frequently isolated (42.3%), followed by Scopulariopsis sp. (38.3%), Trichosporon sp. (8.8%), and Aspergillus sp. (5.5%). A significant positive correlation was found between litter fungal contamination (CFU/g) and air fungal contamination (CFU/m3). Litter fungal quantification and species identification have important implications in the evaluation of potential adverse health risks to exposed workers and animals. Spreading of poultry litter in agricultural fields is a potential public health concern, since keratinophilic (Scopulariopsis and Fusarium genus) as well as toxigenic fungi (Aspergillus, Fusarium, and Penicillium genus) were isolated.

Fast growing fungi: a problem to be solved to achieve characterization of occupational exposure to fungi in cork industry

Relevance: 20.00%

Abstract:

Chrysonilia sitophila is a common mould in the cork industry and has been identified as a cause of IgE sensitization and occupational asthma. This fungal species has a fast growth rate that may inhibit the growth of other species, leading to underestimation when characterizing occupational fungal exposure. Aiming to ascertain occupational exposure to fungi in the cork industry, we analyzed papers from 2000 about the best air sampling method for quantifying and identifying all airborne culturable fungi, not only those with fast growth rates. The impaction method does not allow the collection of a representative air volume because, even with media that restrict colony growth, counting the colonies is very difficult in environments with a high fungal load, such as the cork industry. The impinger method, by contrast, permits the collection of a representative air volume, since the collected volume can be diluted. Besides culture methods, which allow fungal identification through macro- and micro-morphology, growth features, thermotolerance and ecological data, molecular biology can be applied with the impinger method to detect non-viable particles and potentially mycotoxin-producing strains, and mycotoxins can be detected with ELISA or HPLC. Selection of the best air sampling method in each setting is crucial to achieve characterization of occupational exposure to fungi. Information about the prevalent fungal species in each setting, and about the possible fungal load, is needed for a judicious selection.

EWASTEU Programme: proposals to minimise the problem of e‑waste

Relevance: 20.00%

Abstract:

Nowadays, much of the electrical and electronic equipment we buy becomes obsolete within a short period of time because of rapid technological advances in this field. Equipment such as computers, mobile phones, and small and large electrical and electronic appliances becomes electronic waste, and much of it is dumped with ordinary waste. To change this scenario, the European Union has published directives in this area with the aim of controlling the growth of electronic waste and reducing its impact. In this context, Yaşar University (Turkey) submitted a project (EWASTEU) to the European Union with the aim of providing an overview of what is happening to equipment that has become electronic waste and of presenting some proposals to minimise this problem. One of the main questions to be answered will be the adequacy of the European directives.

Pulmonary tuberculosis in Continental Portugal 2000-2010: temporal trends clustering as a tool for control evaluation

Relevance: 20.00%

Abstract:

Clustering analysis is a useful tool to detect and monitor disease patterns and, consequently, to contribute to effective population disease management. Portugal has the highest incidence of tuberculosis in the European Union (in 2012, 21.6 cases per 100,000 inhabitants), although it has been decreasing consistently. Two critical PTB (pulmonary tuberculosis) areas, the metropolitan Oporto and metropolitan Lisbon regions, were previously identified through spatial and space-time clustering for PTB incidence rate and risk factors. Identifying clusters of temporal trends can further inform policy makers about municipalities showing faster or slower improvement in TB control.

Categorical data clustering using a minimum message length criterion

Relevance: 20.00%

Abstract:

Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Boulton, 1968). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on the integration of model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number based on a set of pre-estimated candidate models. The performance of our approach is compared with the use of the Bayesian Information Criterion (BIC) (Schwarz, 1978) and Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The results obtained illustrate the capacity of the proposed algorithm to attain the true number of clusters while outperforming BIC and ICL in speed, which is especially relevant when dealing with large data sets.

Determining the number of clusters in categorical data

Relevance: 20.00%

Abstract:

Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the criteria mentioned above, is the speed of execution, which is especially relevant when dealing with large data sets.
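For contrast with the integrated approach, the conventional route is to fit one candidate model per number of clusters and then score each fit with an information criterion. The sketch below shows that scoring step for BIC, using the standard parameter count for a mixture of categorical distributions. The helper names (`free_parameters`, `bic`, `best_k`) are illustrative, and the lower-is-better sign convention is an assumption, since the abstract does not state one.

```python
import math


def free_parameters(k, n_categories):
    """Free parameters of a k-component mixture of categorical features.

    (k - 1) mixing weights, plus (c_j - 1) probabilities for each
    feature j in each of the k components.
    """
    return (k - 1) + k * sum(c - 1 for c in n_categories)


def bic(log_likelihood, k, n_categories, n_samples):
    """BIC in the lower-is-better form: -2 log L + p log n."""
    penalty = free_parameters(k, n_categories) * math.log(n_samples)
    return -2.0 * log_likelihood + penalty


def best_k(log_likelihood_by_k, n_categories, n_samples):
    """Pick the candidate k whose pre-fitted model has the lowest BIC."""
    return min(
        log_likelihood_by_k,
        key=lambda k: bic(log_likelihood_by_k[k], k, n_categories, n_samples),
    )
```

Each entry passed to `best_k` requires a separate, fully converged model fit, which is the cost that a single integrated algorithm avoids.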

Fuzzy clustering applied to a demand response model in a smart grid contingency scenario

Relevance: 20.00%

Abstract:

This paper focuses on the analysis of a demand response model in a smart grid context, considering a contingency scenario. A fuzzy clustering technique is applied to the developed demand response model and an analysis is performed for the contingency scenario. Model considerations and architecture are described. The developed demand response model aims to support consumers' decisions regarding their consumption needs and possible economic benefits.
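The abstract does not name the fuzzy clustering technique used. Assuming fuzzy c-means, the most common choice, the membership update for one-dimensional values (for example, consumer load levels) might look like the sketch below. The function name `fuzzy_memberships` and the fuzzifier default `m=2.0` are illustrative assumptions.

```python
def fuzzy_memberships(points, centers, m=2.0):
    """Fuzzy c-means membership degrees for 1-D points and fixed centres.

    Each point receives one membership per centre, and the memberships
    of a point sum to 1. m > 1 is the fuzzifier; larger values give
    softer assignments.
    """
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point coincides with a centre: assign it fully there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]
        memberships.append(row)
    return memberships
```

Unlike a crisp partition, these degrees let a consumer belong partly to several groups, which suits cases where consumption profiles do not separate cleanly.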
