987 results for Relevant features
Abstract:
Data mining can be defined as the extraction of implicit, previously unknown, and potentially useful information from data. Numerous researchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for Intrusion Detection and its modified versions, KDDCup99 and NSL-KDD, but until now no one has examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The classification learning algorithms compared in this thesis are C4.5, CART, k-NN and Naïve Bayes. The performance of these algorithms is compared in terms of accuracy, error rate and average cost on modified versions of the NSL-KDD train and test datasets, where the instances are classified into normal and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally, the most important features for detecting cyber-attacks, in all categories and in each category, are evaluated with Weka’s Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with the best performance on the dataset is the k-NN algorithm. The most important features for detecting cyber-attacks are basic features such as the number of seconds of a network connection, the protocol used for the connection, the network service used, the normal or error status of the connection and the number of data bytes sent. The most important features for detecting DoS, Probing and R2L attacks are basic features, and the least important are content features. For U2R attacks, in contrast, the content features are the most important.
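The ranking by Information Gain mentioned in this abstract can be illustrated with a minimal sketch. It is not the Weka implementation used in the thesis. The records, the feature names `protocol_type` and `flag`, and the helper names are assumptions made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_features(rows, labels):
    """Return (feature name, gain) pairs sorted from highest to lowest gain."""
    gains = {name: information_gain([r[name] for r in rows], labels) for name in rows[0]}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

# Toy, made-up connection records (hypothetical, not taken from NSL-KDD).
rows = [
    {"protocol_type": "tcp", "flag": "SF"},
    {"protocol_type": "tcp", "flag": "S0"},
    {"protocol_type": "udp", "flag": "SF"},
    {"protocol_type": "tcp", "flag": "S0"},
]
labels = ["normal", "dos", "normal", "dos"]
ranking = rank_features(rows, labels)
```

In this toy data the `flag` feature separates the two classes perfectly, so it is ranked above `protocol_type`.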
Abstract:
In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. Most methods proposed for this goal focus on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data from official statistics shows its usefulness.
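As a rough illustration of the modelling assumption in this abstract, the sketch below fits a finite mixture of independent categorical features with plain EM. It deliberately omits the feature-selection step that the authors integrate into their algorithm. The function name, the smoothing constants and the toy data are all assumptions.

```python
import math
import random

def em_categorical_mixture(data, n_components, n_categories, n_iter=50, seed=0):
    """Plain EM for a mixture of independent categorical features.

    data: list of rows, each a list of integer category codes (one per feature).
    n_categories: number of categories of each feature.
    Returns (weights, theta, resp), where theta[k][j][c] estimates
    P(feature j = c | component k) and resp[i][k] is the responsibility
    of component k for row i.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Start from random soft assignments.
    resp = []
    for _ in range(n):
        r = [rng.random() + 1e-3 for _ in range(n_components)]
        s = sum(r)
        resp.append([v / s for v in r])
    for _ in range(n_iter):
        # M-step: mixing weights and per-component category probabilities.
        nk = [sum(resp[i][k] for i in range(n)) for k in range(n_components)]
        # Tiny smoothing keeps every weight strictly positive.
        weights = [(v + 1e-6) / (n + n_components * 1e-6) for v in nk]
        theta = []
        for k in range(n_components):
            per_feature = []
            for j in range(d):
                counts = [1e-6] * n_categories[j]  # smoothing avoids log(0)
                for i in range(n):
                    counts[data[i][j]] += resp[i][k]
                total = sum(counts)
                per_feature.append([c / total for c in counts])
            theta.append(per_feature)
        # E-step: posterior responsibilities, computed in log space.
        for i in range(n):
            logp = [
                math.log(weights[k])
                + sum(math.log(theta[k][j][data[i][j]]) for j in range(d))
                for k in range(n_components)
            ]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            s = sum(p)
            resp[i] = [v / s for v in p]
    return weights, theta, resp

# Toy data: two clearly separated groups over three binary features.
data = [[0, 0, 0]] * 10 + [[1, 1, 1]] * 10
weights, theta, resp = em_categorical_mixture(data, 2, [2, 2, 2])
hard = [max(range(2), key=lambda k: r[k]) for r in resp]
```

Which component index ends up describing which group depends on the random initialization, so only the grouping itself is meaningful.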
Abstract:
This article presents a quantitative and objective approach to cat ganglion cell characterization and classification. Several biologically relevant features, such as diameter, eccentricity, fractal dimension, influence histogram, influence area, convex hull area, and convex hull diameter, are derived from geometrical transforms and then processed by three different clustering methods (Ward's hierarchical scheme, K-means and a genetic algorithm), whose results are then combined by a voting strategy. These experiments indicate the superiority of some features and also suggest some possible biological implications.
Abstract:
Background: One of the least common types of alternative splicing is the complete retention of an intron in a mature transcript. Intron retention (IR) is believed to be the result of intron, rather than exon, definition associated with failure of the recognition of weak splice sites flanking short introns. Although studies on individual retained introns have been published, few systematic surveys of large amounts of data have been conducted on the mechanisms that lead to IR. Results: To understand how sequence features are associated with or control IR, and to produce a generalized model that could reveal previously unknown signals that regulate this type of alternative splicing, we partitioned intron retention events observed in human cDNAs into two groups based on the relative abundance of both isoforms and compared relevant features. We found that a higher frequency of IR in human is associated with individual introns that have weaker splice sites, genes with shorter intron lengths, higher expression levels and lower density of both a set of exon splicing silencers (ESSs) and the intronic splicing enhancer GGG. Both groups of retained introns presented events conserved in mouse, in which the retained introns were also short and presented weaker splice sites. Conclusion: Although our results confirmed that weaker splice sites are associated with IR, they showed that this feature alone cannot explain a non-negligible fraction of events. Our analysis suggests that cis-regulatory elements are likely to play a crucial role in regulating IR and also reveals previously unknown features that seem to influence its occurrence. These results highlight the importance of considering the interplay among these features in the regulation of the relative frequency of IR.
Abstract:
A new concept and a preliminary study for a monocolumn floating unit are introduced, aimed at exploring and producing oil in ultradeep waters. This platform, which combines two relevant features (large oil storage capacity and dry tree production capability), comprises two bodies with relatively independent heave motions. A parametric model is used to define the main design characteristics of the floating units, and a set of design alternatives is generated using this procedure. These solutions are evaluated in terms of stability requirements and dynamic response. A mathematical model is developed to estimate the first-order heave and pitch motions of the platform, and experimental tests are carried out to calibrate this model. The response of each body alone is estimated numerically using the WAMIT® code. This paper also includes a preliminary study on the platform mooring system and appendages. The study of the heave plates presents the reduction in motions achieved by introducing the appropriate appendages to the platform. [DOI: 10.1115/1.4001429]
Abstract:
This paper examines the article system in interlanguage grammar, focusing on Japanese learners of English, whose native language lacks articles. It will be demonstrated that for the acquisition of the English article system, count/mass distinctions and definiteness are the crucial factors. Although Japanese does not employ an article system to encode these aspects, it will be argued that they are nevertheless syntactically encoded through its classifier system. Hence, the problem for these learners must be to map these features onto the appropriate surface forms, as the Missing Surface Inflection Hypothesis predicts (Prévost & White 2000). This suggestion will further be supported empirically by a fill-in-the-article task. It will be concluded that these Japanese learners understand the English article system fairly well, possibly due to their native language, yet have problems with realizing the relevant features (i.e. count/mass distinctions and definiteness) in the target language.
Abstract:
The idea that within the bulk of leukemic cells there are immature progenitors which are intrinsically resistant to chemotherapy and able to repopulate the tumor after treatment is not recent. Nevertheless, the term leukemia stem cells (LSCs) has been adopted recently to describe these immature progenitors, based on the fact that they share the most relevant features of normal hematopoietic stem cells (HSCs), i.e. self-renewal potential and quiescent status. LSCs differ from their normal counterparts and from the more differentiated leukemic cells regarding the default status of pathways regulating apoptosis, cell cycle, telomere maintenance and transport pump activity. In addition, unique features regarding the interaction of these cells with the microenvironment have been characterized. Therapeutic strategies targeting these unique features are at different stages of development, but the reported results are promising. The aim of this review is, by taking acute myeloid leukemia (AML) as a bona fide example, to discuss some of the mechanisms used by LSCs to survive and the strategies which could be used to eradicate these cells.
Abstract:
Feature selection is one of the important and frequently used techniques in data preprocessing. It can improve the efficiency and the effectiveness of data mining by reducing the dimensions of the feature space and removing irrelevant and redundant information. Feature selection can be viewed as a global optimization problem of finding a minimum set of M relevant features that describes the dataset as well as the original N attributes. In this paper, we apply the adaptive partitioned random search strategy to our feature selection algorithm. Under this search strategy, a partition structure and an evaluation function are proposed for the feature selection problem. The algorithm ensures the globally optimal solution in theory and avoids complete randomness in the search direction. The good properties of our algorithm are shown through theoretical analysis.
Abstract:
We have established the first example of an orthotopic xenograft model of human nonseminomatous germ cell tumour (NSGCT). This reproducible model exhibits many clinically relevant features including metastases to the retroperitoneal lymph nodes and lungs, making it an ideal tool for research into the development and progression of testicular germ cell tumours.
Abstract:
This dissertation studies the relevant features in the formation of sale and rental prices, and also analyzes the differences between these attributes, for apartments in the city of Vitória/ES. It fills a gap not yet explored, given the possibility of comparing rental and sale prices. The theoretical framework is based on the hedonic pricing approach, applied in studies by Waugh (1928) and Court (1939), but formally developed by Lancaster (1966) and Rosen (1974), and applied and discussed by Palmquist (1984) and Sheppard (1999). The literature review shows that there are impacts both from the physical aspects of the properties and from external characteristics such as violence, ease of access, or the presence of train stations or markets in the surroundings, among others. The sample was drawn from a listing of properties offered on the Netimóveis website during May and June 2014, with 563 observations for sale and 185 for rental. In addition to these two samples, analyses were carried out on subsamples that included the condominium fee variable, in order to broaden the explanatory variables collected. The results were analyzed using descriptive statistics, correlation between variables and multiple regression, the last of which was applied to the 6 models proposed for each sample, after which a final model was proposed for sale and for rental. As for the hypotheses used and applied in the models, some were based on previous studies, and others, such as morning sun, were put forward as proposals. Many of the results corroborated previous studies, confirming that variables such as area, parking spaces, balcony, floor level and front-facing position of the unit, swimming pool and location in upscale neighborhoods have a positive impact on property prices, whether for sale or rental.
As for differences, it was possible to identify that the presence of an elevator, a playground and the condominium fee contribute positively to explaining the sale price, whereas the presence of a sports court, furniture and morning sun positively explain the rental value in the sample.
Abstract:
Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data from official statistics shows its usefulness.
Abstract:
In machine learning and pattern recognition tasks, the use of feature discretization techniques may have several advantages. The discretized features may hold enough information for the learning task at hand, while ignoring minor fluctuations that are irrelevant or harmful for that task. The discretized features have more compact representations that may yield both better accuracy and lower training time, as compared to the use of the original features. However, in many cases, mainly with medium and high-dimensional data, the large number of features usually implies that there is some redundancy among them. Thus, we may further apply feature selection (FS) techniques on the discrete data, keeping the most relevant features, while discarding the irrelevant and redundant ones. In this paper, we propose relevance and redundancy criteria for supervised feature selection techniques on discrete data. These criteria are applied to the bin-class histograms of the discrete features. The experimental results, on public benchmark data, show that the proposed criteria can achieve better accuracy than widely used relevance and redundancy criteria, such as mutual information and the Fisher ratio.
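The bin-class histograms mentioned in this abstract can be pictured with a small sketch. Equal-width binning is used here only as an assumed, simple discretizer. The abstract does not say which discretization is used, and the paper's relevance and redundancy criteria themselves are not reproduced. All names and values are illustrative.

```python
from collections import defaultdict

def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in [0, n_bins - 1] using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # a constant feature falls entirely into bin 0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def bin_class_histogram(bins, labels):
    """Count how many samples of each class fall into each bin of one feature."""
    hist = defaultdict(lambda: defaultdict(int))
    for b, y in zip(bins, labels):
        hist[b][y] += 1
    return {b: dict(counts) for b, counts in hist.items()}

# Toy feature (made-up values) whose low and high ranges match the two classes.
values = [0.1, 0.2, 0.3, 2.8, 2.9, 3.0]
labels = ["a", "a", "a", "b", "b", "b"]
bins = equal_width_bins(values, 3)
hist = bin_class_histogram(bins, labels)
```

A feature whose bins are each dominated by a single class, as in this toy case, is the kind of feature a relevance criterion computed on such histograms would be expected to favour.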
Abstract:
Dissertation submitted for the degree of Master in Biomedical Engineering
Abstract:
International businesses bring with them additional negotiation complexities and extra risks, thus calling for integrative negotiation solutions and additional legal protection. The recent economic crisis forced companies, including SMEs, to look for international markets and face these additional complexities and issues. In the search for a practical and simplified solution to serve less sophisticated companies, this paper brings insights from the negotiation literature to a specific legal issue. Specifically, I investigate the negotiation and use of contingent agreements as a tool for facilitating the negotiation process and managing risk in international deals. Looking into an international sale of goods from Portugal to Brazil, this paper proposes the structuring of two contingent contracts related to two categories of products in order to demonstrate the potential benefits of some of their relevant features, specifically the creation of incentives and the identification and allocation of future risks. In general, the structuring of contingent agreements is likely to provide positive results in mitigating the lack of trust and dealing with the additional risks derived from international deals, therefore facilitating and improving the overall quality of the deal.
Abstract:
During must fermentation by Saccharomyces cerevisiae strains, thousands of volatile aroma compounds are formed. The objective of the present work was to adapt computational approaches to analyze the pheno-metabolomic diversity of a S. cerevisiae strain collection with different origins. Phenotypic and genetic characterization together with individual must fermentations were performed, and metabolites relevant to aromatic profiles were determined. Experimental results were projected onto a common coordinate system, revealing 17 statistically relevant multi-dimensional modules, combining sets of most-correlated features of noteworthy biological importance. As a breakthrough, the present method made it possible to combine genetic, phenotypic and metabolomic data, which had not been possible so far due to difficulties in comparing different types of data. The proposed computational approach thus proved successful in shedding light on the holistic characterization of the S. cerevisiae pheno-metabolome under must fermentation conditions. This will allow the identification of combined relevant features with application in the selection of good winemaking strains.