947 resultados para Data mining, Text mining gerarchico, Classificazione semantica, Hierarchical text classification, Tassonomie web


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The purpose of this paper is to explain the notion of clustering and a concrete clustering method- agglomerative hierarchical clustering algorithm. It shows how a data mining method like clustering can be applied to the analysis of stocks, traded on the Bulgarian Stock Exchange in order to identify similar temporal behavior of the traded stocks. This problem is solved with the aid of a data mining tool that is called XLMiner™ for Microsoft Excel Office.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Il citofluorimetro è uno strumento impiegato in biologia genetica per analizzare dei campioni cellulari: esso, analizza individualmente le cellule contenute in un campione ed estrae, per ciascuna cellula, una serie di proprietà fisiche, feature, che la descrivono. L’obiettivo di questo lavoro è mettere a punto una metodologia integrata che utilizzi tali informazioni modellando, automatizzando ed estendendo alcune procedure che vengono eseguite oggi manualmente dagli esperti del dominio nell’analisi di alcuni parametri dell’eiaculato. Questo richiede lo sviluppo di tecniche biochimiche per la marcatura delle cellule e tecniche informatiche per analizzare il dato. Il primo passo prevede la realizzazione di un classificatore che, sulla base delle feature delle cellule, classifichi e quindi consenta di isolare le cellule di interesse per un particolare esame. Il secondo prevede l'analisi delle cellule di interesse, estraendo delle feature aggregate che possono essere indicatrici di certe patologie. Il requisito è la generazione di un report esplicativo che illustri, nella maniera più opportuna, le conclusioni raggiunte e che possa fungere da sistema di supporto alle decisioni del medico/biologo.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Trabalho de Projeto para obtenção do grau de Mestre em Engenharia Informática e de Computadores

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Ubiquitous computing software needs to be autonomous so that essential decisions such as how to configure its particular execution are self-determined. Moreover, data mining serves an important role for ubiquitous computing by providing intelligence to several types of ubiquitous computing applications. Thus, automating ubiquitous data mining is also crucial. We focus on the problem of automatically configuring the execution of a ubiquitous data mining algorithm. In our solution, we generate configuration decisions in a resource aware and context aware manner since the algorithm executes in an environment in which the context often changes and computing resources are often severely limited. We propose to analyze the execution behavior of the data mining algorithm by mining its past executions. By doing so, we discover the effects of resource and context states as well as parameter settings on the data mining quality. We argue that a classification model is appropriate for predicting the behavior of an algorithm?s execution and we concentrate on decision tree classifier. We also define taxonomy on data mining quality so that tradeoff between prediction accuracy and classification specificity of each behavior model that classifies by a different abstraction of quality, is scored for model selection. Behavior model constituents and class label transformations are formally defined and experimental validation of the proposed approach is also performed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

It is a big challenge to acquire correct user profiles for personalized text classification since users may be unsure in providing their interests. Traditional approaches to user profiling adopt machine learning (ML) to automatically discover classification knowledge from explicit user feedback in describing personal interests. However, the accuracy of ML-based methods cannot be significantly improved in many cases due to the term independence assumption and uncertainties associated with them. This paper presents a novel relevance feedback approach for personalized text classification. It basically applies data mining to discover knowledge from relevant and non-relevant text and constraints specific knowledge by reasoning rules to eliminate some conflicting information. We also developed a Dempster-Shafer (DS) approach as the means to utilise the specific knowledge to build high-quality data models for classification. The experimental results conducted on Reuters Corpus Volume 1 and TREC topics support that the proposed technique achieves encouraging performance in comparing with the state-of-the-art relevance feedback models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The low accuracy rates of textshape dividers for digital ink diagrams are hindering their use in real world applications. While recognition of handwriting is well advanced and there have been many recognition approaches proposed for hand drawn sketches, there has been less attention on the division of text and drawing ink. Feature based recognition is a common approach for textshape division. However, the choice of features and algorithms are critical to the success of the recognition. We propose the use of data mining techniques to build more accurate textshape dividers. A comparative study is used to systematically identify the algorithms best suited for the specific problem. We have generated dividers using data mining with diagrams from three domains and a comprehensive ink feature library. The extensive evaluation on diagrams from six different domains has shown that our resulting dividers, using LADTree and LogitBoost, are significantly more accurate than three existing dividers.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

La tesi riguarda lo sviluppo di recommender system che hanno lo scopo di supportare chi è alla ricerca di un lavoro e le aziende che devono selezionare la giusta figura. A partire da un insieme di skill il sistema suggerisce alla persona la posizione lavorativa più affine al suo profilo, oppure a partire da una specifica posizione lavorativa suggerisce all'azienda la persona che più si avvicina alle sue esigenze.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the advent of Service Oriented Architecture, Web Services have gained tremendous popularity. Due to the availability of a large number of Web services, finding an appropriate Web service according to the requirement of the user is a challenge. This warrants the need to establish an effective and reliable process of Web service discovery. A considerable body of research has emerged to develop methods to improve the accuracy of Web service discovery to match the best service. The process of Web service discovery results in suggesting many individual services that partially fulfil the user’s interest. By considering the semantic relationships of words used in describing the services as well as the use of input and output parameters can lead to accurate Web service discovery. Appropriate linking of individual matched services should fully satisfy the requirements which the user is looking for. This research proposes to integrate a semantic model and a data mining technique to enhance the accuracy of Web service discovery. A novel three-phase Web service discovery methodology has been proposed. The first phase performs match-making to find semantically similar Web services for a user query. In order to perform semantic analysis on the content present in the Web service description language document, the support-based latent semantic kernel is constructed using an innovative concept of binning and merging on the large quantity of text documents covering diverse areas of domain of knowledge. The use of a generic latent semantic kernel constructed with a large number of terms helps to find the hidden meaning of the query terms which otherwise could not be found. Sometimes a single Web service is unable to fully satisfy the requirement of the user. In such cases, a composition of multiple inter-related Web services is presented to the user. The task of checking the possibility of linking multiple Web services is done in the second phase. Once the feasibility of linking Web services is checked, the objective is to provide the user with the best composition of Web services. In the link analysis phase, the Web services are modelled as nodes of a graph and an allpair shortest-path algorithm is applied to find the optimum path at the minimum cost for traversal. The third phase which is the system integration, integrates the results from the preceding two phases by using an original fusion algorithm in the fusion engine. Finally, the recommendation engine which is an integral part of the system integration phase makes the final recommendations including individual and composite Web services to the user. In order to evaluate the performance of the proposed method, extensive experimentation has been performed. Results of the proposed support-based semantic kernel method of Web service discovery are compared with the results of the standard keyword-based information-retrieval method and a clustering-based machine-learning method of Web service discovery. The proposed method outperforms both information-retrieval and machine-learning based methods. Experimental results and statistical analysis also show that the best Web services compositions are obtained by considering 10 to 15 Web services that are found in phase-I for linking. Empirical results also ascertain that the fusion engine boosts the accuracy of Web service discovery by combining the inputs from both the semantic analysis (phase-I) and the link analysis (phase-II) in a systematic fashion. Overall, the accuracy of Web service discovery with the proposed method shows a significant improvement over traditional discovery methods.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The wide range of contributing factors and circumstances surrounding crashes on road curves suggest that no single intervention can prevent these crashes. This paper presents a novel methodology, based on data mining techniques, to identify contributing factors and the relationship between them. It identifies contributing factors that influence the risk of a crash. Incident records, described using free text, from a large insurance company were analysed with rough set theory. Rough set theory was used to discover dependencies among data, and reasons using the vague, uncertain and imprecise information that characterised the insurance dataset. The results show that male drivers, who are between 50 and 59 years old, driving during evening peak hours are involved with a collision, had a lowest crash risk. Drivers between 25 and 29 years old, driving from around midnight to 6 am and in a new car has the highest risk. The analysis of the most significant contributing factors on curves suggests that drivers with driving experience of 25 to 42 years, who are driving a new vehicle have the highest crash cost risk, characterised by the vehicle running off the road and hitting a tree. This research complements existing statistically based tools approach to analyse road crashes. Our data mining approach is supported with proven theory and will allow road safety practitioners to effectively understand the dependencies between contributing factors and the crash type with the view to designing tailored countermeasures.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This thesis is a study for automatic discovery of text features for describing user information needs. It presents an innovative data-mining approach that discovers useful knowledge from both relevance and non-relevance feedback information. The proposed approach can largely reduce noises in discovered patterns and significantly improve the performance of text mining systems. This study provides a promising method for the study of Data Mining and Web Intelligence.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Identifying product families has been considered as an effective way to accommodate the increasing product varieties across the diverse market niches. In this paper, we propose a novel framework to identifying product families by using a similarity measure for a common product design data BOM (Bill of Materials) based on data mining techniques such as frequent mining and clus-tering. For calculating the similarity between BOMs, a novel Extended Augmented Adjacency Matrix (EAAM) representation is introduced that consists of information not only of the content and topology but also of the fre-quent structural dependency among the various parts of a product design. These EAAM representations of BOMs are compared to calculate the similarity between products and used as a clustering input to group the product fami-lies. When applied on a real-life manufacturing data, the proposed framework outperforms a current baseline that uses orthogonal Procrustes for grouping product families.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Telecommunications network management is based on huge amounts of data that are continuously collected from elements and devices from all around the network. The data is monitored and analysed to provide information for decision making in all operation functions. Knowledge discovery and data mining methods can support fast-pace decision making in network operations. In this thesis, I analyse decision making on different levels of network operations. I identify the requirements decision-making sets for knowledge discovery and data mining tools and methods, and I study resources that are available to them. I then propose two methods for augmenting and applying frequent sets to support everyday decision making. The proposed methods are Comprehensive Log Compression for log data summarisation and Queryable Log Compression for semantic compression of log data. Finally I suggest a model for a continuous knowledge discovery process and outline how it can be implemented and integrated to the existing network operations infrastructure.