989 results for web clustering
Abstract:
This paper presents a hierarchical clustering method for semantic Web service discovery. This method aims to improve the accuracy and efficiency of the traditional service discovery using vector space model. The Web service is converted into a standard vector format through the Web service description document. With the help of WordNet, a semantic analysis is conducted to reduce the dimension of the term vector and to make semantic expansion to meet the user’s service request. The process and algorithm of hierarchical clustering based semantic Web service discovery is discussed. Validation is carried out on the dataset.
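The vector-space and hierarchical-clustering steps described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the service descriptions, the `threshold` value and the choice of average linkage are assumptions made for illustration, and the WordNet-based dimension reduction and semantic expansion are left out.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector for one service description."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the
    best average similarity falls below `threshold` (hypothetical cut-off).
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        sim, x, y = max(
            (avg_sim(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if sim < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical service descriptions, standing in for text extracted
# from Web service description documents.
services = [
    "book hotel room reservation",
    "hotel room booking service",
    "currency exchange rate conversion",
    "convert currency exchange",
]
vectors = [term_vector(s) for s in services]
clusters = agglomerate(vectors, threshold=0.3)
print(clusters)
```

A request such as "hotel reservation" could then be compared against one representative vector per cluster rather than against every service, which is one way clustering can cut down discovery cost.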
Abstract:
The exponential growth of the Internet, coupled with the increasing popularity of dynamically generated content on the World Wide Web, has created the need for more and faster Web servers capable of serving the over 100 million Internet users. Server clustering has emerged as a promising technique to build scalable Web servers. In this article we examine the seminal work, early products, and a sample of contemporary commercial offerings in the field of transparent Web server clustering. We broadly classify transparent server clustering into three categories.
Abstract:
Web document cluster analysis plays an important role in information retrieval by organizing large amounts of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two-level (document and term) knowledge granularity but ignores the bridging paragraph granularity. However, this two-level granularity may lead to unsatisfactory clustering results with “false correlation”. To deal with this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and thus web document clusters of higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with ontology all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
Abstract:
In the last few years, information systems have grown exponentially, and parking information systems are one more example. Reliable, up-to-date information on parking slot availability is important for reducing traffic. Parking slot prediction is also a new topic that is already being applied: San Francisco in the United States and Santander in Spain are examples of projects carried out to obtain this kind of information. The aim of this thesis is to study and evaluate methodologies for parking slot prediction and to integrate them into a web application where all kinds of users can see the current parking status as well as the future status according to the model's predictions. The source of the data is ancillary to this work, but it must be understood in order to understand parking behaviour. Many modelling techniques are used for this purpose, such as time series analysis, decision trees, neural networks and clustering. In this work, the author explains the techniques best suited to the problem, analyzes the results and points out the advantages and disadvantages of each one. The model learns the periodic and seasonal patterns of parking status behaviour and, with this knowledge, can predict future status values for a given date. The data come from Smart Park Ontinyent; they consist of parking occupancy status together with timestamps and are stored in a database. After data acquisition, data analysis and pre-processing were needed before the models could be implemented. The first test used a boosting ensemble classifier over a set of decision trees, created with the C5.0 algorithm from a set of training samples, to assign a prediction value to each object. In addition to the predictions, this work provides error measurements that indicate how reliable the predictions are.
The second test fitted a seasonal exponential smoothing model (tbats). The last test tried a combination of the previous two models to see how the combination performs. The results were quite good for all of them, with average errors of 6.2, 6.6 and 5.4 in vacancy predictions for the three models respectively. For a car park of 47 places, this means a 10% average error in parking slot predictions. This result could be even better with a longer data record. To make this information visible and reachable for anyone with an internet-connected device, a web application was built. Besides displaying the data, the application offers several functions that make searching for parking easier. The new functions, apart from parking prediction, are:
- Park distances from user location. It provides the distances from the user's current location to the different car parks in the city.
- Geocoding. The service for matching a literal description or an address to a concrete location.
- Geolocation. The service for positioning the user.
- Parking list panel. This is neither a service nor a function, just a better visualization and handling of the information.
Abstract:
There are several analysis methods that perform a global clustering of a series of microarray samples, such as Self-Organizing Maps, or that perform local clustering taking into account only a subset of co-expressed genes, such as Biclustering, among others. In this project a web application has been developed: PCOPSamplecl, a tool belonging to the local clustering methods, which does not look for subsets of co-expressed genes (analysis of linear relationships) but for pairs of genes whose expression relationship fluctuates in response to phenotypic changes. The results of PCOPSamplecl are the different final cluster distributions and the gene pairs involved in these phenotypic changes. These gene pairs can then be studied to find the cause and effect of the phenotypic change. In addition, the tool facilitates the study of the dependencies between the different cluster distributions provided by the application, so that the intersection between clusters or the appearance of subclusters can be studied (2 clusters of the same cluster grouping may be subclusters of other clusters from different cluster distributions). The tool is available on the server: http://revolutionresearch.uab.es/
Abstract:
This project carries out research both on finding predictors via clustering techniques and on reviewing free Data Mining software. The research is based on a case study in which, in addition to the free KDD software used by the scientific community, a new free tool for pre-processing the data is presented. The predictors are intended for the e-learning domain, as the data from which they have to be inferred are student qualifications from different e-learning environments. Through our case study, not only are clustering algorithms tested, but additional goals are also proposed.
Abstract:
A recurring task in the analysis of mass genome annotation data from high-throughput technologies is the identification of peaks or clusters in a noisy signal profile. Examples of such applications are the definition of promoters on the basis of transcription start site profiles, the mapping of transcription factor binding sites based on ChIP-chip data and the identification of quantitative trait loci (QTL) from whole genome SNP profiles. Input to such an analysis is a set of genome coordinates associated with counts or intensities. The output consists of a discrete number of peaks with respective volumes, extensions and center positions. We have developed for this purpose a flexible one-dimensional clustering tool, called MADAP, which we make available as a web server and as standalone program. A set of parameters enables the user to customize the procedure to a specific problem. The web server, which returns results in textual and graphical form, is useful for small to medium-scale applications, as well as for evaluation and parameter tuning in view of large-scale applications, requiring a local installation. The program written in C++ can be freely downloaded from ftp://ftp.epd.unil.ch/pub/software/unix/madap. The MADAP web server can be accessed at http://www.isrec.isb-sib.ch/madap/.
Abstract:
Economy, and consequently trade, is a fundamental part of human social organization which, until now, has not been studied within the network modeling framework. Here we present the first, to the best of our knowledge, empirical characterization of the world trade web, that is, the network built upon the trade relationships between different countries in the world. This network displays the typical properties of complex networks, namely, scale-free degree distribution, the small-world property, a high clustering coefficient, and, in addition, degree-degree correlation between different vertices. All these properties make the world trade web a complex network, which is far from being well described through a classical random network description.
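For readers unfamiliar with the clustering coefficient named in this abstract, the sketch below computes the standard local clustering coefficient: the fraction of a node's neighbour pairs that are themselves linked. The four-node graph is a made-up toy example, not trade data, and the node labels are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Local clustering coefficient of `node` in an undirected graph.

    `adj` maps each node to the set of its neighbours. Nodes with fewer
    than two neighbours are given a coefficient of 0.
    """
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count the links that actually exist between pairs of neighbours.
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Toy undirected graph (hypothetical): a triangle A-B-C plus a pendant node D.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

for node in sorted(adj):
    print(node, local_clustering(adj, node))
print("average", average_clustering(adj))
```

In a classical random graph with the same number of nodes and links, the expected coefficient is roughly the link density, which is why a high measured value is taken as evidence against a random-network description.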
Abstract:
Many recent Web 2.0 resource sharing applications can be subsumed under the "folksonomy" moniker. Regardless of the type of resource shared, all of these share a common structure describing the assignment of tags to resources by users. In this report, we generalize the notions of clustering and characteristic path length which play a major role in the current research on networks, where they are used to describe the small-world effects on many observable network datasets. To that end, we show that the notion of clustering has two facets which are not equivalent in the generalized setting. The new measures are evaluated on two large-scale folksonomy datasets from resource sharing systems on the web.
Abstract:
In this class, we will discuss network theory fundamentals, including concepts such as diameter, distance, clustering coefficient and others. We will also discuss different types of networks, such as scale-free networks and random networks. Readings: A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener, Graph structure in the Web, Computer Networks 33, 309-320 (2000). Optional: M.E.J. Newman, The Structure and Function of Complex Networks, SIAM Review 45, 167-256 (2003). Original course at: http://kmi.tugraz.at/staff/markus/courses/SS2008/707.000_web-science/
Abstract:
Server responsiveness and scalability are more important than ever in today’s client/server dominated network environments. Recently, researchers have begun to consider cluster-based computers using commodity hardware as an alternative to expensive specialized hardware for building scalable Web servers. In this paper, we present performance results comparing two cluster-based Web servers based on different server infrastructures: MAC-based dispatching (LSMAC) and IP-based dispatching (LSNAT). Both cluster-based server systems were implemented as application-space programs running on commodity hardware. We point out the advantages and disadvantages of both systems. We also identify when servers should be clustered and when clustering will not improve performance.
Abstract:
Background: Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space. Results: Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster. Conclusion: Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
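To make the simplex constraint concrete, here is a minimal sketch of one standard tool from compositional data analysis: the Aitchison distance, computed as the Euclidean distance between centred log-ratio (clr) transforms. This illustrates the general framework only and is not taken from Simcluster; the tag counts are invented, and the sketch assumes strictly positive counts (zero counts would need separate handling).

```python
import math

def closure(counts):
    """Normalize counts to proportions that sum to 1 (a point on the simplex)."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(composition):
    """Centred log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    cx, cy = clr(closure(x)), clr(closure(y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))

# Hypothetical tag counts for three transcripts in three libraries.
library_a = [10, 20, 70]
library_b = [100, 200, 700]   # same proportions, ten times the sequencing depth
library_c = [40, 40, 20]

print(aitchison_distance(library_a, library_b))
print(aitchison_distance(library_a, library_c))
```

Because the distance depends only on ratios between parts, two libraries with identical proportions but different sequencing depth are at distance zero (up to rounding), whereas a Euclidean distance on the raw counts would separate them.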
Abstract:
In recent decades, bioinformatics has played a fundamental role in making sense of the huge amount of data produced. Once the complete sequence of a genome has been obtained, the major problem is to learn as much as possible about its coding regions. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As recently pointed out by the Critical Assessment of Function Annotations (CAFA), the most accurate methods are those based on the transfer-by-homology approach, and the most incisive contribution is given by cross-genome comparisons. The present thesis describes a non-hierarchical sequence clustering method for automatic large-scale protein annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 million protein sequences, characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three-dimensional structure (when a template is available). This is made possible by cluster-specific HMM profiles, which can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications were developed during my doctorate, including the prediction of magnesium binding sites in human proteins, the classification of the ABC transporter superfamily and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, BAR+ is freely available as a web server for functional and structural protein sequence annotation at http://bar.biocomp.unibo.it/bar2.0.