988 results for web clustering
Abstract:
The continued use of traditional lecturing across Higher Education as the main teaching and learning approach in many disciplines must be challenged. An increasing number of studies suggest that this approach is the least effective when compared with more active learning methods. In counterargument, the use of traditional lectures is often justified as necessary given a large student population. By analysing the implementation of a web-based broadcasting approach which replaced the traditional lecture within a programming-based module, and thereby removed the student population rationale, it was hoped that the student learning experience would become more active and ultimately enhance learning on the module. The implemented model replaces the traditional approach of students attending an on-campus lecture theatre with a web-based live broadcast approach that focuses on students being active learners rather than passive recipients. Students ‘attend’ by viewing a live broadcast of the lecturer, presented as a talking head, and the lecturer’s desktop, via a web browser. Video and audio communication is primarily from tutor to students, with text-based comments used to provide communication from students to tutor. This approach promotes active learning by allowing students to perform activities on their own computers rather than the passive viewing and listening commonly encountered in large lecture classes. Analysis of this approach over two years (n = 234 students) indicates that 89.6% of students rated it as offering a highly positive learning experience. Comparing student performance across three academic years also indicates a positive change. A small data-analytic study of student participation levels suggests that the student cohort's willingness to engage with the broadcast lecture material is high.
Abstract:
Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters and outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms toward this goal. They generate clusters with associated rules that are composed of conditions on word occurrences or nonoccurrences. The proposed approaches vary in the complexity of the format of the rules: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easy to interpret, whereas the former lead to more accurate clustering. We show that our approaches outperform the unsupervised decision tree approach for rule-generating clustering and also an approach we provide for generating interpretable models for general clusterings, both by significant margins. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
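To make the rule format concrete, the following is a minimal sketch of how an RGC-D-style rule (a disjunction of word-presence conditions) could determine cluster membership. It does not reproduce the rule-generation algorithm itself, which the abstract does not detail; the names `DisjunctiveRule`, `covers` and `cluster_by_rules`, along with the sample rules and documents, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisjunctiveRule:
    """Hypothetical rule: satisfied when a document contains any of `words`."""
    words: frozenset

    def covers(self, document: str) -> bool:
        # Whitespace tokenisation is a simplification for this sketch.
        tokens = set(document.lower().split())
        return not self.words.isdisjoint(tokens)


def cluster_by_rules(documents, rules):
    """Each cluster is exactly the set of documents satisfying its rule."""
    return {
        name: [doc for doc in documents if rule.covers(doc)]
        for name, rule in rules.items()
    }


documents = [
    "the striker scored a late goal",
    "parliament passed the budget vote",
    "the goalkeeper saved a penalty",
]
rules = {
    "sport": DisjunctiveRule(frozenset({"goal", "penalty"})),
    "politics": DisjunctiveRule(frozenset({"parliament", "election"})),
}
clusters = cluster_by_rules(documents, rules)
```

Because membership is defined entirely by the rule, editing a rule's word set reconfigures the cluster directly, which is the sense of reconfigurability the abstract appeals to.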
Abstract:
Most traditional data mining algorithms struggle to cope efficiently with the sheer scale of data. In this paper, we propose a general framework to accelerate existing clustering algorithms on large-scale datasets which contain large numbers of attributes, items, and clusters. Our framework makes use of locality sensitive hashing (LSH) to significantly reduce the cluster search space. We also theoretically prove that our framework has a guaranteed error bound in terms of the clustering quality. This framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to show how the framework can be applied. We validated our framework with five synthetic datasets and a real-world Yahoo! Answers dataset. The experimental results demonstrate that our framework speeds up the existing clustering algorithm by a factor of between 2 and 6, while maintaining comparable cluster purity.
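The idea of using LSH to shrink the cluster search space can be sketched as follows. This is not the authors' scheme, whose hash family and error bound are not given in the abstract; it assumes bit-sampling LSH over categorical attributes with Hamming distance, and the class `ModeIndex` and its parameters (`num_tables`, `attrs_per_key`, `seed`) are hypothetical.

```python
import random


def hamming(a, b):
    """Number of attribute positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


class ModeIndex:
    """Bit-sampling LSH index over K-Modes cluster modes (illustrative only)."""

    def __init__(self, modes, num_tables=4, attrs_per_key=2, seed=0):
        rng = random.Random(seed)
        self.modes = modes
        num_attrs = len(modes[0])
        # Each table hashes a record by its values on a random attribute subset.
        self.projections = [
            rng.sample(range(num_attrs), attrs_per_key) for _ in range(num_tables)
        ]
        self.tables = []
        for proj in self.projections:
            table = {}
            for cluster_id, mode in enumerate(modes):
                table.setdefault(self._key(mode, proj), set()).add(cluster_id)
            self.tables.append(table)

    @staticmethod
    def _key(record, proj):
        return tuple(record[i] for i in proj)

    def candidates(self, record):
        """Clusters whose mode collides with `record` in at least one table."""
        found = set()
        for proj, table in zip(self.projections, self.tables):
            found |= table.get(self._key(record, proj), set())
        # Fall back to a full scan when no bucket matches.
        return found or set(range(len(self.modes)))

    def assign(self, record):
        """Nearest mode by Hamming distance, searched among candidates only."""
        return min(self.candidates(record),
                   key=lambda c: hamming(record, self.modes[c]))


modes = [("red", "small", "round", "matte"), ("blue", "large", "square", "gloss")]
index = ModeIndex(modes)
cluster = index.assign(("red", "small", "round", "gloss"))
```

With many clusters, `candidates` typically returns far fewer modes than a full scan, which is where a speed-up would come from; the price is that the true nearest mode can occasionally be missed.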
Abstract:
The application of sensor-based technology within activity monitoring systems is becoming a popular technique within the smart environment paradigm. Nevertheless, such an approach generates complex constructs of data, which subsequently require intricate activity recognition techniques to infer the underlying activity automatically. This paper explores a cluster-based ensemble method as a new solution for activity recognition within smart environments. With this approach, activities are modelled as collections of clusters built on different subsets of features. Classification is performed by assigning a new instance to its closest cluster from each collection. Two different sensor data representations have been investigated, namely numeric and binary. The evaluation of the proposed methodology demonstrates that the cluster-based ensemble method is a viable option for activity recognition. On data collected from a range of activities, the ensemble method achieved accuracies of 94.2% and 97.5% for numeric and binary data, respectively. These results outperformed a range of single classifiers considered as benchmarks.
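The classification step described above might look like the following sketch. The abstract does not say how the per-collection decisions are combined, so majority voting is an assumption here, as are the function names, the hand-written centroids and the sensor features.

```python
import math
from collections import Counter


def closest_label(instance, feature_idx, clusters):
    """Label of the nearest centroid, using only the features in `feature_idx`."""
    projected = [instance[i] for i in feature_idx]
    _, label = min(clusters, key=lambda cl: math.dist(projected, cl[0]))
    return label


def ensemble_predict(instance, collections):
    """Majority vote over the closest cluster in each collection (assumed rule)."""
    votes = Counter(
        closest_label(instance, feature_idx, clusters)
        for feature_idx, clusters in collections
    )
    return votes.most_common(1)[0][0]


# Hypothetical binary sensor features: [kettle, tap, bed_pressure, tv].
# Each collection pairs a feature subset with (centroid, activity) clusters.
collections = [
    ((0, 1), [((1.0, 1.0), "make_tea"), ((0.0, 0.0), "sleep")]),
    ((2, 3), [((0.0, 0.0), "make_tea"), ((1.0, 0.0), "sleep")]),
    ((0, 2), [((1.0, 0.0), "make_tea"), ((0.0, 1.0), "sleep")]),
]
prediction = ensemble_predict([1, 1, 0, 0], collections)
```

In practice the centroids would come from clustering training data on each feature subset rather than being written by hand.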
Abstract:
One of the most popular techniques for generating classifier ensembles is stacking, which is based on a meta-learning approach. In this paper, we introduce an alternative to stacking that is based on cluster analysis. As in stacking, instances from a validation set are initially classified by all base classifiers. The output of each classifier is subsequently considered as a new attribute of the instance. Following this, the validation set is divided into clusters according to the new attributes and a small subset of the original attributes of the instances. For each cluster, we find its centroid and calculate its class label. The collection of centroids is considered as a meta-classifier. Experimental results show that the new method outperformed all benchmark methods, namely Majority Voting, Stacking J48, Stacking LR, AdaBoost J48, and Random Forest, in 12 out of 22 data sets. The proposed method has two advantageous properties: it is very robust to relatively small training sets and it can be applied to semi-supervised learning problems. We also provide a theoretical investigation of the proposed method, which demonstrates that, for the method to be successful, the base classifiers in the ensemble should have accuracy greater than 50%.
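The pipeline in this abstract can be sketched roughly as below. Several details are assumptions rather than the authors' method: the clustering is a tiny k-means with deterministic initialisation, only the base-classifier outputs are used as clustering attributes (the subset of original attributes is omitted), each centroid takes the majority class of its cluster, and prediction uses the nearest centroid. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Tiny k-means; the first k distinct points serve as initial centroids."""
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    dims = range(len(points[0]))
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for idx, p in enumerate(points):
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(idx)
        # Recompute each centroid as the mean of its group; keep empty ones.
        centroids = [
            tuple(sum(points[i][d] for i in g) / len(g) for d in dims)
            if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids, groups


def build_meta_classifier(base_classifiers, X_val, y_val, k):
    """Cluster validation instances on base outputs and label each centroid."""
    meta = [tuple(float(clf(x)) for clf in base_classifiers) for x in X_val]
    centroids, groups = kmeans(meta, k)
    return [
        (centroid, Counter(y_val[i] for i in group).most_common(1)[0][0])
        for centroid, group in zip(centroids, groups) if group
    ]


def meta_predict(base_classifiers, labelled_centroids, x):
    """Class label of the centroid nearest to the base-classifier outputs."""
    meta = tuple(float(clf(x)) for clf in base_classifiers)
    return min(labelled_centroids, key=lambda cl: math.dist(meta, cl[0]))[1]


# Toy base classifiers: thresholds on a single numeric attribute.
base = [lambda x: int(x[0] > 2), lambda x: int(x[0] > 4), lambda x: int(x[0] > 6)]
X_val = [(1,), (3,), (5,), (7,), (9,)]
y_val = [0, 0, 1, 1, 1]
model = build_meta_classifier(base, X_val, y_val, k=3)
label = meta_predict(base, model, (8,))
```

Unlike stacking, no second-level learner is trained here: the labelled centroids themselves act as the meta-classifier.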
Abstract:
We consider the problem of linking web search queries to entities from a knowledge base such as Wikipedia. Such linking enables converting a user’s web search session to a footprint in the knowledge base that could be used to enrich the user profile. Traditional methods for entity linking have been directed towards finding entity mentions in text documents such as news reports, each of which is possibly linked to multiple entities, enabling the use of measures like entity set coherence. Since web search queries are very small text fragments, criteria that rely on the existence of a multitude of mentions do not work well on them. We propose a three-phase method for linking web search queries to Wikipedia entities. The first phase performs IR-style scoring of entities against the search query to narrow the candidates down to a subset of entities, which the second phase expands to a larger set using hyperlink information. Lastly, we use a graph traversal approach to identify the top entities to link the query to. Through an empirical evaluation on real-world web search queries, we illustrate that our methods significantly enhance linking accuracy over state-of-the-art methods.
Abstract:
The past decade has witnessed unprecedented growth in the amount of available digital content, and its volume is expected to continue to grow over the next few years. Unstructured text data generated from web and enterprise sources form a large fraction of such content. Many of these sources contain large volumes of reusable data, such as solutions to frequently occurring problems and general know-how that may be reused in appropriate contexts. In this work, we address issues around leveraging unstructured text data from sources as diverse as the web and the enterprise within the Case-based Reasoning framework. Case-based Reasoning (CBR) provides a framework and methodology for the systematic reuse of historical knowledge, available in the form of problem-solution pairs, in solving new problems. Here, we consider possibilities for enhancing Textual CBR systems under three main themes: procurement, maintenance and retrieval. We adapt and build upon state-of-the-art techniques from data mining and natural language processing in addressing various challenges therein. Under procurement, we investigate the problem of extracting cases (i.e., problem-solution pairs) from data sources such as incident/experience reports. We develop case-base maintenance methods, specifically tuned to text, that aim to retain solutions such that the utility of the filtered case base in solving new problems is maximized. Further, we address the problem of query suggestions for textual case bases and show that exploiting the problem-solution partition can enhance retrieval effectiveness by prioritizing more useful query suggestions. Additionally, we illustrate interpretable clustering as a tool to drill down into domain-specific text collections (since CBR systems are usually very domain specific) and develop techniques for improved similarity assessment in social media sources such as microblogs. Through extensive empirical evaluations, we illustrate the improvements that we are able to achieve over state-of-the-art methods for the respective tasks.
Abstract:
The risks associated with zoonotic infections transmitted by companion animals are a serious public health concern: controlling the incidence of zoonoses in domestic dogs, both owned and stray, is therefore important to protect human health. Integrated dog population management (DPM) programs, based on the availability of information systems providing reliable data on the structure and composition of the existing dog population in a given area, are fundamental for making realistic plans for any disease surveillance and action system. Traceability systems, based on the compulsory electronic identification of dogs and their registration in a computerised database, are one of the most effective ways to ensure the usefulness of DPM programs. Although this approach provides many advantages, several areas for improvement have emerged in countries where it has been applied. In Italy, every region hosts its own dog register, but these registers are not compatible with one another. This paper shows the advantages of a web-based application for improving the data management of regional dog registers. The approach used for building this system was inspired by farm animal traceability schemes, and it relies on a network of services that allows multi-channel access by different devices and data exchange via the web with other existing applications, without changing the pre-existing platforms. Today the system manages a database of over 300,000 dogs registered in three different Italian regions. By integrating multiple Web Services, this approach could offer a way to gather data at national and international levels at reasonable cost and to create a large-scale, cross-border traceability system that can be used for disease surveillance and the development of population management plans. © 2012 Elsevier B.V.
Abstract:
Social networks generally display a positively skewed degree distribution and higher values for clustering coefficient and degree assortativity than would be expected from the degree sequence. For some types of simulation studies, these properties need to be varied in the artificial networks over which simulations are to be conducted. Various algorithms to generate networks have been described in the literature but their ability to control all three of these network properties is limited. We introduce a spatially constructed algorithm that generates networks with constrained but arbitrary degree distribution, clustering coefficient and assortativity. Both a general approach and specific implementation are presented. The specific implementation is validated and used to generate networks with a constrained but broad range of property values. © Copyright JASSS.
Abstract:
In an era marked by new information and communication technologies, the business sector has to contend with the need to stand out. Innovating in the way they contact clients (or prospective clients) and promoting their brand are objectives that companies pursue when investing in their online presence. In Web 2.0, information sharing, instant contact, immediate feedback and (apparent) proximity are taken to the extreme and are presented as arguments capable of bringing about profound changes in online business communication strategies. Exploring the latest Web 2.0 trends and tools in the online presence of organizations, and drawing on an extensive literature review, the application and analysis of questionnaire surveys, and the observation of organizational presences on the World Wide Web, this study seeks to understand "how national companies are integrating Web 2.0 features/tools into their online presence".