791 results for Web Data Mining
Abstract:
Clinicians could model a patient's brain injury through the patient's brain activity. However, how this model is defined and how it changes while the patient recovers remain unanswered questions. This paper proposes the MedVir framework to answer them. Based on complex data mining techniques, the framework not only differentiates TBI patients from control subjects (with 72% accuracy using 0.632 Bootstrap validation), but also detects whether a patient may recover or not, all quickly and easily through an interactive visualization technique.
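The 0.632 Bootstrap validation mentioned above can be illustrated with a minimal sketch. It assumes a toy one-dimensional nearest-centroid classifier and made-up data; the classifier, the function names and the number of rounds are illustrative assumptions and are not taken from MedVir. The estimator weights the optimistic resubstitution accuracy by 0.368 and the out-of-bag accuracy by 0.632.

```python
import random


def nearest_centroid_fit(xs, ys):
    """Toy 1-D classifier (an assumption for illustration): mean value per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def nearest_centroid_predict(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def accuracy(model, xs, ys):
    hits = sum(nearest_centroid_predict(model, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def bootstrap_632_accuracy(xs, ys, n_rounds=200, seed=0):
    """Estimate accuracy as 0.368 * resubstitution + 0.632 * out-of-bag."""
    rng = random.Random(seed)
    n = len(xs)
    # Resubstitution accuracy: fit and score on the same full data set.
    resub = accuracy(nearest_centroid_fit(xs, ys), xs, ys)
    oob_scores = []
    for _ in range(n_rounds):
        # Draw n indices with replacement; the indices never drawn are out-of-bag.
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        if not oob:
            continue  # no held-out items in this round
        model = nearest_centroid_fit([xs[i] for i in idx], [ys[i] for i in idx])
        oob_scores.append(accuracy(model, [xs[i] for i in oob], [ys[i] for i in oob]))
    oob_acc = sum(oob_scores) / len(oob_scores)
    return 0.368 * resub + 0.632 * oob_acc
```

Calling `bootstrap_632_accuracy(features, labels)` on a list of feature values and a parallel list of class labels returns a single accuracy estimate between 0 and 1.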
Abstract:
Following the analyses designed by Jorge Beltrán Luna in the project "Aplicación de Inteligencia de Negocio a la Gestión Educativa" [Beltrán2014] on the behaviour of students at the Universidad Politécnica de Madrid in the subjects they took during the 2013-2014 academic year, the project's tutor concluded that a web application should be developed through which teachers could view these analyses and configure them with different parameters to suit the user's requirements. This project has fulfilled that objective. A web application has been developed that shows, through a web browser, the graphs and tables generated by the data mining program. The application offers the user several functions. First, through the form presented in the main interface, the user can request the results generated by the system with the parameters chosen by the designer of the analyses; these are the results the user would obtain by running the analyses from Jorge Beltrán Luna's project [Beltrán2014] directly in the RapidMiner tool. Second, the user can run the same analyses with modified configuration parameters in order to adapt them to the results sought; the outcome is the one that would be obtained in RapidMiner if the same parameters were changed there. Finally, a button lets the user recall the last analysis performed, so that the results can be viewed again without waiting for the analysis to run. A detailed explanation of the application of business intelligence in the educational field is also provided.
Abstract:
This Final Degree Project is one of the results of a project privately funded by Telefónica, consisting of the development and subsequent deployment of a data mining system for companies present on the Internet. It arises from a project that the research group AICU-LABS (Mercator) of the Universidad Politécnica de Madrid (UPM) has developed for Telefónica. Its main element is the development of web agents (also called software robots, "softbots", "crawlers" or "web spiders") able to obtain data about companies over the Internet, knowing only their CIF (VAT identification number). The list of companies is provided by Telefónica and consists of companies that are not currently Telefónica customers. Our mission is to provide the data required (mainly the phone number, email address and postal address of each company) to create a database of potential customers. To carry out this task, we have developed an application that, starting from the given CIF numbers, searches the web for information and extracts the data of interest. In addition, we have developed data validation systems that discard invalid data and classify the remaining data according to their quality, in order to maximize the quality of the data produced by the robot. Data are searched for both in online databases and, when they can be located, on the companies' own websites.
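A validation step of the kind described above might look like the following minimal sketch. The patterns, field names and quality labels are assumptions chosen for illustration (a 9-digit Spanish phone number starting with 6 to 9, a simple email shape, an address containing a number); the project's actual validation rules are not given in the abstract.

```python
import re

# Hypothetical patterns, not the project's real rules.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")
PHONE_RE = re.compile(r"^(?:\+34|0034)?[6-9]\d{8}$")


def normalize_phone(raw):
    """Drop spaces, dots, hyphens and parentheses from a phone string."""
    return re.sub(r"[\s().-]", "", raw)


def validate_record(record):
    """Return (cleaned_fields, quality) for one scraped company record.

    Fields that fail validation are discarded; quality reflects how many
    of the three target fields survived.
    """
    cleaned = {}
    phone = normalize_phone(record.get("phone", ""))
    if PHONE_RE.match(phone):
        cleaned["phone"] = phone
    email = record.get("email", "").strip().lower()
    if EMAIL_RE.match(email):
        cleaned["email"] = email
    address = " ".join(record.get("address", "").split())
    # Crude plausibility check: long enough and contains a street number.
    if len(address) >= 10 and any(ch.isdigit() for ch in address):
        cleaned["address"] = address
    quality = {3: "high", 2: "medium", 1: "low", 0: "discard"}[len(cleaned)]
    return cleaned, quality
```

Grading records instead of only accepting or rejecting them lets partially complete records remain usable while the best ones are ranked first.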
Abstract:
One challenge presented by large-scale genome sequencing efforts is effective display of uniform information to the scientific community. The Comprehensive Microbial Resource (CMR) contains robust annotation of all complete microbial genomes and allows for a wide variety of data retrievals. The bacterial information has been placed on the Web at http://www.tigr.org/CMR for retrieval using standard web browsing technology. Retrievals can be based on protein properties such as molecular weight or hydrophobicity, GC-content, functional role assignments and taxonomy. The CMR also has special web-based tools to allow data mining using pre-run homology searches, whole genome dot-plots, batch downloading and traversal across genomes using a variety of datatypes.
Abstract:
© The Author(s) 2014. Acknowledgements: We thank the Information Services Division, Scotland, who provided the SMR01 data, and NHS Grampian, who provided the biochemistry data. We also thank the University of Aberdeen’s Data Management Team. Funding: This work was supported by the Chief Scientists Office for Scotland (grant no. CZH/4/656).
Abstract:
The huge amount of data available on the Web needs to be organized in order to be accessible to users in real time. This paper presents a method for summarizing subjective texts based on the strength of the opinion expressed in them. We used a corpus of English blog posts and their corresponding comments (blog threads), structured around five topics; we divided the texts according to their polarity and then summarized them. Despite the difficulties of real Web data, the results obtained are encouraging: on average, 79% of the summaries are considered comprehensible. Our work allows users to obtain a summary of the most relevant opinions contained in a blog, which saves them time and lets them find information easily, allowing more effective searches on the Web.
Abstract:
This article presents the application and results of research on natural language processing techniques and semantic technology at Brand Rain and Anpro21. It describes all the projects related to these topics and presents the application and advantages of transferring the research and the newly developed technologies to Brand Rain, a reputation monitoring and calculation tool.
Abstract:
The environmental, cultural and socio-economic causes and consequences of farmland abandonment are issues of increasing concern for researchers and policy makers. In previous studies, we proposed a new methodology for selecting the driving factors in farmland abandonment processes. Using Data Mining and GIS, it is possible to select the variables most significantly related to abandonment. The aim of this study is to investigate the application of this methodology to finding relationships between relief and farmland abandonment in a Mediterranean region (SE Spain). We have taken into account up to 28 different variables in a single analysis, some of them commonly considered in land use change studies (slope, altitude, TWI, etc.), together with other novel variables (sky view factor, terrain view factor, etc.). The variable selection process provides results in line with previous knowledge of the study area, describing some processes that are region specific (e.g. abandonment versus intensification of agricultural activities). The European INSPIRE Directive (2007/2/EC) establishes that digital elevation models for land surfaces should be available in all member countries. This means that the research described in this work can be extrapolated to any European country to determine whether these variables (slope, altitude, etc.) are important in the process of abandonment.
Abstract:
Quantile computation has many applications including data mining and financial data analysis. It has been shown that an $\epsilon$-approximate summary can be maintained so that, given a quantile query $d(\phi, \epsilon)$, the data item at rank $\lceil \phi N \rceil$ may be approximately obtained within the rank error precision $\epsilon N$ over all $N$ data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different $\phi$ and $\epsilon$ poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
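The rank-error definition and the idea of answering several queries with one result can be sketched as follows. This is a simplified illustration, not the paper's algorithms: it assumes exact ranks are available (no streaming summary) and treats each query as the interval of acceptable rank fractions from phi - eps to phi + eps. Queries whose intervals share a common point can share one answer, and a greedy pass over intervals sorted by right endpoint yields the minimum number of such groups. All function names are hypothetical.

```python
import math


def exact_quantile(items, phi):
    """Return the item at 1-based rank ceil(phi * N) in sorted order."""
    ordered = sorted(items)
    rank = max(1, math.ceil(phi * len(ordered)))
    return ordered[rank - 1]


def is_eps_approximate(items, answer, phi, eps):
    """True if some rank of `answer` is within eps * N of rank ceil(phi * N)."""
    ordered = sorted(items)
    n = len(ordered)
    target = math.ceil(phi * n)
    ranks = [i + 1 for i, value in enumerate(ordered) if value == answer]
    return any(abs(rank - target) <= eps * n for rank in ranks)


def cluster_queries(queries):
    """Group (phi, eps) queries so each group shares one rank fraction.

    Returns a list of (shared_fraction, members). Greedy interval stabbing:
    take the right endpoint of the first uncovered interval as the shared
    point and absorb every later interval that contains it.
    """
    clusters = []
    point = None
    for phi, eps in sorted(queries, key=lambda q: q[0] + q[1]):
        if point is None or phi - eps > point:
            point = phi + eps
            clusters.append((point, [(phi, eps)]))
        else:
            clusters[-1][1].append((phi, eps))
    return clusters
```

In a streaming setting the summary itself carries a rank error, so the usable interval for each query would be narrower than in this exact-rank sketch.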
Abstract:
There has been an increased demand for characterizing user access patterns using web mining techniques, since the knowledge extracted from web server log files can benefit not only the improvement of web site structure but also the understanding of user navigational behavior. In this paper, we present a web usage mining method which utilizes web user usage and page linkage information to capture user access patterns based on the Probabilistic Latent Semantic Analysis (PLSA) model. A specific probabilistic model analysis algorithm, the EM algorithm, is applied to the integrated usage data to infer the latent semantic factors and to generate user session clusters that reveal user access patterns. Experiments have been conducted on a real-world data set to validate the effectiveness of the proposed approach. The results show that the presented method is capable of characterizing the latent semantic factors and generating user profiles in terms of weighted page vectors, which may reflect the common access interest exhibited by users in the same session cluster.
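The EM fitting of a PLSA model can be sketched on a small session-by-page count matrix. This is a minimal sketch under simplifying assumptions: it uses only usage counts (the page linkage information the paper integrates is omitted), random initialization, and a fixed number of iterations. The function names and the matrix layout are illustrative, not the authors' implementation.

```python
import math
import random


def _normalized(values):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z), P(session|z), P(page|z) by EM on counts[session][page].

    Returns (p_z, p_s_z, p_p_z, log_likelihoods), where log_likelihoods[t]
    is the data log-likelihood under the parameters used in iteration t.
    """
    rng = random.Random(seed)
    n_s, n_p = len(counts), len(counts[0])
    p_z = [1.0 / n_topics] * n_topics
    p_s_z = [_normalized([rng.random() + 0.1 for _ in range(n_s)]) for _ in range(n_topics)]
    p_p_z = [_normalized([rng.random() + 0.1 for _ in range(n_p)]) for _ in range(n_topics)]
    log_likelihoods = []
    for _ in range(n_iter):
        new_z = [0.0] * n_topics
        new_s = [[0.0] * n_s for _ in range(n_topics)]
        new_p = [[0.0] * n_p for _ in range(n_topics)]
        log_likelihood = 0.0
        for s in range(n_s):
            for p in range(n_p):
                if counts[s][p] == 0:
                    continue
                # E-step: posterior of each latent factor for this (session, page).
                joint = [p_z[z] * p_s_z[z][s] * p_p_z[z][p] for z in range(n_topics)]
                total = sum(joint)
                if total == 0:
                    continue
                log_likelihood += counts[s][p] * math.log(total)
                for z in range(n_topics):
                    weight = counts[s][p] * joint[z] / total
                    new_z[z] += weight
                    new_s[z][s] += weight
                    new_p[z][p] += weight
        # M-step: re-estimate every distribution from the expected counts.
        p_z = _normalized(new_z)
        p_s_z = [_normalized(row) for row in new_s]
        p_p_z = [_normalized(row) for row in new_p]
        log_likelihoods.append(log_likelihood)
    return p_z, p_s_z, p_p_z, log_likelihoods


def session_clusters(p_z, p_s_z):
    """Assign each session to the latent factor with the largest P(z) * P(session|z)."""
    n_s = len(p_s_z[0])
    return [max(range(len(p_z)), key=lambda z: p_z[z] * p_s_z[z][s]) for s in range(n_s)]
```

Each row of `p_p_z` is a weighted page vector for one latent factor, which corresponds to the kind of user profile the abstract describes; `session_clusters` groups sessions by their dominant factor.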
Abstract:
Identifying and predicting non-technical losses (NTL) are important tasks for many utilities. Data from a customer information system (CIS) can be used for NTL analysis. However, to perform NTL analysis accurately and efficiently, the original CIS data need to be pre-processed before any detailed analysis can be carried out. In this paper, we propose a feature-selection-based method for CIS data pre-processing in order to extract the most relevant information for further analysis such as clustering and classification. By removing irrelevant and redundant features, feature selection is an essential step in the data mining process: it finds an optimal subset of features that improves the quality of the results, giving faster processing, higher accuracy and simpler results with fewer features. A detailed feature selection analysis is presented in the paper. Both time-domain and load shape data are compared based on the accuracy, consistency and statistical dependencies between features.
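Removing irrelevant and redundant features can be illustrated with a simple correlation filter. This is a minimal sketch of one common filter approach, assumed here for illustration and not the specific method evaluated in the paper: features weakly correlated with the target are treated as irrelevant, and features strongly correlated with an already selected feature are treated as redundant. The thresholds and names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def select_features(columns, target, min_relevance=0.2, max_redundancy=0.9):
    """Pick feature names from `columns` (name -> list of values).

    Features are visited from most to least relevant. A feature is kept when
    its absolute correlation with the target reaches min_relevance and its
    absolute correlation with every kept feature stays within max_redundancy.
    """
    relevance = {name: abs(pearson(values, target)) for name, values in columns.items()}
    selected = []
    for name in sorted(columns, key=lambda key: relevance[key], reverse=True):
        if relevance[name] < min_relevance:
            break  # all remaining features are even less relevant
        if all(abs(pearson(columns[name], columns[kept])) <= max_redundancy for kept in selected):
            selected.append(name)
    return selected
```

Pearson correlation captures only linear dependence, so a filter based on another dependency measure may be more appropriate for load shape data.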
Abstract:
Hierarchical visualization systems are desirable because a single two-dimensional visualization plot may not be sufficient to capture all of the interesting aspects of complex high-dimensional data sets. We extend an existing locally linear hierarchical visualization system PhiVis [1] in several directions: (1) we allow for non-linear projection manifolds (the basic building block is the Generative Topographic Mapping, GTM), (2) we introduce a general formulation of hierarchical probabilistic models consisting of local probabilistic models organized in a hierarchical tree, (3) we describe folding patterns of the low-dimensional projection manifold in the high-dimensional data space by computing and visualizing the manifold's local directional curvatures. Quantities such as magnification factors [3] and directional curvatures are helpful for understanding the layout of the nonlinear projection manifold in the data space and for further refinement of the hierarchical visualization plot. Like PhiVis, our system is statistically principled and is built interactively in a top-down fashion using the EM algorithm. We demonstrate the principle of the approach on a complex 12-dimensional data set and mention possible applications in the pharmaceutical industry.
Abstract:
Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. miniDVMS v1.8 provides a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain with visual techniques developed in the information visualisation domain. The advantage of this interface is that the user is directly involved in the data mining process. Principled projection methods, such as generative topographic mapping (GTM) and hierarchical GTM (HGTM), are integrated with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, and user interaction facilities, to provide this integrated visual data mining framework. The software also supports conventional visualisation techniques such as principal component analysis (PCA), Neuroscale, and PhiVis. This user manual gives an overview of the purpose of the software tool, highlights some of the issues to take care of while creating a new model, and provides information about how to install and use the tool. The user manual does not require readers to be familiar with the algorithms the tool implements. Basic computing skills are enough to operate the software.
Abstract:
Recent developments in service-oriented and distributed computing have created exciting opportunities for the integration of models in service chains to create the Model Web. This offers the potential for orchestrating web data and processing services in complex chains, a flexible approach which exploits the increased access to products and tools and the scalability offered by the Web. However, the uncertainty inherent in data and models must be quantified and communicated in an interoperable way so that its effects can be assessed as errors propagate through complex automated model chains. We describe a proposed set of tools for handling, characterizing and communicating uncertainty in this context, and show how they can be used to 'uncertainty-enable' Web Services in a model chain. An example implementation is presented, which combines environmental and publicly-contributed data to produce estimates of sea-level air pressure, with estimates of uncertainty which incorporate the effects of model approximation as well as the uncertainty inherent in the observational and derived data.
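How uncertainty propagates through a chain of models can be illustrated with a small Monte Carlo sketch. It assumes a single normally distributed input and models written as plain functions that may add their own approximation noise; this convention and all names are assumptions for illustration and do not represent the tools or Web Service interfaces proposed in the paper.

```python
import random
import statistics


def propagate(models, input_mean, input_std, n_samples=20000, seed=0):
    """Push samples of an uncertain input through a chain of models.

    Each model is a callable model(value, rng) -> value; it may use rng to
    add noise that represents its own approximation error. Returns the mean
    and standard deviation of the chain's output.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        # Sample the observation according to its stated uncertainty.
        value = rng.gauss(input_mean, input_std)
        for model in models:
            value = model(value, rng)
        outputs.append(value)
    return statistics.mean(outputs), statistics.stdev(outputs)


def calibrate(value, rng):
    """Hypothetical first model: a deterministic linear correction."""
    return 2.0 * value + 1.0


def interpolate(value, rng):
    """Hypothetical second model: scaling plus model-approximation noise."""
    return 3.0 * value + rng.gauss(0.0, 0.3)
```

For example, `propagate([calibrate, interpolate], 10.0, 0.5)` returns an output mean and standard deviation that combine the input uncertainty with the second model's approximation error. Sampling is only one option; a real chain would also need an interoperable way to describe the distributions exchanged between services.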
Abstract:
In this paper, a co-operative distributed process mining system (CDPMS) is developed to streamline the workflow along the supply chain in order to offer shorter delivery times, more flexibility and higher customer satisfaction with learning ability. The proposed system is equipped with a ‘distributed process mining’ feature which is used to discover the hidden relationships among working decisions in a distributed manner. This method incorporates the concepts of data mining and knowledge refinement into the decision-making process to ensure ‘doing the right things’ within the workflow. An example implementation is given, based on the case of a slider manufacturer.