788 resultados para data mining applications
Resumo:
The Microarray technique is rather powerful, as it allows to test up thousands of genes at a time, but this produces an overwhelming set of data files containing huge amounts of data, which is quite difficult to pre-process, separate, classify and correlate for interesting conclusions to be extracted. Modern machine learning, data mining and clustering techniques based on information theory, are needed to read and interpret the information contents buried in those large data sets. Independent Component Analysis method can be used to correct the data affected by corruption processes or to filter the uncorrectable one and then clustering methods can group similar genes or classify samples. In this paper a hybrid approach is used to obtain a two way unsupervised clustering for a corrected microarray data.
Resumo:
Sensor networks are increasingly becoming one of the main sources of Big Data on the Web. However, the observations that they produce are made available with heterogeneous schemas, vocabularies and data formats, making it difficult to share and reuse these data for other purposes than those for which they were originally set up. In this thesis we address these challenges, considering how we can transform streaming raw data to rich ontology-based information that is accessible through continuous queries for streaming data. Our main contribution is an ontology-based approach for providing data access and query capabilities to streaming data sources, allowing users to express their needs at a conceptual level, independent of implementation and language-specific details. We introduce novel query rewriting and data translation techniques that rely on mapping definitions relating streaming data models to ontological concepts. Specific contributions include: • The syntax and semantics of the SPARQLStream query language for ontologybased data access, and a query rewriting approach for transforming SPARQLStream queries into streaming algebra expressions. • The design of an ontology-based streaming data access engine that can internally reuse an existing data stream engine, complex event processor or sensor middleware, using R2RML mappings for defining relationships between streaming data models and ontology concepts. Concerning the sensor metadata of such streaming data sources, we have investigated how we can use raw measurements to characterize streaming data, producing enriched data descriptions in terms of ontological models. Our specific contributions are: • A representation of sensor data time series that captures gradient information that is useful to characterize types of sensor data. • A method for classifying sensor data time series and determining the type of data, using data mining techniques, and a method for extracting semantic sensor metadata features from the time series.
Resumo:
In the last decade, complex networks have widely been applied to the study of many natural and man-made systems, and to the extraction of meaningful information from the interaction structures created by genes and proteins. Nevertheless, less attention has been devoted to metabonomics, due to the lack of a natural network representation of spectral data. Here we define a technique for reconstructing networks from spectral data sets, where nodes represent spectral bins, and pairs of them are connected when their intensities follow a pattern associated with a disease. The structural analysis of the resulting network can then be used to feed standard data-mining algorithms, for instance for the classification of new (unlabeled) subjects. Furthermore, we show how the structure of the network is resilient to the presence of external additive noise, and how it can be used to extract relevant knowledge about the development of the disease.
Resumo:
Nanotechnology represents an area of particular promise and significant opportunity across multiple scientific disciplines. Ongoing nanotechnology research ranges from the characterization of nanoparticles and nanomaterials to the analysis and processing of experimental data seeking correlations between nanoparticles and their functionalities and side effects. Due to their special properties, nanoparticles are suitable for cellular-level diagnostics and therapy, offering numerous applications in medicine, e.g. development of biomedical devices, tissue repair, drug delivery systems and biosensors. In nanomedicine, recent studies are producing large amounts of structural and property data, highlighting the role for computational approaches in information management. While in vitro and in vivo assays are expensive, the cost of computing is falling. Furthermore, improvements in the accuracy of computational methods (e.g. data mining, knowledge discovery, modeling and simulation) have enabled effective tools to automate the extraction, management and storage of these vast data volumes. Since this information is widely distributed, one major issue is how to locate and access data where it resides (which also poses data-sharing limitations). The novel discipline of nanoinformatics addresses the information challenges related to nanotechnology research. In this paper, we summarize the needs and challenges in the field and present an overview of extant initiatives and efforts.
Resumo:
Los avances en el hardware permiten disponer de grandes volúmenes de datos, surgiendo aplicaciones que deben suministrar información en tiempo cuasi-real, la monitorización de pacientes, ej., el seguimiento sanitario de las conducciones de agua, etc. Las necesidades de estas aplicaciones hacen emerger el modelo de flujo de datos (data streaming) frente al modelo almacenar-para-despuésprocesar (store-then-process). Mientras que en el modelo store-then-process, los datos son almacenados para ser posteriormente consultados; en los sistemas de streaming, los datos son procesados a su llegada al sistema, produciendo respuestas continuas sin llegar a almacenarse. Esta nueva visión impone desafíos para el procesamiento de datos al vuelo: 1) las respuestas deben producirse de manera continua cada vez que nuevos datos llegan al sistema; 2) los datos son accedidos solo una vez y, generalmente, no son almacenados en su totalidad; y 3) el tiempo de procesamiento por dato para producir una respuesta debe ser bajo. Aunque existen dos modelos para el cómputo de respuestas continuas, el modelo evolutivo y el de ventana deslizante; éste segundo se ajusta mejor en ciertas aplicaciones al considerar únicamente los datos recibidos más recientemente, en lugar de todo el histórico de datos. En los últimos años, la minería de datos en streaming se ha centrado en el modelo evolutivo. Mientras que, en el modelo de ventana deslizante, el trabajo presentado es más reducido ya que estos algoritmos no sólo deben de ser incrementales si no que deben borrar la información que caduca por el deslizamiento de la ventana manteniendo los anteriores tres desafíos. Una de las tareas fundamentales en minería de datos es la búsqueda de agrupaciones donde, dado un conjunto de datos, el objetivo es encontrar grupos representativos, de manera que se tenga una descripción sintética del conjunto. Estas agrupaciones son fundamentales en aplicaciones como la detección de intrusos en la red o la segmentación de clientes en el marketing y la publicidad. Debido a las cantidades masivas de datos que deben procesarse en este tipo de aplicaciones (millones de eventos por segundo), las soluciones centralizadas puede ser incapaz de hacer frente a las restricciones de tiempo de procesamiento, por lo que deben recurrir a descartar datos durante los picos de carga. Para evitar esta perdida de datos, se impone el procesamiento distribuido de streams, en concreto, los algoritmos de agrupamiento deben ser adaptados para este tipo de entornos, en los que los datos están distribuidos. En streaming, la investigación no solo se centra en el diseño para tareas generales, como la agrupación, sino también en la búsqueda de nuevos enfoques que se adapten mejor a escenarios particulares. Como ejemplo, un mecanismo de agrupación ad-hoc resulta ser más adecuado para la defensa contra la denegación de servicio distribuida (Distributed Denial of Services, DDoS) que el problema tradicional de k-medias. En esta tesis se pretende contribuir en el problema agrupamiento en streaming tanto en entornos centralizados y distribuidos. Hemos diseñado un algoritmo centralizado de clustering mostrando las capacidades para descubrir agrupaciones de alta calidad en bajo tiempo frente a otras soluciones del estado del arte, en una amplia evaluación. Además, se ha trabajado sobre una estructura que reduce notablemente el espacio de memoria necesario, controlando, en todo momento, el error de los cómputos. Nuestro trabajo también proporciona dos protocolos de distribución del cómputo de agrupaciones. Se han analizado dos características fundamentales: el impacto sobre la calidad del clustering al realizar el cómputo distribuido y las condiciones necesarias para la reducción del tiempo de procesamiento frente a la solución centralizada. Finalmente, hemos desarrollado un entorno para la detección de ataques DDoS basado en agrupaciones. En este último caso, se ha caracterizado el tipo de ataques detectados y se ha desarrollado una evaluación sobre la eficiencia y eficacia de la mitigación del impacto del ataque. ABSTRACT Advances in hardware allow to collect huge volumes of data emerging applications that must provide information in near-real time, e.g., patient monitoring, health monitoring of water pipes, etc. The data streaming model emerges to comply with these applications overcoming the traditional store-then-process model. With the store-then-process model, data is stored before being consulted; while, in streaming, data are processed on the fly producing continuous responses. The challenges of streaming for processing data on the fly are the following: 1) responses must be produced continuously whenever new data arrives in the system; 2) data is accessed only once and is generally not maintained in its entirety, and 3) data processing time to produce a response should be low. Two models exist to compute continuous responses: the evolving model and the sliding window model; the latter fits best with applications must be computed over the most recently data rather than all the previous data. In recent years, research in the context of data stream mining has focused mainly on the evolving model. In the sliding window model, the work presented is smaller since these algorithms must be incremental and they must delete the information which expires when the window slides. Clustering is one of the fundamental techniques of data mining and is used to analyze data sets in order to find representative groups that provide a concise description of the data being processed. Clustering is critical in applications such as network intrusion detection or customer segmentation in marketing and advertising. Due to the huge amount of data that must be processed by such applications (up to millions of events per second), centralized solutions are usually unable to cope with timing restrictions and recur to shedding techniques where data is discarded during load peaks. To avoid discarding of data, processing of streams (such as clustering) must be distributed and adapted to environments where information is distributed. In streaming, research does not only focus on designing for general tasks, such as clustering, but also in finding new approaches that fit bests with particular scenarios. As an example, an ad-hoc grouping mechanism turns out to be more adequate than k-means for defense against Distributed Denial of Service (DDoS). This thesis contributes to the data stream mining clustering technique both for centralized and distributed environments. We present a centralized clustering algorithm showing capabilities to discover clusters of high quality in low time and we provide a comparison with existing state of the art solutions. We have worked on a data structure that significantly reduces memory requirements while controlling the error of the clusters statistics. We also provide two distributed clustering protocols. We focus on the analysis of two key features: the impact on the clustering quality when computation is distributed and the requirements for reducing the processing time compared to the centralized solution. Finally, with respect to ad-hoc grouping techniques, we have developed a DDoS detection framework based on clustering.We have characterized the attacks detected and we have evaluated the efficiency and effectiveness of mitigating the attack impact.
Resumo:
We propose the Route-back Delivery (RBD) protocol; a routing mechanism to create reverse routes exploiting the Collection Tree Protocol to allow unicast data dissemination from the sink. The main goal of this work is to provide a mechanism to enable bi-directional communications among the root(s) and specific sensor nodes in data gathering applications that does not use broadcast only mechanisms. The main objective of the root-to-remote-nodes route creation is to disseminate short messages to change application parameters in a unicast fashion. This facilitates remote configurability in heterogeneous WSN deployments.
Resumo:
Clinicians could model the brain injury of a patient through his brain activity. However, how this model is defined and how it changes when the patient is recovering are questions yet unanswered. In this paper, the use of MedVir framework is proposed with the aim of answering these questions. Based on complex data mining techniques, this provides not only the differentiation between TBI patients and control subjects (with a 72% of accuracy using 0.632 Bootstrap validation), but also the ability to detect whether a patient may recover or not, and all of that in a quick and easy way through a visualization technique which allows interaction.
Resumo:
The mobile apps market is a tremendous success, with millions of apps downloaded and used every day by users spread all around the world. For apps’ developers, having their apps published on one of the major app stores (e.g. Google Play market) is just the beginning of the apps lifecycle. Indeed, in order to successfully compete with the other apps in the market, an app has to be updated frequently by adding new attractive features and by fixing existing bugs. Clearly, any developer interested in increasing the success of her app should try to implement features desired by the app’s users and to fix bugs affecting the user experience of many of them. A precious source of information to decide how to collect users’ opinions and wishes is represented by the reviews left by users on the store from which they downloaded the app. However, to exploit such information the app’s developer should manually read each user review and verify if it contains useful information (e.g. suggestions for new features). This is something not doable if the app receives hundreds of reviews per day, as happens for the very popular apps on the market. In this work, our aim is to provide support to mobile apps developers by proposing a novel approach exploiting data mining, natural language processing, machine learning, and clustering techniques in order to classify the user reviews on the basis of the information they contain (e.g. useless, suggestion for new features, bugs reporting). Such an approach has been empirically evaluated and made available in a web-‐based tool publicly available to all apps’ developers. The achieved results showed that the developed tool: (i) is able to correctly categorise user reviews on the basis of their content (e.g. isolating those reporting bugs) with 78% of accuracy, (ii) produces clusters of reviews (e.g. groups together reviews indicating exactly the same bug to be fixed) that are meaningful from a developer’s point-‐of-‐view, and (iii) is considered useful by a software company working in the mobile apps’ development market.
Resumo:
Poder clasificar de manera precisa la aplicación o programa del que provienen los flujos que conforman el tráfico de uso de Internet dentro de una red permite tanto a empresas como a organismos una útil herramienta de gestión de los recursos de sus redes, así como la posibilidad de establecer políticas de prohibición o priorización de tráfico específico. La proliferación de nuevas aplicaciones y de nuevas técnicas han dificultado el uso de valores conocidos (well-known) en puertos de aplicaciones proporcionados por la IANA (Internet Assigned Numbers Authority) para la detección de dichas aplicaciones. Las redes P2P (Peer to Peer), el uso de puertos no conocidos o aleatorios, y el enmascaramiento de tráfico de muchas aplicaciones en tráfico HTTP y HTTPS con el fin de atravesar firewalls y NATs (Network Address Translation), entre otros, crea la necesidad de nuevos métodos de detección de tráfico. El objetivo de este estudio es desarrollar una serie de prácticas que permitan realizar dicha tarea a través de técnicas que están más allá de la observación de puertos y otros valores conocidos. Existen una serie de metodologías como Deep Packet Inspection (DPI) que se basa en la búsqueda de firmas, signatures, en base a patrones creados por el contenido de los paquetes, incluido el payload, que caracterizan cada aplicación. Otras basadas en el aprendizaje automático de parámetros de los flujos, Machine Learning, que permite determinar mediante análisis estadísticos a qué aplicación pueden pertenecer dichos flujos y, por último, técnicas de carácter más heurístico basadas en la intuición o el conocimiento propio sobre tráfico de red. En concreto, se propone el uso de alguna de las técnicas anteriormente comentadas en conjunto con técnicas de minería de datos como son el Análisis de Componentes Principales (PCA por sus siglas en inglés) y Clustering de estadísticos extraídos de los flujos procedentes de ficheros de tráfico de red. Esto implicará la configuración de diversos parámetros que precisarán de un proceso iterativo de prueba y error que permita dar con una clasificación del tráfico fiable. El resultado ideal sería aquel en el que se pudiera identificar cada aplicación presente en el tráfico en un clúster distinto, o en clusters que agrupen grupos de aplicaciones de similar naturaleza. Para ello, se crearán capturas de tráfico dentro de un entorno controlado e identificando cada tráfico con su aplicación correspondiente, a continuación se extraerán los flujos de dichas capturas. Tras esto, parámetros determinados de los paquetes pertenecientes a dichos flujos serán obtenidos, como por ejemplo la fecha y hora de llagada o la longitud en octetos del paquete IP. Estos parámetros serán cargados en una base de datos MySQL y serán usados para obtener estadísticos que ayuden, en un siguiente paso, a realizar una clasificación de los flujos mediante minería de datos. Concretamente, se usarán las técnicas de PCA y clustering haciendo uso del software RapidMiner. Por último, los resultados obtenidos serán plasmados en una matriz de confusión que nos permitirá que sean valorados correctamente. ABSTRACT. Being able to classify the applications that generate the traffic flows in an Internet network allows companies and organisms to implement efficient resource management policies such as prohibition of specific applications or prioritization of certain application traffic, looking for an optimization of the available bandwidth. The proliferation of new applications and new technics in the last years has made it more difficult to use well-known values assigned by the IANA (Internet Assigned Numbers Authority), like UDP and TCP ports, to identify the traffic. Also, P2P networks and data encapsulation over HTTP and HTTPS traffic has increased the necessity to improve these traffic analysis technics. The aim of this project is to develop a number of techniques that make us able to classify the traffic with more than the simple observation of the well-known ports. There are some proposals that have been created to cover this necessity; Deep Packet Inspection (DPI) tries to find signatures in the packets reading the information contained in them, the payload, looking for patterns that can be used to characterize the applications to which that traffic belongs; Machine Learning procedures work with statistical analysis of the flows, trying to generate an automatic process that learns from those statistical parameters and calculate the likelihood of a flow pertaining to a certain application; Heuristic Techniques, finally, are based in the intuition or the knowledge of the researcher himself about the traffic being analyzed that can help him to characterize the traffic. Specifically, the use of some of the techniques previously mentioned in combination with data mining technics such as Principal Component Analysis (PCA) and Clustering (grouping) of the flows extracted from network traffic captures are proposed. An iterative process based in success and failure will be needed to configure these data mining techniques looking for a reliable traffic classification. The perfect result would be the one in which the traffic flows of each application is grouped correctly in each cluster or in clusters that contain group of applications of similar nature. To do this, network traffic captures will be created in a controlled environment in which every capture is classified and known to pertain to a specific application. Then, for each capture, all the flows will be extracted. These flows will be used to extract from them information such as date and arrival time or the IP length of the packets inside them. This information will be then loaded to a MySQL database where all the packets defining a flow will be classified and also, each flow will be assigned to its specific application. All the information obtained from the packets will be used to generate statistical parameters in order to describe each flow in the best possible way. After that, data mining techniques previously mentioned (PCA and Clustering) will be used on these parameters making use of the software RapidMiner. Finally, the results obtained from the data mining will be compared with the real classification of the flows that can be obtained from the database. A Confusion Matrix will be used for the comparison, letting us measure the veracity of the developed classification process.
Resumo:
© The Author(s) 2014. Acknowledgements We thank the Information Services Division, Scotland, who provided the SMR01 data, and NHS Grampian, who provided the biochemistry data. We also thank the University of Aberdeen’s Data Management Team. Funding This work was supported by the Chief Scientists Office for Scotland (grant no. CZH/4/656).
Resumo:
El campo de procesamiento de lenguaje natural (PLN), ha tenido un gran crecimiento en los últimos años; sus áreas de investigación incluyen: recuperación y extracción de información, minería de datos, traducción automática, sistemas de búsquedas de respuestas, generación de resúmenes automáticos, análisis de sentimientos, entre otras. En este artículo se presentan conceptos y algunas herramientas con el fin de contribuir al entendimiento del procesamiento de texto con técnicas de PLN, con el propósito de extraer información relevante que pueda ser usada en un gran rango de aplicaciones. Se pueden desarrollar clasificadores automáticos que permitan categorizar documentos y recomendar etiquetas; estos clasificadores deben ser independientes de la plataforma, fácilmente personalizables para poder ser integrados en diferentes proyectos y que sean capaces de aprender a partir de ejemplos. En el presente artículo se introducen estos algoritmos de clasificación, se analizan algunas herramientas de código abierto disponibles actualmente para llevar a cabo estas tareas y se comparan diversas implementaciones utilizando la métrica F en la evaluación de los clasificadores.
Resumo:
The environmental, cultural and socio-economic causes and consequences of farmland abandonment are issues of increasing concern for researchers and policy makers. In previous studies, we proposed a new methodology for selecting the driving factors in farmland abandonment processes. Using Data Mining and GIS, it is possible to select those variables which are more significantly related to abandonment. The aim of this study is to investigate the application of the above mentioned methodology for finding relationships between relief and farmland abandonment in a Mediterranean region (SE Spain).We have taken into account up to 28 different variables in a single analysis, some of them commonly considered in land use change studies (slope, altitude, TWI, etc), but also other novel variables have been evaluated (sky view factor, terrain view factor, etc). The variable selection process provides results in line with the previous knowledge of the study area, describing some processes that are region specific (e.g. abandonment versus intensification of the agricultural activities). The European INSPIRE Directive (2007/2/EC) establishes that the digital elevation models for land surfaces should be available in all member countries, this means that the research described in this work can be extrapolated to any European country to determine whether these variables (slope, altitude, etc) are important in the process of abandonment.
Resumo:
Version 3.1 of the ISRIC-WISE database holds selected site and horizon data for some 10 250 soil profiles from 149 countries. Profile data were extracted from a wide range of sources and harmonized with respect to the original (1974) and revised (1988) Legend of the FAO-Unesco Soil Map of the World. Profiles have been described, sampled, and analysed according to methods and standards in use in the originating countries; analytical results for the same property cannot always be compared directly; as a result the amount of measured data available for modelling is sometimes much less than expected. WISE was specifically developed for land-related applications at continental and global scales.