856 results for Data Driven Clustering
Abstract:
In this work we present a quality-driven approach to segment selection in DASH (Dynamic Adaptive Streaming over HTTP) under varying network conditions. Current adaptation algorithms focus largely on regulating data rates using network-layer parameters, selecting the offered quality level that eliminates buffer underrun without considering picture fidelity. In reality, viewers may accept a degree of buffer underrun in order to achieve improved picture fidelity. In this case, conventional DASH algorithms can cause extreme degradation of picture fidelity when attempting to eliminate buffer underrun under scarce bandwidth. Our work is concerned with a quality-aware rate adaptation scheme that maximizes the client's quality of experience in terms of both continuity and fidelity (picture quality). Results show that the proposed scheme can maintain a high level of quality for streaming services, especially at low packet loss rates. It is also shown that eliminating buffer underrun completely greatly reduces the PSNR, which reflects the picture quality of the video. Our scheme exposes the trade-off between continuity-based quality and resolution-based quality, which can be used to set threshold values for the level of quality desired by clients with different quality requirements. © 2013 IEEE.
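For illustration, the sketch below shows a generic quality-aware segment selection rule in the spirit of the scheme described above: instead of always dropping to the lowest bitrate when the buffer runs low, the client holds a fidelity floor and accepts a bounded risk of underrun. The function name and the thresholds (min_buffer_s, quality_floor_idx) are illustrative assumptions, not the paper's algorithm.

```python
def select_representation(bitrates_kbps, throughput_kbps, buffer_s,
                          min_buffer_s=4.0, quality_floor_idx=1):
    """Pick the index of the next segment's representation (bitrates sorted ascending)."""
    # Highest representation the measured throughput can sustain.
    sustainable = 0
    for i, rate in enumerate(bitrates_kbps):
        if rate <= throughput_kbps:
            sustainable = i
    if buffer_s >= min_buffer_s:
        return sustainable          # healthy buffer: follow the throughput estimate
    # Low buffer: a purely continuity-driven client would fall to the lowest
    # bitrate; the quality-aware client holds its fidelity floor instead,
    # accepting a bounded risk of buffer underrun.
    return quality_floor_idx

# Example: three representations, shrinking throughput, nearly empty buffer.
print(select_representation([500, 1500, 3000], throughput_kbps=700, buffer_s=2.0))
```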
Abstract:
The microarray technology provides a high-throughput technique to study gene expression. Microarrays can help us diagnose different types of cancers, understand biological processes, assess host responses to drugs and pathogens, find markers for specific diseases, and much more. Microarray experiments generate large amounts of data. Thus, effective data processing and analysis are critical for making reliable inferences from the data. The first part of the dissertation addresses the problem of finding an optimal set of genes (biomarkers) to classify a set of samples as diseased or normal. Three statistical gene selection methods (GS, GS-NR, and GS-PCA) were developed to identify a set of genes that best differentiate between samples. A comparative study of different classification tools was performed and the best combinations of gene selection and classifiers for multi-class cancer classification were identified. For most of the benchmark cancer data sets, the gene selection method proposed in this dissertation, GS, outperformed the other gene selection methods. The classifiers based on Random Forests, neural network ensembles, and K-nearest neighbor (KNN) showed consistently good performance. A striking commonality among these classifiers is that they all use a committee-based approach, suggesting that ensemble classification methods are superior. The same biological problem may be studied at different research labs and/or performed using different lab protocols or samples. In such situations, it is important to combine results from these efforts. The second part of the dissertation addresses the problem of pooling the results from different independent experiments to obtain improved results. Four statistical pooling techniques (Fisher inverse chi-square method, Logit method, Stouffer's Z transform method, and Liptak-Stouffer weighted Z-method) were investigated in this dissertation. These pooling techniques were applied to the problem of identifying cell cycle-regulated genes in two different yeast species. As a result, improved sets of cell cycle-regulated genes were identified. The last part of the dissertation explores the effectiveness of wavelet data transforms for the task of clustering. Discrete wavelet transforms, with an appropriate choice of wavelet bases, were shown to be effective in producing clusters that were biologically more meaningful.
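As a reference for the pooling step, the snippet below applies two of the named techniques, Fisher's inverse chi-square method and the (Liptak-)Stouffer weighted Z-method, to a hypothetical set of per-experiment p-values; it uses the textbook formulas rather than the dissertation's code.

```python
import numpy as np
from scipy.stats import chi2, norm

def fisher_pool(pvalues):
    """Fisher: X = -2 * sum(ln p_i) follows a chi-square with 2k degrees of freedom."""
    p = np.asarray(pvalues, dtype=float)
    stat = -2.0 * np.log(p).sum()
    return chi2.sf(stat, df=2 * p.size)

def stouffer_pool(pvalues, weights=None):
    """Stouffer / Liptak-Stouffer: Z = sum(w_i z_i) / sqrt(sum(w_i^2))."""
    p = np.asarray(pvalues, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    z = norm.isf(p)                                  # one-sided z-scores
    return norm.sf((w * z).sum() / np.sqrt((w ** 2).sum()))

# Example: the same gene tested in three independent experiments (invented p-values).
print(fisher_pool([0.04, 0.10, 0.03]))
print(stouffer_pool([0.04, 0.10, 0.03], weights=[1.0, 0.5, 1.0]))
```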
Abstract:
With the explosive growth in the volume and complexity of document data (e.g., news, blogs, web pages), it has become a necessity to semantically understand documents and deliver meaningful information to users. The areas dealing with these problems span data mining, information retrieval, and machine learning. For example, document clustering and summarization are two fundamental techniques for understanding document data and have attracted much attention in recent years. Given a collection of documents, document clustering aims to partition them into different groups to provide efficient document browsing and navigation mechanisms. One open problem in document clustering is how to generate a meaningful interpretation for each document cluster produced by the clustering process. Document summarization is another effective technique for document understanding, which generates a summary by selecting sentences that convey the major or topic-relevant information in the original documents. How to improve automatic summarization performance and how to apply it to newly emerging problems are two valuable research directions. To help people capture the semantics of documents effectively and efficiently, the dissertation focuses on developing effective data mining and machine learning algorithms and systems for (1) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretations, (2) improving document summarization performance and building document understanding systems to solve real-world applications, and (3) summarizing the differences and evolution of multiple document sources.
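A minimal sketch of the clustering-plus-interpretation idea, under the assumption that a cluster can be labeled by the texts closest to its centroid; the pipeline (TF-IDF, k-means, centroid ranking) is a generic stand-in, not the dissertation's algorithm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_and_summarize(docs, n_clusters=3, picks_per_cluster=2):
    """Cluster documents and return, per cluster, the texts nearest its centroid."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    summaries = {}
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # Rank the cluster's members by distance to the centroid.
        dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[c], axis=1)
        best = idx[np.argsort(dists)[:picks_per_cluster]]
        summaries[c] = [docs[i] for i in best]
    return km.labels_, summaries
```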
Abstract:
Due to rapid advances in computing and sensing technologies, enormous amounts of data are being generated every day in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets and discover hidden patterns. For both data mining and visualization to be effective, it is important to include visualization techniques in the mining process and to present the discovered patterns in a more comprehensive visual view. In this dissertation, four related problems are studied to explore the integration of data mining and data visualization: dimensionality reduction for visualizing high-dimensional datasets, visualization-based clustering evaluation, interactive document mining, and exploration of multiple clusterings. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high-dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems that incorporate user effort and generate customized summaries from 2D sentence layouts; and 4) propose a new framework that organizes different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.
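As an illustration of the feature selection step, the sketch below implements a generic greedy mRMR criterion (relevance to the label minus redundancy with already selected features) using scikit-learn's mutual information estimators; it omits the reliefF component and is not the dissertation's implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k=10):
    """Greedy minimum-redundancy maximum-relevance selection on a NumPy matrix X."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(n_features))
    while len(selected) < min(k, n_features):
        best_j, best_score = None, -np.inf
        for j in remaining:
            if selected:
                # Average mutual information with the features already chosen.
                redundancy = np.mean(
                    [mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                     for s in selected])
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```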
Abstract:
The accurate and reliable estimation of travel time based on point detector data is needed to support Intelligent Transportation System (ITS) applications. It has been found that the quality of travel time estimation is a function of the method used in the estimation and varies for different traffic conditions. In this study, two hybrid on-line travel time estimation models, and their corresponding off-line methods, were developed to achieve better estimation performance under various traffic conditions, including recurrent congestion and incidents. The first model combines the Mid-Point method, which is a speed-based method, with a traffic flow-based method. The second model integrates two speed-based methods: the Mid-Point method and the Minimum Speed method. In both models, the switch between travel time estimation methods is based on the congestion level and queue status automatically identified by clustering analysis. During incident conditions with rapidly changing queue lengths, shock wave analysis-based refinements are applied in on-line estimation to capture the fast queue propagation and recovery. Travel time estimates obtained from existing speed-based methods, traffic flow-based methods, and the developed models were tested using both simulation and real-world data. The results indicate that all tested methods performed at an acceptable level during periods of low congestion; however, their performance varies as congestion increases. Comparisons with other estimation methods also show that the developed hybrid models perform well in all cases. Further comparisons between the on-line and off-line travel time estimation methods reveal that off-line methods perform significantly better only during fast-changing congested conditions, such as during incidents. The impacts of major influential factors on the performance of travel time estimation, including data preprocessing procedures, detector errors, detector spacing, frequency of travel time updates to traveler information devices, travel time link length, and posted travel time range, were also investigated. The results show that these factors have more significant impacts on estimation accuracy and reliability under congested conditions than during uncongested conditions. For incident conditions, the estimation quality improves with the use of a short rolling period for data smoothing, more accurate detector data, and frequent travel time updates.
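For concreteness, a minimal sketch of the speed-based Mid-Point estimate mentioned above, under the common assumption that each detector's spot speed applies over half of the link between adjacent detectors; variable names are illustrative.

```python
def midpoint_travel_time(link_length_km, speed_upstream_kmh, speed_downstream_kmh):
    """Estimated travel time (hours) over one detector-to-detector link."""
    half = link_length_km / 2.0
    return half / speed_upstream_kmh + half / speed_downstream_kmh

def route_travel_time(link_lengths_km, spot_speeds_kmh):
    """Sum per-link estimates along a route instrumented with point detectors."""
    return sum(midpoint_travel_time(length, up, down)
               for length, (up, down) in zip(link_lengths_km,
                                             zip(spot_speeds_kmh, spot_speeds_kmh[1:])))

# Example: two links with three detectors reporting spot speeds (invented values).
print(route_travel_time([1.0, 1.5], [80.0, 60.0, 70.0]) * 60, "minutes")
```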
Abstract:
The rapid growth of virtualized data centers and cloud hosting services is making the management of physical resources such as CPU, memory, and I/O bandwidth in data center servers increasingly important. Server management now involves dealing with multiple dissimilar applications with varying Service-Level Agreements (SLAs) and multiple resource dimensions. The multiplicity and diversity of resources and applications are rendering administrative tasks more complex and challenging. This thesis aimed to develop a framework and techniques that would help substantially reduce data center management complexity. We specifically addressed two crucial data center operations. First, we precisely estimated the capacity requirements of client virtual machines (VMs) when renting server space in a cloud environment. Second, we proposed a systematic process to efficiently allocate physical resources to hosted VMs in a data center. To realize these dual objectives, accurately capturing the effects of resource allocations on application performance is vital. The benefits of accurate application performance modeling are manifold. Cloud users can size their VMs appropriately and pay only for the resources they need; service providers can also offer a new charging model based on the VMs' performance instead of their configured sizes. As a result, clients will pay exactly for the performance they actually experience; on the other hand, administrators will be able to maximize their total revenue by utilizing application performance models and SLAs. This thesis made the following contributions. First, we identified resource control parameters crucial for distributing physical resources and characterizing contention for virtualized applications in a shared hosting environment. Second, we explored several modeling techniques and confirmed the suitability of two machine learning tools, Artificial Neural Network and Support Vector Machine, for accurately modeling the performance of virtualized applications. Moreover, we suggested and evaluated modeling optimizations necessary to improve prediction accuracy when using these modeling tools. Third, we presented an approach to optimal VM sizing by employing the performance models we created. Finally, we proposed a revenue-driven resource allocation algorithm which maximizes the SLA-generated revenue for a data center.
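A toy sketch of the performance-modeling idea using a Support Vector Machine regressor: fit observed application performance as a function of allocated resources, then query the model when sizing a VM. The feature names and measurements are invented for illustration and are not the thesis's models or data.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rows: (cpu_cap_pct, memory_mb, io_share) allocated to a hosted VM;
# target: mean response time (ms) observed under that allocation (hypothetical).
X = np.array([[25, 512, 0.1], [50, 1024, 0.2], [75, 2048, 0.4], [100, 4096, 0.8]])
y = np.array([480.0, 230.0, 120.0, 90.0])

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=5.0))
model.fit(X, y)

# The fitted model can then drive VM sizing: predict performance for a
# candidate allocation before committing to it.
print(model.predict([[60, 1536, 0.3]]))
```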
Abstract:
The deployment of wireless communications coupled with the popularity of portable devices has led to significant research in the area of mobile data caching. Prior research has focused on the development of solutions that allow applications to run in wireless environments using proxy-based techniques. Most of these approaches are semantic-based and do not provide adequate support for representing the context of a user (i.e., the interpreted human intention). Although the context may be treated implicitly, it is still crucial to data management. To address this challenge, this dissertation focuses on two aspects: predicting (i) the future location of the user and (ii) the locations for which a fetched data item remains a valid answer to the query. Using this approach, more complete information about the dynamics of an application environment is maintained. The contribution of this dissertation is a novel data caching mechanism for pervasive computing environments that can adapt dynamically to a mobile user's context. In this dissertation, we design and develop a conceptual model and context-aware protocols for wireless data caching management. Our replacement policy uses the validity of the data fetched from the server and the neighboring locations to decide which of the cache entries is least likely to be needed in the future, and is therefore a good candidate for eviction when cache space is needed. The context-aware prefetching algorithm exploits the query context to effectively guide the prefetching process. The query context is defined using a mobile user's movement pattern and requested information context. Numerical results and simulations show that the proposed prefetching and replacement policies significantly outperform conventional ones. Anticipated applications of these solutions include biomedical engineering, tele-health, medical information systems, and business.
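A rough sketch, with invented names and geometry, of a location-aware eviction rule in the spirit described above: evict the cached answer whose region of validity lies farthest from the user's predicted next location. This is not the dissertation's policy, only an illustration of the idea.

```python
import math

def eviction_candidate(cache, predicted_location):
    """cache: dict mapping query -> ((center_x, center_y), validity_radius).
    Return the query whose valid region is farthest from the predicted location."""
    def distance_to_region(entry):
        (cx, cy), radius = entry
        px, py = predicted_location
        return max(0.0, math.hypot(px - cx, py - cy) - radius)
    return max(cache, key=lambda q: distance_to_region(cache[q]))

# Example with two cached answers and a predicted next location (invented values).
cache = {"q1": ((0.0, 0.0), 1.0), "q2": ((5.0, 5.0), 0.5)}
print(eviction_candidate(cache, predicted_location=(0.5, 0.5)))   # -> "q2"
```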
Abstract:
A major portion of hurricane-induced economic loss originates from damage to building structures. Damage to building structures is typically grouped into three main categories: exterior, interior, and contents damage. Although the latter two categories, in most cases, cause more than 50% of the total loss, little has been done to investigate the physical damage process and unveil the interdependence of interior damage parameters. Building interior and contents damage is mainly due to wind-driven rain (WDR) intrusion through building envelope defects, breaches, and other functional openings. The limited research and the resulting knowledge gaps are for the most part due to the complexity of damage phenomena during hurricanes and the lack of established measurement methodologies to quantify rainwater intrusion. This dissertation focuses on devising methodologies for large-scale experimental simulation of tropical cyclone WDR and measurement of rainwater intrusion to acquire benchmark test-based data for the development of a hurricane-induced building interior and contents damage model. Target WDR parameters derived from tropical cyclone rainfall data were used to simulate the WDR characteristics at the Wall of Wind (WOW) facility. The proposed WDR simulation methodology presents detailed procedures for the selection of the type and number of nozzles, formulated based on a tropical cyclone WDR study. The simulated WDR was later used to experimentally investigate the mechanisms of rainwater deposition/intrusion in buildings. A test-based dataset of two rainwater intrusion parameters that quantify the distribution of direct impinging raindrops and surface runoff rainwater over the building surface — the rain admittance factor (RAF) and the surface runoff coefficient (SRC), respectively — was developed using common shapes of low-rise buildings. The dataset was applied to a newly formulated WDR estimation model to predict the volume of rainwater ingress through envelope openings such as wall and roof deck breaches and window sill cracks. The validation of the new model using experimental data indicated reasonable estimation of rainwater ingress through envelope defects and breaches during tropical cyclones. The WDR estimation model and experimental dataset of WDR parameters developed in this dissertation can be used to enhance the prediction capabilities of existing interior damage models such as the Florida Public Hurricane Loss Model (FPHLM).
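Purely as an illustration of how RAF and SRC could enter an ingress calculation (the abstract does not give the model's actual form), a first-order sketch with invented numbers and an assumed functional form:

```python
def rainwater_ingress_liters(raf, src, wdr_intensity_lpm2, wall_area_m2,
                             opening_area_m2, duration_min):
    """Assumed first-order estimate: RAF scales the free-field WDR that impinges
    on the wall, SRC scales the share running off it, and the opening intercepts
    both in proportion to its area. Illustrative only, not the dissertation's model."""
    impinging = raf * wdr_intensity_lpm2 * wall_area_m2      # liters/min hitting the wall
    runoff = src * impinging                                 # liters/min running down it
    fraction = opening_area_m2 / wall_area_m2                # share reaching the opening
    return (impinging + runoff) * fraction * duration_min

print(rainwater_ingress_liters(raf=0.6, src=0.4, wdr_intensity_lpm2=2.0,
                               wall_area_m2=30.0, opening_area_m2=0.5,
                               duration_min=60))
```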
Abstract:
Online Social Network (OSN) services provided by Internet companies bring people together to chat and to share and consume information. Meanwhile, huge amounts of data are generated by these services (which can be regarded as social media) every day, every hour, even every minute and every second. Currently, researchers are interested in analyzing OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large scale of OSN data, it is difficult to analyze effectively. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components of social media data — users and user-generated content. Specifically, it aims at addressing three problems related to social media users and content: (1) how does one organize the users and the content? (2) how does one summarize the textual content so that users do not have to go over every post to capture the general idea? (3) how does one identify influential users in social media to benefit other applications, e.g., marketing campaigns? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze users and user-generated content from social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and the content. (3) It proposes multi-document summarization methods to extract core information from social network content. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.
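To illustrate the co-clustering idea of organizing users and content together, the sketch below uses scikit-learn's SpectralCoclustering on a small invented user-by-term matrix; it is flat rather than hierarchical and is not the dissertation's algorithm.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import SpectralCoclustering

# Hypothetical concatenated posts per user.
posts_per_user = {
    "alice": "election debate policy vote",
    "bob": "election vote campaign",
    "carol": "music concert guitar",
    "dave": "guitar band concert tour",
}
vec = CountVectorizer()
X = vec.fit_transform(posts_per_user.values())        # users x terms matrix

# Co-cluster users and terms simultaneously into two user/content groups.
model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
for user, label in zip(posts_per_user, model.row_labels_):
    print(user, "-> user/content cluster", label)
```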
Abstract:
In this study we investigate Pleistocene vegetation and climate change in southern East Africa by examining plant leaf waxes in a marine sediment core that receives terrestrial runoff from the Limpopo River. The plant leaf wax records are compared to a multi-proxy sea surface temperature (SST) record and pollen assemblage data from the same site. We find that Indian Ocean SST variability, driven by high-latitude obliquity, exerted a strong control on the vegetation of southern East Africa during the past 800,000 yr. Interglacial periods were characterized by relatively wetter and warmer conditions, increased contributions of C3 vegetation, and higher SST, whereas glacial periods were marked by cooler and arid conditions, increased contributions of C4 vegetation, and lower SST. We find that Marine Isotope Stages (MIS) 5e, 11c, 15e and 7a-7c are strongly expressed in the plant leaf wax records, but MIS 7e is absent while MIS 9 is rather weak. Our plant leaf wax records also capture the climate transition associated with the Mid-Brunhes Event (MBE), suggesting that the pre-MBE interval (430-800 ka) was characterized by higher inputs from grasses compared to relatively higher inputs from trees in the post-MBE interval (0-430 ka). Differences in vegetation and SST of southern East Africa between the pre- and post-MBE intervals appear to be related to shifts in the location of the Subtropical Front. Comparison with vegetation records from tropical East Africa indicates that the vegetation of southern East Africa, while exhibiting glacial-interglacial variability and notable differences between the pre- and post-MBE portions of the record, likely did not experience such dramatic extremes as occurred to the north at Lake Malawi.
Abstract:
The exponential growth of studies on the biological response to ocean acidification over the last few decades has generated a large amount of data. To facilitate data comparison, a data compilation hosted at the data publisher PANGAEA was initiated in 2008 and is updated on a regular basis (doi:10.1594/PANGAEA.149999). By January 2015, a total of 581 data sets (over 4 000 000 data points) from 539 papers had been archived. Here we present the developments of this data compilation five years after its first description by Nisumaa et al. (2010). Most of the study sites from which data have been archived are still in the Northern Hemisphere, and the number of archived data sets from studies in the Southern Hemisphere and polar oceans remains relatively low. Data from 60 studies that investigated the response of a mix of organisms or natural communities were all added after 2010, indicating a welcome shift from the study of individual organisms to communities and ecosystems. The initial imbalance of considerably more data archived on calcification and primary production than on other processes has improved. There is also a clear tendency towards more data archived from multifactorial studies after 2010. For easier and more effective access to ocean acidification data, the ocean acidification community is strongly encouraged to contribute to the data archiving effort, help develop standard vocabularies describing the variables, and define best practices for archiving ocean acidification data.
Abstract:
Acknowledgements: We thank Iain Malcolm of Marine Scotland Science for access to data from the Girnock and the Scottish Environment Protection Agency for historical stage-discharge relationships. CS contributions on this paper were in part supported by the NERC/JPI SIWA project (NE/M019896/1).
Abstract:
This article is protected by copyright. All rights reserved.
Acknowledgements: This study was funded by a BBSRC studentship (MAW) and NERC grants NE/H00775X/1 and NE/D000602/1 (SBP). The authors are grateful to Mario Röder and Keliya Bai for fieldwork assistance, and all estate owners, factors and keepers for access to field sites, most particularly MJ Taylor and Mike Nisbet (Airlie), Neil Brown (Allargue), RR Gledson and David Scrimgeour (Delnadamph), Andrew Salvesen and John Hay (Dinnet), Stuart Young and Derek Calder (Edinglassie), Kirsty Donald and David Busfield (Glen Dye), Neil Hogbin and Ab Taylor (Glen Muick), Alistair Mitchell (Glenlivet), Simon Blackett, Jim Davidson and Liam Donald (Invercauld), Richard Cooke and Fred Taylor† (Invermark), Shaila Rao and Christopher Murphy (Mar Lodge), and Ralph Peters and Philip Astor (Tillypronie).
Data accessibility:
• Genotype data (DataDryad: doi:10.5061/dryad.4t7jk)
• Metadata (information on sampling sites, phenotypes and medication regimen) (DataDryad: doi:10.5061/dryad.4t7jk)
Abstract:
Purpose: This paper extends the use of Radio Frequency Identification (RFID) data to the accounting of warehouse costs and services. The Time-Driven Activity-Based Costing (TDABC) methodology is enhanced with RFID data collected in real time about the duration of warehouse activities, allowing warehouse managers to obtain accurate and instant cost calculations. The RFID-enhanced TDABC (RFID-TDABC) is proposed as a novel application of RFID technology. Research Approach: RFID-TDABC is implemented on the warehouse processes of a case study company. The implementation covers receiving, put-away, order picking, and despatching. Findings and Originality: RFID technology is commonly used for the identification and tracking of items. The use of RFID-generated information together with TDABC can be successfully extended to the area of costing. This RFID-TDABC costing model will benefit warehouse managers with accurate and instant cost calculations. Research Impact: There are still unexplored benefits of RFID technology in its applications in warehousing and the wider supply chain. A multi-disciplinary research approach led to combining RFID technology and the TDABC accounting method in order to propose RFID-TDABC. Combining methods and theories from different fields with RFID may lead researchers to develop new techniques such as the RFID-TDABC presented in this paper. Practical Impact: The RFID-TDABC concept will be of value to practitioners by showing how warehouse costs can be accurately measured using this approach. A better understanding of incurred costs may result in further optimisation of warehousing operations, lowering the costs of activities, and thus providing competitive pricing to customers. RFID-TDABC can be applied in the wider supply chain.
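A hedged sketch of the TDABC arithmetic that RFID-TDABC builds on: each activity's cost is its duration multiplied by a capacity cost rate, with the durations here imagined as coming from RFID timestamps. All figures are invented.

```python
# Capacity cost rate: total resource cost divided by practical capacity (per minute).
CAPACITY_COST_RATE_PER_MIN = 0.80

# Hypothetical RFID-derived activity durations (minutes) for one handled pallet.
activity_durations_min = {
    "receiving": 6.5,
    "put-away": 4.0,
    "order picking": 9.2,
    "despatching": 5.1,
}

# TDABC: cost of each activity = duration x capacity cost rate.
costs = {activity: round(minutes * CAPACITY_COST_RATE_PER_MIN, 2)
         for activity, minutes in activity_durations_min.items()}
print(costs)
print("total cost per pallet:", round(sum(costs.values()), 2))
```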
Abstract:
With the popularization of GPS-enabled devices such as mobile phones, location data are becoming available at an unprecedented scale. The locations may be collected from many different sources such as vehicles moving around a city, user check-ins in social networks, and geo-tagged micro-blogging photos or messages. Besides the longitude and latitude, each location record may also have a timestamp and additional information such as the name of the location. Time-ordered sequences of these locations form trajectories, which together contain useful high-level information about people's movement patterns.
The first part of this thesis focuses on a few geometric problems motivated by the matching and clustering of trajectories. We first give a new algorithm for computing a matching between a pair of curves under existing models such as dynamic time warping (DTW). The algorithm is more efficient than standard dynamic programming algorithms both theoretically and practically. We then propose a new matching model for trajectories that avoids the drawbacks of existing models. For trajectory clustering, we present an algorithm that computes clusters of subtrajectories, which correspond to common movement patterns. We also consider trajectories of check-ins, and propose a statistical generative model, which identifies check-in clusters as well as the transition patterns between the clusters.
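For reference, the standard O(nm) dynamic-programming formulation of DTW that the thesis improves upon, for two planar trajectories given as lists of (x, y) points; this is the baseline, not the thesis's faster algorithm.

```python
import math

def dtw(traj_a, traj_b):
    """Classic dynamic time warping distance between two point sequences."""
    n, m = len(traj_a), len(traj_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(traj_a[i - 1], traj_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in A only
                                 cost[i][j - 1],      # advance in B only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

print(dtw([(0, 0), (1, 1), (2, 2)], [(0, 0), (2, 2)]))
```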
The second part of the thesis considers the problem of covering shortest paths in a road network, motivated by an EV charging station placement problem. More specifically, a subset of vertices in the road network are selected to place charging stations so that every shortest path contains enough charging stations and can be traveled by an EV without draining the battery. We first introduce a general technique for the geometric set cover problem. This technique leads to near-linear-time approximation algorithms, which are the state-of-the-art algorithms for this problem in either running time or approximation ratio. We then use this technique to develop a near-linear-time algorithm for this shortest-path cover problem.
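For orientation, the classic greedy set-cover approximation (not the thesis's near-linear-time geometric technique), phrased in terms of choosing charging-station vertices to cover shortest paths; the data structures and example are illustrative.

```python
def greedy_set_cover(paths_to_cover, candidate_vertices):
    """paths_to_cover: set of path ids.
    candidate_vertices: dict mapping a vertex to the set of path ids it lies on.
    Returns a list of chosen vertices covering the paths (greedy approximation)."""
    uncovered = set(paths_to_cover)
    chosen = []
    while uncovered:
        # Pick the vertex covering the most still-uncovered shortest paths.
        best = max(candidate_vertices,
                   key=lambda v: len(candidate_vertices[v] & uncovered))
        if not candidate_vertices[best] & uncovered:
            break   # remaining paths cannot be covered by any candidate
        chosen.append(best)
        uncovered -= candidate_vertices[best]
    return chosen

# Example: three shortest paths, three candidate station locations (invented).
print(greedy_set_cover({"p1", "p2", "p3"},
                       {"a": {"p1", "p2"}, "b": {"p2", "p3"}, "c": {"p3"}}))
```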