845 resultados para Sign Data LMS algorithm.
Resumo:
Modern data centers host hundreds of thousands of servers to achieve economies of scale. Such a huge number of servers create challenges for the data center network (DCN) to provide proportionally large bandwidth. In addition, the deployment of virtual machines (VMs) in data centers raises the requirements for efficient resource allocation and find-grained resource sharing. Further, the large number of servers and switches in the data center consume significant amounts of energy. Even though servers become more energy efficient with various energy saving techniques, DCN still accounts for 20% to 50% of the energy consumed by the entire data center. The objective of this dissertation is to enhance DCN performance as well as its energy efficiency by conducting optimizations on both host and network sides. First, as the DCN demands huge bisection bandwidth to interconnect all the servers, we propose a parallel packet switch (PPS) architecture that directly processes variable length packets without segmentation-and-reassembly (SAR). The proposed PPS achieves large bandwidth by combining switching capacities of multiple fabrics, and it further improves the switch throughput by avoiding padding bits in SAR. Second, since certain resource demands of the VM are bursty and demonstrate stochastic nature, to satisfy both deterministic and stochastic demands in VM placement, we propose the Max-Min Multidimensional Stochastic Bin Packing (M3SBP) algorithm. M3SBP calculates an equivalent deterministic value for the stochastic demands, and maximizes the minimum resource utilization ratio of each server. Third, to provide necessary traffic isolation for VMs that share the same physical network adapter, we propose the Flow-level Bandwidth Provisioning (FBP) algorithm. By reducing the flow scheduling problem to multiple stages of packet queuing problems, FBP guarantees the provisioned bandwidth and delay performance for each flow. Finally, while DCNs are typically provisioned with full bisection bandwidth, DCN traffic demonstrates fluctuating patterns, we propose a joint host-network optimization scheme to enhance the energy efficiency of DCNs during off-peak traffic hours. The proposed scheme utilizes a unified representation method that converts the VM placement problem to a routing problem and employs depth-first and best-fit search to find efficient paths for flows.
Resumo:
The rapid growth of virtualized data centers and cloud hosting services is making the management of physical resources such as CPU, memory, and I/O bandwidth in data center servers increasingly important. Server management now involves dealing with multiple dissimilar applications with varying Service-Level-Agreements (SLAs) and multiple resource dimensions. The multiplicity and diversity of resources and applications are rendering administrative tasks more complex and challenging. This thesis aimed to develop a framework and techniques that would help substantially reduce data center management complexity. We specifically addressed two crucial data center operations. First, we precisely estimated capacity requirements of client virtual machines (VMs) while renting server space in cloud environment. Second, we proposed a systematic process to efficiently allocate physical resources to hosted VMs in a data center. To realize these dual objectives, accurately capturing the effects of resource allocations on application performance is vital. The benefits of accurate application performance modeling are multifold. Cloud users can size their VMs appropriately and pay only for the resources that they need; service providers can also offer a new charging model based on the VMs performance instead of their configured sizes. As a result, clients will pay exactly for the performance they are actually experiencing; on the other hand, administrators will be able to maximize their total revenue by utilizing application performance models and SLAs. This thesis made the following contributions. First, we identified resource control parameters crucial for distributing physical resources and characterizing contention for virtualized applications in a shared hosting environment. Second, we explored several modeling techniques and confirmed the suitability of two machine learning tools, Artificial Neural Network and Support Vector Machine, to accurately model the performance of virtualized applications. Moreover, we suggested and evaluated modeling optimizations necessary to improve prediction accuracy when using these modeling tools. Third, we presented an approach to optimal VM sizing by employing the performance models we created. Finally, we proposed a revenue-driven resource allocation algorithm which maximizes the SLA-generated revenue for a data center.
Resumo:
Thanks to the advanced technologies and social networks that allow the data to be widely shared among the Internet, there is an explosion of pervasive multimedia data, generating high demands of multimedia services and applications in various areas for people to easily access and manage multimedia data. Towards such demands, multimedia big data analysis has become an emerging hot topic in both industry and academia, which ranges from basic infrastructure, management, search, and mining to security, privacy, and applications. Within the scope of this dissertation, a multimedia big data analysis framework is proposed for semantic information management and retrieval with a focus on rare event detection in videos. The proposed framework is able to explore hidden semantic feature groups in multimedia data and incorporate temporal semantics, especially for video event detection. First, a hierarchical semantic data representation is presented to alleviate the semantic gap issue, and the Hidden Coherent Feature Group (HCFG) analysis method is proposed to capture the correlation between features and separate the original feature set into semantic groups, seamlessly integrating multimedia data in multiple modalities. Next, an Importance Factor based Temporal Multiple Correspondence Analysis (i.e., IF-TMCA) approach is presented for effective event detection. Specifically, the HCFG algorithm is integrated with the Hierarchical Information Gain Analysis (HIGA) method to generate the Importance Factor (IF) for producing the initial detection results. Then, the TMCA algorithm is proposed to efficiently incorporate temporal semantics for re-ranking and improving the final performance. At last, a sampling-based ensemble learning mechanism is applied to further accommodate the imbalanced datasets. In addition to the multimedia semantic representation and class imbalance problems, lack of organization is another critical issue for multimedia big data analysis. In this framework, an affinity propagation-based summarization method is also proposed to transform the unorganized data into a better structure with clean and well-organized information. The whole framework has been thoroughly evaluated across multiple domains, such as soccer goal event detection and disaster information management.
Resumo:
The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
Resumo:
Non-orthogonal multiple access (NOMA) is emerging as a promising multiple access technology for the fifth generation cellular networks to address the fast growing mobile data traffic. It applies superposition coding in transmitters, allowing simultaneous allocation of the same frequency resource to multiple intra-cell users. Successive interference cancellation is used at the receivers to cancel intra-cell interference. User pairing and power allocation (UPPA) is a key design aspect of NOMA. Existing UPPA algorithms are mainly based on exhaustive search method with extensive computation complexity, which can severely affect the NOMA performance. A fast proportional fairness (PF) scheduling based UPPA algorithm is proposed to address the problem. The novel idea is to form user pairs around the users with the highest PF metrics with pre-configured fixed power allocation. Systemlevel simulation results show that the proposed algorithm is significantly faster (seven times faster for the scenario with 20 users) with a negligible throughput loss than the existing exhaustive search algorithm.
Resumo:
Proliferation of microglial cells has been considered a sign of glial activation and a hallmark of ongoing neurodegenerative diseases. Microglia activation is analyzed in animal models of different eye diseases. Numerous retinal samples are required for each of these studies to obtain relevant data of statistical significance. Because manual quantification of microglial cells is time consuming, the aim of this study was develop an algorithm for automatic identification of retinal microglia. Two groups of adult male Swiss mice were used: age-matched controls (naïve, n = 6) and mice subjected to unilateral laser-induced ocular hypertension (lasered; n = 9). In the latter group, both hypertensive eyes and contralateral untreated retinas were analyzed. Retinal whole mounts were immunostained with anti Iba-1 for detecting microglial cell populations. A new algorithm was developed in MATLAB for microglial quantification; it enabled the quantification of microglial cells in the inner and outer plexiform layers and evaluates the area of the retina occupied by Iba-1+ microglia in the nerve fiber-ganglion cell layer. The automatic method was applied to a set of 6,000 images. To validate the algorithm, mouse retinas were evaluated both manually and computationally; the program correctly assessed the number of cells (Pearson correlation R = 0.94 and R = 0.98 for the inner and outer plexiform layers respectively). Statistically significant differences in glial cell number were found between naïve, lasered eyes and contralateral eyes (P<0.05, naïve versus contralateral eyes; P<0.001, naïve versus lasered eyes and contralateral versus lasered eyes). The algorithm developed is a reliable and fast tool that can evaluate the number of microglial cells in naïve mouse retinas and in retinas exhibiting proliferation. The implementation of this new automatic method can enable faster quantification of microglial cells in retinal pathologies.
Resumo:
Background: Indices predictive of central obesity include waist circumference (WC) and waist-to-height ratio (WHtR). The aims of this study were 1) to establish a Colombian youth smoothed centile charts and LMS tables for WC and WHtR and 2) to evaluate the utility of these parameters as predictors of overweight and obesity. Method: A cross-sectional study whose sample population comprised 7954 healthy Colombian schoolchildren [boys n=3460 and girls n=4494, mean (standard deviation) age 12.8 (2.3) years old]. Weight, height, body mass index (BMI), WC and WHtR and its percentiles were calculated. Appropriate cut-offs point of WC and WHtR for overweight and obesity, as defined by the International Obesity Task Force (IOTF) definitions, were selected using receiver operating characteristic (ROC) analysis. The discriminating power of WC and WHtR was expressed as area under the curve (AUC). Results: Reference values for WC and WHtR are presented. Mean WC increased and WHtR decreased with age for both genders. We found a moderate positive correlation between WC and BMI (r= 0.756, P < 0.01) and WHtR and BMI (r= 0.604, P < 0.01). The ROC analysis showed a high discrimination power in the identification of overweight and obesity for both measures in our sample population. Overall, WHtR was slightly a better predictor for overweight/obesity (AUC 95% CI 0.868-0.916) than the WC (AUC 95% CI 0.862-0.904). Conclusion: This paper presents the first sex- and age-specific WC and WHtR percentiles for both measures among Colombian children and adolescents aged 9–17.9 years. By providing LMS tables for Latin-American people based on Colombian reference data, we hope to provide quantitative tools for the study of obesity and its comorbidities.
Resumo:
This paper focus on the development of an algorithm using Matlab to generate Typical Meteorological Years from weather data of eight locations in the Madeira Island and to predict the energy generation of photovoltaic systems based on solar cells modelling. Solar cells model includes the effect of ambient temperature and wind speed. The analysis of the PV system performance is carried out through the Weather Corrected Performance Ratio and the PV system yield for the entire island is estimated using spatial interpolation tools.
Resumo:
This short paper presents a numerical method for spatial and temporal downscaling of solar global radiation and mean air temperature data from global weather forecast models and its validation. The final objective is to develop a prediction algorithm to be integrated in energy management models and forecast of energy harvesting in solar thermal systems of medium/low temperature. Initially, hourly prediction and measurement data of solar global radiation and mean air temperature were obtained, being then numerically downscaled to half-hourly prediction values for the location where measurements were taken. The differences between predictions and measurements were analyzed for more than one year of data of mean air temperature and solar global radiation on clear sky days, resulting in relative daily deviations of around -0.9±3.8% and 0.02±3.92%, respectively.
Resumo:
With the advent of new technologies it is increasingly easier to find data of different nature from even more accurate sensors that measure the most disparate physical quantities and with different methodologies. The collection of data thus becomes progressively important and takes the form of archiving, cataloging and online and offline consultation of information. Over time, the amount of data collected can become so relevant that it contains information that cannot be easily explored manually or with basic statistical techniques. The use of Big Data therefore becomes the object of more advanced investigation techniques, such as Machine Learning and Deep Learning. In this work some applications in the world of precision zootechnics and heat stress accused by dairy cows are described. Experimental Italian and German stables were involved for the training and testing of the Random Forest algorithm, obtaining a prediction of milk production depending on the microclimatic conditions of the previous days with satisfactory accuracy. Furthermore, in order to identify an objective method for identifying production drops, compared to the Wood model, typically used as an analytical model of the lactation curve, a Robust Statistics technique was used. Its application on some sample lactations and the results obtained allow us to be confident about the use of this method in the future.
Resumo:
Today’s data are increasingly complex and classical statistical techniques need growingly more refined mathematical tools to be able to model and investigate them. Paradigmatic situations are represented by data which need to be considered up to some kind of trans- formation and all those circumstances in which the analyst finds himself in the need of defining a general concept of shape. Topological Data Analysis (TDA) is a field which is fundamentally contributing to such challenges by extracting topological information from data with a plethora of interpretable and computationally accessible pipelines. We con- tribute to this field by developing a series of novel tools, techniques and applications to work with a particular topological summary called merge tree. To analyze sets of merge trees we introduce a novel metric structure along with an algorithm to compute it, define a framework to compare different functions defined on merge trees and investigate the metric space obtained with the aforementioned metric. Different geometric and topolog- ical properties of the space of merge trees are established, with the aim of obtaining a deeper understanding of such trees. To showcase the effectiveness of the proposed metric, we develop an application in the field of Functional Data Analysis, working with functions up to homeomorphic reparametrization, and in the field of radiomics, where each patient is represented via a clustering dendrogram.
Resumo:
The coastal ocean is a complex environment with extremely dynamic processes that require a high-resolution and cross-scale modeling approach in which all hydrodynamic fields and scales are considered integral parts of the overall system. In the last decade, unstructured-grid models have been used to advance in seamless modeling between scales. On the other hand, the data assimilation methodologies to improve the unstructured-grid models in the coastal seas have been developed only recently and need significant advancements. Here, we link the unstructured-grid ocean modeling to the variational data assimilation methods. In particular, we show results from the modeling system SANIFS based on SHYFEM fully-baroclinic unstructured-grid model interfaced with OceanVar, a state-of-art variational data assimilation scheme adopted for several systems based on a structured grid. OceanVar implements a 3DVar DA scheme. The combination of three linear operators models the background error covariance matrix. The vertical part is represented using multivariate EOFs for temperature, salinity, and sea level anomaly. The horizontal part is assumed to be Gaussian isotropic and is modeled using a first-order recursive filter algorithm designed for structured and regular grids. Here we introduced a novel recursive filter algorithm for unstructured grids. A local hydrostatic adjustment scheme models the rapidly evolving part of the background error covariance. We designed two data assimilation experiments using SANIFS implementation interfaced with OceanVar over the period 2017-2018, one with only temperature and salinity assimilation by Argo profiles and the second also including sea level anomaly. The results showed a successful implementation of the approach and the added value of the assimilation for the active tracer fields. While looking at the broad basin, no significant improvements are highlighted for the sea level, requiring future investigations. Furthermore, a Machine Learning methodology based on an LSTM network has been used to predict the model SST increments.
Resumo:
In this thesis, a search for same-sign top quark pairs produced according to the Standard Model Effective Field Theory (SMEFT) is presented. The analysis is carried out within the ATLAS Collaboration using collision data at a center-of-mass energy of $\sqrt{s} = 13$ TeV, collected by the ATLAS detector during the Run 2 of the Large Hadron Collider, corresponding to an integrated luminosity of $140$ fb$^{-1}$. Three SMEFT operators are considered in the analysis, namely $\mathcal{O}_{RR}$, $\mathcal{O}_{LR}^{(1)}$, and $\mathcal{O}_{LR}^{(8)}$. The signal associated to same-sign top pairs is searched in the dilepton channel, with the top quarks decaying via $t \longrightarrow W^+ b \longrightarrow \ell^+ \nu b$, leading to a final state signature composed of a pair of high-transverse momentum same-sign leptons and $b$-jets. Deep Neural Networks are employed in the analysis to enhance sensitivity to the different SMEFT operators and to perform signal-background discrimination. This is the first result of the ATLAS Collaboration concerning the search for same-sign top quark pairs production in proton-proton collision data at $\sqrt{s} = 13$ TeV, in the framework of the SMEFT.
Resumo:
Machine Learning makes computers capable of performing tasks typically requiring human intelligence. A domain where it is having a considerable impact is the life sciences, allowing to devise new biological analysis protocols, develop patients’ treatments efficiently and faster, and reduce healthcare costs. This Thesis work presents new Machine Learning methods and pipelines for the life sciences focusing on the unsupervised field. At a methodological level, two methods are presented. The first is an “Ab Initio Local Principal Path” and it is a revised and improved version of a pre-existing algorithm in the manifold learning realm. The second contribution is an improvement over the Import Vector Domain Description (one-class learning) through the Kullback-Leibler divergence. It hybridizes kernel methods to Deep Learning obtaining a scalable solution, an improved probabilistic model, and state-of-the-art performances. Both methods are tested through several experiments, with a central focus on their relevance in life sciences. Results show that they improve the performances achieved by their previous versions. At the applicative level, two pipelines are presented. The first one is for the analysis of RNA-Seq datasets, both transcriptomic and single-cell data, and is aimed at identifying genes that may be involved in biological processes (e.g., the transition of tissues from normal to cancer). In this project, an R package is released on CRAN to make the pipeline accessible to the bioinformatic Community through high-level APIs. The second pipeline is in the drug discovery domain and is useful for identifying druggable pockets, namely regions of a protein with a high probability of accepting a small molecule (a drug). Both these pipelines achieve remarkable results. Lastly, a detour application is developed to identify the strengths/limitations of the “Principal Path” algorithm by analyzing Convolutional Neural Networks induced vector spaces. This application is conducted in the music and visual arts domains.
Resumo:
Artificial Intelligence (AI) and Machine Learning (ML) are novel data analysis techniques providing very accurate prediction results. They are widely adopted in a variety of industries to improve efficiency and decision-making, but they are also being used to develop intelligent systems. Their success grounds upon complex mathematical models, whose decisions and rationale are usually difficult to comprehend for human users to the point of being dubbed as black-boxes. This is particularly relevant in sensitive and highly regulated domains. To mitigate and possibly solve this issue, the Explainable AI (XAI) field became prominent in recent years. XAI consists of models and techniques to enable understanding of the intricated patterns discovered by black-box models. In this thesis, we consider model-agnostic XAI techniques, which can be applied to Tabular data, with a particular focus on the Credit Scoring domain. Special attention is dedicated to the LIME framework, for which we propose several modifications to the vanilla algorithm, in particular: a pair of complementary Stability Indices that accurately measure LIME stability, and the OptiLIME policy which helps the practitioner finding the proper balance among explanations' stability and reliability. We subsequently put forward GLEAMS a model-agnostic surrogate interpretable model which requires to be trained only once, while providing both Local and Global explanations of the black-box model. GLEAMS produces feature attributions and what-if scenarios, from both dataset and model perspective. Eventually, we argue that synthetic data are an emerging trend in AI, being more and more used to train complex models instead of original data. To be able to explain the outcomes of such models, we must guarantee that synthetic data are reliable enough to be able to translate their explanations to real-world individuals. To this end we propose DAISYnt, a suite of tests to measure synthetic tabular data quality and privacy.