909 resultados para Data accuracy
Resumo:
Data mining can be defined as the extraction of implicit, previously un-known, and potentially useful information from data. Numerous re-searchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for Intrusion Detection and the modified versions of this dataset KDDCup99 and NSL-KDD, but until now no one have examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The compared classification learning algorithms in this thesis are: C4.5, CART, k-NN and Naïve Bayes. The performance of these algorithms are compared with accuracy, error rate and average cost on modified versions of NSL-KDD train and test dataset where the instances are classified into normal and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally the most important features to detect cyber-attacks in all categories and in each category are evaluated with Weka’s Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with best performance on the dataset is the k-NN algorithm. The most important features to detect cyber-attacks are basic features such as the number of seconds of a network connection, the protocol used for the connection, the network service used, normal or error status of the connection and the number of data bytes sent. The most important features to detect DoS, Probing and R2L attacks are basic features and the least important features are content features. Unlike U2R attacks, where the content features are the most important features to detect attacks.
Resumo:
Temporal replicate counts are often aggregated to improve model fit by reducing zero-inflation and count variability, and in the case of migration counts collected hourly throughout a migration, allows one to ignore nonindependence. However, aggregation can represent a loss of potentially useful information on the hourly or seasonal distribution of counts, which might impact our ability to estimate reliable trends. We simulated 20-year hourly raptor migration count datasets with known rate of change to test the effect of aggregating hourly counts to daily or annual totals on our ability to recover known trend. We simulated data for three types of species, to test whether results varied with species abundance or migration strategy: a commonly detected species, e.g., Northern Harrier, Circus cyaneus; a rarely detected species, e.g., Peregrine Falcon, Falco peregrinus; and a species typically counted in large aggregations with overdispersed counts, e.g., Broad-winged Hawk, Buteo platypterus. We compared accuracy and precision of estimated trends across species and count types (hourly/daily/annual) using hierarchical models that assumed a Poisson, negative binomial (NB) or zero-inflated negative binomial (ZINB) count distribution. We found little benefit of modeling zero-inflation or of modeling the hourly distribution of migration counts. For the rare species, trends analyzed using daily totals and an NB or ZINB data distribution resulted in a higher probability of detecting an accurate and precise trend. In contrast, trends of the common and overdispersed species benefited from aggregation to annual totals, and for the overdispersed species in particular, trends estimating using annual totals were more precise, and resulted in lower probabilities of estimating a trend (1) in the wrong direction, or (2) with credible intervals that excluded the true trend, as compared with hourly and daily counts.
Resumo:
In Germany the upscaling algorithm is currently the standard approach for evaluating the PV power produced in a region. This method involves spatially interpolating the normalized power of a set of reference PV plants to estimate the power production by another set of unknown plants. As little information on the performances of this method could be found in the literature, the first goal of this thesis is to conduct an analysis of the uncertainty associated to this method. It was found that this method can lead to large errors when the set of reference plants has different characteristics or weather conditions than the set of unknown plants and when the set of reference plants is small. Based on these preliminary findings, an alternative method is proposed for calculating the aggregate power production of a set of PV plants. A probabilistic approach has been chosen by which a power production is calculated at each PV plant from corresponding weather data. The probabilistic approach consists of evaluating the power for each frequently occurring value of the parameters and estimating the most probable value by averaging these power values weighted by their frequency of occurrence. Most frequent parameter sets (e.g. module azimuth and tilt angle) and their frequency of occurrence have been assessed on the basis of a statistical analysis of parameters of approx. 35 000 PV plants. It has been found that the plant parameters are statistically dependent on the size and location of the PV plants. Accordingly, separate statistical values have been assessed for 14 classes of nominal capacity and 95 regions in Germany (two-digit zip-code areas). The performances of the upscaling and probabilistic approaches have been compared on the basis of 15 min power measurements from 715 PV plants provided by the German distribution system operator LEW Verteilnetz. It was found that the error of the probabilistic method is smaller than that of the upscaling method when the number of reference plants is sufficiently large (>100 reference plants in the case study considered in this chapter). When the number of reference plants is limited (<50 reference plants for the considered case study), it was found that the proposed approach provides a noticeable gain in accuracy with respect to the upscaling method.
Resumo:
The Twitter System is the biggest social network in the world, and everyday millions of tweets are posted and talked about, expressing various views and opinions. A large variety of research activities have been conducted to study how the opinions can be clustered and analyzed, so that some tendencies can be uncovered. Due to the inherent weaknesses of the tweets - very short texts and very informal styles of writing - it is rather hard to make an investigation of tweet data analysis giving results with good performance and accuracy. In this paper, we intend to attack the problem from another aspect - using a two-layer structure to analyze the twitter data: LDA with topic map modelling. The experimental results demonstrate that this approach shows a progress in twitter data analysis. However, more experiments with this method are expected in order to ensure that the accurate analytic results can be maintained.
Resumo:
Recommendation systems aim to help users make decisions more efficiently. The most widely used method in recommendation systems is collaborative filtering, of which, a critical step is to analyze a user's preferences and make recommendations of products or services based on similarity analysis with other users' ratings. However, collaborative filtering is less usable for recommendation facing the "cold start" problem, i.e. few comments being given to products or services. To tackle this problem, we propose an improved method that combines collaborative filtering and data classification. We use hotel recommendation data to test the proposed method. The accuracy of the recommendation is determined by the rankings. Evaluations regarding the accuracies of Top-3 and Top-10 recommendation lists using the 10-fold cross-validation method and ROC curves are conducted. The results show that the Top-3 hotel recommendation list proposed by the combined method has the superiority of the recommendation performance than the Top-10 list under the cold start condition in most of the times.
Resumo:
Surface flow types (SFT) are advocated as ecologically relevant hydraulic units, often mapped visually from the bankside to characterise rapidly the physical habitat of rivers. SFT mapping is simple, non-invasive and cost-efficient. However, it is also qualitative, subjective and plagued by difficulties in recording accurately the spatial extent of SFT units. Quantitative validation of the underlying physical habitat parameters is often lacking, and does not consistently differentiate between SFTs. Here, we investigate explicitly the accuracy, reliability and statistical separability of traditionally mapped SFTs as indicators of physical habitat, using independent, hydraulic and topographic data collected during three surveys of a c. 50m reach of the River Arrow, Warwickshire, England. We also explore the potential of a novel remote sensing approach, comprising a small unmanned aerial system (sUAS) and Structure-from-Motion photogrammetry (SfM), as an alternative method of physical habitat characterisation. Our key findings indicate that SFT mapping accuracy is highly variable, with overall mapping accuracy not exceeding 74%. Results from analysis of similarity (ANOSIM) tests found that strong differences did not exist between all SFT pairs. This leads us to question the suitability of SFTs for characterising physical habitat for river science and management applications. In contrast, the sUAS-SfM approach provided high resolution, spatially continuous, spatially explicit, quantitative measurements of water depth and point cloud roughness at the microscale (spatial scales ≤1m). Such data are acquired rapidly, inexpensively, and provide new opportunities for examining the heterogeneity of physical habitat over a range of spatial and temporal scales. Whilst continued refinement of the sUAS-SfM approach is required, we propose that this method offers an opportunity to move away from broad, mesoscale classifications of physical habitat (spatial scales 10-100m), and towards continuous, quantitative measurements of the continuum of hydraulic and geomorphic conditions which actually exists at the microscale.
Resumo:
Yield management helps hotels more profitably manage the capacity of their rooms. Hotels tend to have two types of business: transient and group. Yield management research and systems have been designed for transient business in which the group forecast is taken as a given. In this research, forecast data from approximately 90 hotels of a large North American hotel chain were used to determine the accuracy of group forecasts and to identify factors associated with accurate forecasts. Forecasts showed a positive bias and had a mean absolute percentage error (MAPE) of 40% at two months before arrival; 30% at one month before arrival; and 10-15% on the day of arrival. Larger hotels, hotels with a higher dependence on group business, and hotels that updated their forecasts frequently during the month before arrival had more accurate forecasts.
Resumo:
The use of remote sensing for monitoring of submerged aquatic vegetation (SAV) in fluvial environments has been limited by the spatial and spectral resolution of available image data. The absorption of light in water also complicates the use of common image analysis methods. This paper presents the results of a study that uses very high resolution (VHR) image data, collected with a Near Infrared sensitive DSLR camera, to map the distribution of SAV species for three sites along the Desselse Nete, a lowland river in Flanders, Belgium. Plant species, including Ranunculus aquatilis L., Callitriche obtusangula Le Gall, Potamogeton natans L., Sparganium emersum L. and Potamogeton crispus L., were classified from the data using Object-Based Image Analysis (OBIA) and expert knowledge. A classification rule set based on a combination of both spectral and structural image variation (e.g. texture and shape) was developed for images from two sites. A comparison of the classifications with manually delineated ground truth maps resulted for both sites in 61% overall accuracy. Application of the rule set to a third validation image, resulted in 53% overall accuracy. These consistent results show promise for species level mapping in such biodiverse environments, but also prompt a discussion on assessment of classification accuracy.
Resumo:
This dissertation contains four essays that all share a common purpose: developing new methodologies to exploit the potential of high-frequency data for the measurement, modeling and forecasting of financial assets volatility and correlations. The first two chapters provide useful tools for univariate applications while the last two chapters develop multivariate methodologies. In chapter 1, we introduce a new class of univariate volatility models named FloGARCH models. FloGARCH models provide a parsimonious joint model for low frequency returns and realized measures, and are sufficiently flexible to capture long memory as well as asymmetries related to leverage effects. We analyze the performances of the models in a realistic numerical study and on the basis of a data set composed of 65 equities. Using more than 10 years of high-frequency transactions, we document significant statistical gains related to the FloGARCH models in terms of in-sample fit, out-of-sample fit and forecasting accuracy compared to classical and Realized GARCH models. In chapter 2, using 12 years of high-frequency transactions for 55 U.S. stocks, we argue that combining low-frequency exogenous economic indicators with high-frequency financial data improves the ability of conditionally heteroskedastic models to forecast the volatility of returns, their full multi-step ahead conditional distribution and the multi-period Value-at-Risk. Using a refined version of the Realized LGARCH model allowing for time-varying intercept and implemented with realized kernels, we document that nominal corporate profits and term spreads have strong long-run predictive ability and generate accurate risk measures forecasts over long-horizon. The results are based on several loss functions and tests, including the Model Confidence Set. Chapter 3 is a joint work with David Veredas. We study the class of disentangled realized estimators for the integrated covariance matrix of Brownian semimartingales with finite activity jumps. These estimators separate correlations and volatilities. We analyze different combinations of quantile- and median-based realized volatilities, and four estimators of realized correlations with three synchronization schemes. Their finite sample properties are studied under four data generating processes, in presence, or not, of microstructure noise, and under synchronous and asynchronous trading. The main finding is that the pre-averaged version of disentangled estimators based on Gaussian ranks (for the correlations) and median deviations (for the volatilities) provide a precise, computationally efficient, and easy alternative to measure integrated covariances on the basis of noisy and asynchronous prices. Along these lines, a minimum variance portfolio application shows the superiority of this disentangled realized estimator in terms of numerous performance metrics. Chapter 4 is co-authored with Niels S. Hansen, Asger Lunde and Kasper V. Olesen, all affiliated with CREATES at Aarhus University. We propose to use the Realized Beta GARCH model to exploit the potential of high-frequency data in commodity markets. The model produces high quality forecasts of pairwise correlations between commodities which can be used to construct a composite covariance matrix. We evaluate the quality of this matrix in a portfolio context and compare it to models used in the industry. We demonstrate significant economic gains in a realistic setting including short selling constraints and transaction costs.
Resumo:
We present new methodologies to generate rational function approximations of broadband electromagnetic responses of linear and passive networks of high-speed interconnects, and to construct SPICE-compatible, equivalent circuit representations of the generated rational functions. These new methodologies are driven by the desire to improve the computational efficiency of the rational function fitting process, and to ensure enhanced accuracy of the generated rational function interpolation and its equivalent circuit representation. Toward this goal, we propose two new methodologies for rational function approximation of high-speed interconnect network responses. The first one relies on the use of both time-domain and frequency-domain data, obtained either through measurement or numerical simulation, to generate a rational function representation that extrapolates the input, early-time transient response data to late-time response while at the same time providing a means to both interpolate and extrapolate the used frequency-domain data. The aforementioned hybrid methodology can be considered as a generalization of the frequency-domain rational function fitting utilizing frequency-domain response data only, and the time-domain rational function fitting utilizing transient response data only. In this context, a guideline is proposed for estimating the order of the rational function approximation from transient data. The availability of such an estimate expedites the time-domain rational function fitting process. The second approach relies on the extraction of the delay associated with causal electromagnetic responses of interconnect systems to provide for a more stable rational function process utilizing a lower-order rational function interpolation. A distinctive feature of the proposed methodology is its utilization of scattering parameters. For both methodologies, the approach of fitting the electromagnetic network matrix one element at a time is applied. It is shown that, with regard to the computational cost of the rational function fitting process, such an element-by-element rational function fitting is more advantageous than full matrix fitting for systems with a large number of ports. Despite the disadvantage that different sets of poles are used in the rational function of different elements in the network matrix, such an approach provides for improved accuracy in the fitting of network matrices of systems characterized by both strongly coupled and weakly coupled ports. Finally, in order to provide a means for enforcing passivity in the adopted element-by-element rational function fitting approach, the methodology for passivity enforcement via quadratic programming is modified appropriately for this purpose and demonstrated in the context of element-by-element rational function fitting of the admittance matrix of an electromagnetic multiport.
Resumo:
Tässä työssä perehdytään soodakattiloiden vesikiertomallin rakentamiseen. Työn päätavoitteena on kehittää simulointimallia varten taulukkolaskentapohja, jonka avulla soodakattilan lämpövuotietoja on yksinkertaista ja nopeaa käsitellä ja siirtää Apros 6 -simulointiohjelmaan. Lisäksi tarkoituksena on pyrkiä automatisoimaan työvaiheet mahdollisimman pitkälle, jolloin vesikiertolaskennan tekeminen yksinkertaistuisi, yhtenäistyisi ja tarkentuisi. Tämä on mahdollista Excel- makrojen ja Apros 6:n uusien toimintojen avulla. Apros 6:ssa on nyt mahdollista hyödyntää SCL- komentotiedostoja, joiden avulla sujuva tiedonsiirto Aproksen ja Excelin välillä vodaan toteuttaa. Vesikiertolaskentaan käytettävän datan käsittely on aikaisemmin ollut työlästä ja sen tarkkuus on pitkälti riippunut mallintajasta. Tässä diplomityössä päästään hyödyntämään uusimpia ja realistisempia soodakattiloiden CFD- malleja, joiden avulla pystytään luomaan aikaisempaa tarkemmat lämpövuojakaumat soodakattilan lämpöpinnoille. Tämä muutos parantaa vesikiertolaskennan tarkkuutta. Työn kokeellisessa osassa uutta Excel laskentatyökalua ja uusia lämpövuoarvoja testataan käytännössä. Eräs vanha Apros- vesikiertomalli päivitetään uusilla lämpövuoarvoilla ja sen rakenteeseen tehdään muutoksia tarkkuuden parantamiseksi. Uuden mallin toimivuutta testataan myös 115 %:n kapasiteetilla ja tutkitaan kuinka kyseinen vesikiertopiiri reagoi suurempaan lämpötehoon. Näitä kolmea eri tilannetta vertaillaan toisiinsa ja tarkastellaan eroavaisuuksia niiden vesi-höyrypiireissä.
Resumo:
This thesis investigates how web search evaluation can be improved using historical interaction data. Modern search engines combine offline and online evaluation approaches in a sequence of steps that a tested change needs to pass through to be accepted as an improvement and subsequently deployed. We refer to such a sequence of steps as an evaluation pipeline. In this thesis, we consider the evaluation pipeline to contain three sequential steps: an offline evaluation step, an online evaluation scheduling step, and an online evaluation step. In this thesis we show that historical user interaction data can aid in improving the accuracy or efficiency of each of the steps of the web search evaluation pipeline. As a result of these improvements, the overall efficiency of the entire evaluation pipeline is increased. Firstly, we investigate how user interaction data can be used to build accurate offline evaluation methods for query auto-completion mechanisms. We propose a family of offline evaluation metrics for query auto-completion that represents the effort the user has to spend in order to submit their query. The parameters of our proposed metrics are trained against a set of user interactions recorded in the search engine’s query logs. From our experimental study, we observe that our proposed metrics are significantly more correlated with an online user satisfaction indicator than the metrics proposed in the existing literature. Hence, fewer changes will pass the offline evaluation step to be rejected after the online evaluation step. As a result, this would allow us to achieve a higher efficiency of the entire evaluation pipeline. Secondly, we state the problem of the optimised scheduling of online experiments. We tackle this problem by considering a greedy scheduler that prioritises the evaluation queue according to the predicted likelihood of success of a particular experiment. This predictor is trained on a set of online experiments, and uses a diverse set of features to represent an online experiment. Our study demonstrates that a higher number of successful experiments per unit of time can be achieved by deploying such a scheduler on the second step of the evaluation pipeline. Consequently, we argue that the efficiency of the evaluation pipeline can be increased. Next, to improve the efficiency of the online evaluation step, we propose the Generalised Team Draft interleaving framework. Generalised Team Draft considers both the interleaving policy (how often a particular combination of results is shown) and click scoring (how important each click is) as parameters in a data-driven optimisation of the interleaving sensitivity. Further, Generalised Team Draft is applicable beyond domains with a list-based representation of results, i.e. in domains with a grid-based representation, such as image search. Our study using datasets of interleaving experiments performed both in document and image search domains demonstrates that Generalised Team Draft achieves the highest sensitivity. A higher sensitivity indicates that the interleaving experiments can be deployed for a shorter period of time or use a smaller sample of users. Importantly, Generalised Team Draft optimises the interleaving parameters w.r.t. historical interaction data recorded in the interleaving experiments. Finally, we propose to apply the sequential testing methods to reduce the mean deployment time for the interleaving experiments. We adapt two sequential tests for the interleaving experimentation. We demonstrate that one can achieve a significant decrease in experiment duration by using such sequential testing methods. The highest efficiency is achieved by the sequential tests that adjust their stopping thresholds using historical interaction data recorded in diagnostic experiments. Our further experimental study demonstrates that cumulative gains in the online experimentation efficiency can be achieved by combining the interleaving sensitivity optimisation approaches, including Generalised Team Draft, and the sequential testing approaches. Overall, the central contributions of this thesis are the proposed approaches to improve the accuracy or efficiency of the steps of the evaluation pipeline: the offline evaluation frameworks for the query auto-completion, an approach for the optimised scheduling of online experiments, a general framework for the efficient online interleaving evaluation, and a sequential testing approach for the online search evaluation. The experiments in this thesis are based on massive real-life datasets obtained from Yandex, a leading commercial search engine. These experiments demonstrate the potential of the proposed approaches to improve the efficiency of the evaluation pipeline.
Resumo:
A ecografia é o exame de primeira linha na identificação e caraterização de tumores anexiais. Foram descritos diversos métodos de diagnóstico diferencial incluindo a avaliação subjetiva do observador, índices descritivos simples e índices matematicamente desenvolvidos como modelos de regressão logística, continuando a avaliação subjectiva por examinador diferenciado a ser o melhor método de discriminação entre tumores malignos e benignos. No entanto, devido à subjectividade inerente a esta avaliação tornouse necessário estabelecer uma nomenclatura padronizada e uma classificação que facilitasse a comunicação de resultados e respectivas recomendações de vigilância. O objetivo deste artigo é resumir e comparar diferentes métodos de avaliação e classificação de tumores anexiais, nomeadamente os modelos do grupo International Ovary Tumor Analysis (IOTA) e a classificação Gynecologic Imaging Report and Data System (GI-RADS), em termos de desempenho diagnóstico e utilidade na prática clínica.
Resumo:
Recommendation for Oxygen Measurements from Argo Floats: Implementation of In-Air-Measurement Routine to Assure Highest Long-term Accuracy As Argo has entered its second decade and chemical/biological sensor technology is improving constantly, the marine biogeochemistry community is starting to embrace the successful Argo float program. An augmentation of the global float observatory, however, has to follow rather stringent constraints regarding sensor characteristics as well as data processing and quality control routines. Owing to the fairly advanced state of oxygen sensor technology and the high scientific value of oceanic oxygen measurements (Gruber et al., 2010), an expansion of the Argo core mission to routine oxygen measurements is perhaps the most mature and promising candidate (Freeland et al., 2010). In this context, SCOR Working Group 142 “Quality Control Procedures for Oxygen and Other Biogeochemical Sensors on Floats and Gliders” (www.scor-int.org/SCOR_WGs_WG142.htm) set out in 2014 to assess the current status of biogeochemical sensor technology with particular emphasis on float-readiness, develop pre- and post-deployment quality control metrics and procedures for oxygen sensors, and to disseminate procedures widely to ensure rapid adoption in the community.
The use of mo and cu monochromatic radiations for quantitative phase analysis: study of the accuracy
Resumo:
Cement hydration is a very complex process in which crystalline phases are dissolving in water and after supersaturation hydrated crystalline and amorphous phases precipitate. Great efforts are being made to develop analytical tools to accurately quantify these processes and X-ray Powder Diffraction (XRPD) combined with Rietveld methodology is a suitable tool to quantify these complex mixtures and their time evolutions. However, some problems/drawbacks should be overcome to fully apply it to cement pastes characterization in order to get accurate phase analyses. In order to tackle this issue, a comparison of the Rietveld quantitative phase analyses (RQPA) obtained using Cu-Kα1, Mo-Kα1, and synchrotron strictly monochromatic radiations of three set of mixtures with increasing amounts of a given phase (spiking-method) is presented. The main aim is to test a simple hypothesis: high energy Mo-radiation, combined with high resolution laboratory X-ray powder diffraction optics, could yield more accurate RQPA, for challenging samples, than well-established Cu-radiation procedure(s). Firstly, a series of crystalline inorganic phase mixtures with increasing amounts of an analyte was studied in order to determine if Mo-Kα1 methodology is as robust as the well-established Cu-Kα1 one. Secondly, a series of crystalline organic phase mixtures with increasing amounts of an organic compound was analyzed. This type of mixture can result in transparency problems in reflection and inhomogeneous loading in narrow capillaries for transmission studies. Finally, a third series with variable amorphous content was studied. Limit of detection in Cu-patterns, ~0.2 wt%, are slightly lower than those derived from Mo-patterns, ~0.3 wt%, for similar recording times and limit of quantification for a well crystallized inorganic phase using laboratory powder diffraction was established ~0.10 wt%. From the obtained results it is inferred that RQPA from Mo-Kα1 radiation have slightly better accuracies than those obtained from Cu-Kα1. The results obtained in the previous comparison have been taken into account to obtain accurate RQPA, including the amorphous component with internal standard methodology, of hydrating cement pastes. The final goal of this second study was understanding the early-stage hydration mechanisms of a variety of cementing systems (Ordinary Portland Cement or Belite Alite Ye’elimite cement) as a function of water content, superplasticizer additives and type and content of sulfate source. In order to do so, X-ray powder diffraction data were taken in-situ with the humidity chamber coupled to the Mo-Kα1 powder diffractometer. Some results of this ongoing investigation will be reported and discussed.