87 results for Tridiagonal Kernel


Relevance:

10.00%

Publisher:

Abstract:

Web data extraction systems are the kernel of information mediators between users and heterogeneous Web data resources. How to extract structured data from semi-structured documents has been a problem of active research. Supervised and unsupervised methods have been devised to learn extraction rules from training sets. However, preparing training sets (especially annotating them for supervised methods) is very time-consuming. We propose a framework for Web data extraction that logs users' access history and exploits it to assist automatic training-set generation. We cluster accessed Web documents according to their structural details, define criteria to measure the importance of sub-structures, and then generate extraction rules. We also propose a method to adjust the rules according to historical data. Our experiments confirm the viability of our proposal.
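The structural clustering step above can be sketched as follows. This is a minimal illustration, not the paper's method: it reduces each document to a multiset of tag names (a real system would use root-to-leaf tag paths over the DOM) and compares documents by cosine similarity, so pages generated from the same template score close to 1.

```python
import re
from collections import Counter

def tag_paths(html):
    """Very rough structural signature: multiset of opening-tag names.
    (A real extractor would use full root-to-leaf tag paths.)"""
    return Counter(re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html))

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# two pages from the same template, one structurally different page
page1 = "<html><body><table><tr><td>A</td><td>1</td></tr></table></body></html>"
page2 = "<html><body><table><tr><td>B</td><td>2</td></tr></table></body></html>"
page3 = "<html><body><p>About us</p><p>Contact</p></body></html>"
```

Documents whose signatures cluster together would then share extraction rules.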

Relevance:

10.00%

Publisher:

Abstract:

Spectral methods, as an unsupervised technique, have been used with success in data mining, for example LSI in information retrieval, HITS and PageRank in Web search engines, and spectral clustering in machine learning. The essence of their success in these applications is the spectral information that captures the semantics inherent in the large amount of data required during unsupervised learning. In this paper, we ask whether spectral methods can also be used in supervised learning, e.g., classification. In an attempt to answer this question, our research reveals a novel kernel in which spectral clustering information can be easily exploited and extended to new incoming data during classification tasks. Our experimental results show that the proposed Spectral Kernel speeds up classification tasks without compromising accuracy.
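The abstract does not give the construction of its Spectral Kernel; the sketch below (numpy only, names illustrative) shows one standard way to realize the idea: embed the data with the leading eigenvectors of a normalized graph Laplacian, then take inner products of the embeddings as kernel values.

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, k=2):
    """Embed points via the k smallest eigenvectors of the normalized
    graph Laplacian built from an RBF affinity matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))            # RBF affinity
    np.fill_diagonal(W, 0.0)
    d = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)            # ascending eigenvalues
    E = vecs[:, :k]                               # k smallest: cluster structure
    return E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)

def spectral_kernel(X, sigma=1.0, k=2):
    """Kernel matrix K[i, j] = <e_i, e_j> over spectral embeddings."""
    E = spectral_embedding(X, sigma, k)
    return E @ E.T

# two well-separated blobs: within-cluster kernel values are high,
# across-cluster values are near zero
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((10, 2)),
               rng.standard_normal((10, 2)) + 8.0])
K = spectral_kernel(X)
```

Such a kernel can then be plugged into any kernel classifier, extending the clustering structure to new data.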

Relevance:

10.00%

Publisher:

Abstract:

Data pre-processing always plays a key role in learning-algorithm performance. In this research we consider data pre-processing by normalization for support vector machines (SVMs). We examine the effect of normalization across 112 classification problems with an SVM using the RBF kernel. We observe a significant classification improvement due to normalization. Finally, we suggest a rule-based method to determine when normalization is necessary for a specific classification problem. The best normalization method is also automatically selected by the SVM itself.
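The two normalization methods most commonly applied before SVM training can be sketched as follows (function names and the toy data are illustrative; the paper's rule-based selection is not reproduced here). The point is that features on wildly different scales distort RBF distances, since the large-scale feature dominates ||x - y||^2 before scaling.

```python
import numpy as np

def minmax_scale(X):
    """Scale each feature (column) to the range [0, 1]."""
    lo, hi = X.min(0), X.max(0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def zscore_scale(X):
    """Standardize each feature to zero mean and unit variance."""
    mu, sd = X.mean(0), X.std(0)
    return (X - mu) / np.where(sd > 0, sd, 1.0)

# second feature is ~1000x the scale of the first
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])
Xn = minmax_scale(X)
```

After scaling, both features contribute comparably to the RBF kernel's distance computation.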

Relevance:

10.00%

Publisher:

Abstract:

An understanding of space use and dispersal of a wildlife species is essential for effective management. We examined the movements of a beach-dwelling, threatened population of hooded plover (Thinornis rubricollis) in southern central Victoria, Australia, by analysing sightings of colour-banded birds (4897 sightings; 194 birds tracked for up to 9 years). Most movements were relatively short (5050 ± 305 m), with 61.4% <1 km and 95.3% <20 km; they lacked directional or sexual bias. The extent of coastline used by individual birds was 47.8 ± 58.0 km. Regional differences in average distances moved by adults were apparent. For adults, movement rates (mean distance per day) were higher during the non-breeding season than during the breeding season. Non-breeding adults generally remained close to their partners (non-breeding, 456.3 ± 163.9 m; breeding, 148.2 ± 45.3 m). The largest flock sizes were recorded during the non-breeding period, and flocking was not uniformly distributed along the coast but appeared to be concentrated in particular locations. The frequency of pair cohesion (i.e. when the distance between partners was zero on a given day) was similar during the breeding (69.6%) and non-breeding seasons (67.7%). Breeding territories (kernel analysis) were 36.7 ± 5.7 ha and overlapped from year to year in all cases (23 pairwise comparisons; 47.9 ± 7.1% overlap). The high fidelity and constancy of territories confirm that they warrant ongoing management investment, although the species relies on a matrix of breeding and non-breeding sites. The latter appear to occur in specific parts of the coast and warrant enhanced protection and more research attention. Fragmentation of the breeding population might occur where habitat is rendered unsuitable for > ~50 km.

Relevance:

10.00%

Publisher:

Abstract:

Shared clusters represent an excellent platform for the execution of parallel applications given their low price/performance ratio and the presence of cluster infrastructure in many organisations. The focus of recent research efforts is on parallelism management, transport and efficient access to resources, and making clusters easy to use. In this thesis, we examine reliable parallel computing on clusters. The aim of this research is to demonstrate the feasibility of developing an operating system facility providing transparent fault tolerance, using existing, enhanced and newly built operating system services to support parallel applications. In particular, we use existing process duplication and process migration services, and synthesise a group communications facility for use in a transparent checkpointing facility. This research is carried out using the methods of experimental computer science. To provide a foundation for the synthesis of the group communications and checkpointing facilities, we survey and review related work in both fields. For group communications, we examine the V Distributed System, the x-kernel and Psync, the ISIS Toolkit, and Horus. We identify a need for services that consider the placement of processes on computers in the cluster. For checkpointing, we examine Manetho, KeyKOS, libckpt, and Diskless Checkpointing. We observe the use of remote computer memories for storing checkpoints, and the use of copy-on-write mechanisms to reduce the time to create a checkpoint of a process. We propose a group communications facility providing two sets of services: user-oriented services and system-oriented services. User-oriented services provide transparency and are aimed at applications. System-oriented services supplement the user-oriented services to support other operating system services and do not provide transparency. Additional flexibility is achieved by providing delivery and ordering semantics independently.
An operating system facility providing transparent checkpointing is synthesised using coordinated checkpointing. To ensure that a consistent set of checkpoints is generated by the facility, only non-deterministic events are blocked, instead of blindly blocking the processes of a parallel application. This allows the processes of the parallel application to continue execution during the checkpoint operation. Checkpoints are created by adapting process duplication mechanisms, and checkpoint data is transferred to remote computer memories and disk for storage using the mechanisms of process migration. The services of the group communications facility are used to coordinate the checkpoint operation, and to transport checkpoint data to remote computer memories and disk. Both the group communications facility and the checkpointing facility have been implemented in the GENESIS cluster operating system and provide a proof of concept. GENESIS uses a microkernel- and client-server-based operating system architecture, and is demonstrated to provide an appropriate environment for the development of these facilities. We design a number of experiments to test the performance of both the group communications facility and the checkpointing facility, and to provide proof of performance. We present our approach to testing, the challenges raised in testing the facilities, and how we overcame them. For group communications, we examine the performance of a number of delivery semantics. Good speed-ups are observed, and system-oriented group communication services are shown to provide significant performance advantages over user-oriented semantics in the presence of packet loss. For checkpointing, we examine the scalability of the facility given different levels of resource usage and a variable number of computers. Low overheads are observed for checkpointing a parallel application.
This research makes clear that a microkernel- and client-server-based cluster operating system provides an ideal environment for developing a high-performance group communications facility and a transparent checkpointing facility, yielding a platform for reliable parallel computing on clusters.

Relevance:

10.00%

Publisher:

Abstract:

This paper deals with the problem of digital audio watermarking using echo hiding. Compared to many other methods for audio watermarking, echo hiding techniques exhibit advantages in terms of relatively simple encoding and decoding, and robustness against common attacks. The low security of most echo hiding techniques is overcome in the time-spread echo method by using a pseudonoise (PN) sequence as a secret key. In this paper, we propose a novel sequence, in conjunction with a new decoding function, to improve the imperceptibility and robustness of time-spread echo based audio watermarking. Theoretical analysis and simulation examples illustrate the effectiveness of the proposed sequence and decoding function.
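The basic single-echo kernel underlying echo hiding can be sketched as follows; the time-spread method replaces the single delayed impulse with a PN-sequence-spread echo train, which this minimal numpy illustration omits. Delays and echo amplitude are illustrative values.

```python
import numpy as np

def embed_echo(signal, bit, d0=50, d1=100, alpha=0.3):
    """Hide one bit by adding an attenuated echo at delay d0 (bit 0)
    or d1 (bit 1). Delays are in samples; alpha is echo amplitude."""
    d = d1 if bit else d0
    out = signal.copy()
    out[d:] += alpha * signal[:-d]
    return out

def decode_echo(watermarked, d0=50, d1=100):
    """Recover the bit from the real cepstrum: the echo produces a
    peak at its own delay, so compare the two candidate delays."""
    spectrum = np.fft.fft(watermarked)
    cepstrum = np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real
    return int(cepstrum[d1] > cepstrum[d0])

rng = np.random.default_rng(0)
host = rng.standard_normal(4096)   # stand-in for a frame of host audio
wm = embed_echo(host, 1)
```

The time-spread variant improves security because decoding requires the secret PN sequence, not just knowledge of the two delays.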

Relevance:

10.00%

Publisher:

Abstract:

Sequential minimal optimization (SMO) is an efficient algorithm for training support vector machines. The most important step of this algorithm is the selection of the working set, which greatly affects the training speed. The feasible-direction strategy for working-set selection can decrease the objective function, but it may increase the total cost of selecting the working set in each iteration. In this paper, a new candidate working set (CWS) strategy is presented that considers the cost of working-set selection and cache performance. The new strategy selects several of the most violating samples from the cache as the iterative working sets for the next several optimization steps, which improves the efficiency of kernel-cache usage and reduces the computational cost of working-set selection. Theoretical analysis and experiments demonstrate that the proposed method reduces the training time, especially on large-scale datasets.
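A sketch of first-order (most-violating) working-set selection, the ingredient a CWS-style cache builds on. The scoring follows the standard SMO maximal-violation rule over the I_up index set; variable names and the toy data are illustrative, not the paper's implementation.

```python
import numpy as np

def candidate_working_set(grad, y, alpha, C, k=4):
    """Among samples that can still move upward (I_up: y=+1 with
    alpha<C, or y=-1 with alpha>0), pick the k with the largest
    -y_i * grad_i.  SMO's maximal-violating-pair rule uses k=1 plus
    a matching partner; a CWS-style cache keeps the top k for reuse
    over several optimization steps."""
    i_up = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0))
    score = np.where(i_up, -y * grad, -np.inf)
    return np.argsort(score)[-k:][::-1]   # indices, most violating first

# toy example: 6 samples with a hypothetical dual gradient
y = np.array([1, 1, -1, -1, 1, -1])
alpha = np.array([0.0, 0.5, 0.5, 0.0, 1.0, 0.2])
grad = np.array([-1.0, -0.2, 0.8, 0.1, -0.9, 0.3])
ws = candidate_working_set(grad, y, alpha, C=1.0, k=2)
```

Reusing the cached top-k violators amortizes the selection cost and keeps the corresponding kernel rows hot in the cache.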

Relevance:

10.00%

Publisher:

Abstract:

Coverage is a region of attribute (or feature) space that covers only positive samples. Finding coverage is the kernel problem in induction algorithms, because coverage can be used as rules describing positive samples. To reflect the characteristics of the training samples, it is desirable to find large coverage that covers many positive samples. However, finding large coverage is difficult, because the attribute space usually has very high dimensionality. Many heuristic methods such as ID3, AQ and CN2 have been proposed to find large coverage. A robust algorithm has also been proposed to find the largest coverage, but its time and space complexity becomes costly as the dimensionality grows. To overcome this drawback, this paper proposes an algorithm that adopts incremental feature combinations to effectively find the largest coverage. In this algorithm, irrelevant coverage can be pruned away at early stages, because potentially large coverage can be found earlier. Experiments show that the space and time needed to find the largest coverage are significantly reduced.
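For intuition, coverage over continuous attributes can be pictured as an axis-aligned box that contains positive samples and no negatives. The brute-force sketch below only illustrates the problem (it is exponential in the number of positives); it is not the paper's incremental feature-combination algorithm.

```python
import numpy as np
from itertools import combinations

def pure_box(pos, neg, idx):
    """Bounding box of the positive samples in `idx`; returned only
    if it excludes every negative sample (i.e., it is valid coverage)."""
    lo, hi = pos[list(idx)].min(0), pos[list(idx)].max(0)
    inside = np.all((neg >= lo) & (neg <= hi), axis=1)
    return None if inside.any() else (lo, hi)

def largest_coverage(pos, neg):
    """Exhaustively try subsets of positives, largest first, until a
    pure bounding box is found.  Real induction algorithms (ID3, AQ,
    CN2, incremental feature combination) use heuristics instead."""
    for size in range(len(pos), 0, -1):
        for idx in combinations(range(len(pos)), size):
            box = pure_box(pos, neg, idx)
            if box is not None:
                return size, box
    return 0, None

# a negative sample sits between the positives, splitting the coverage
pos = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
neg = np.array([[3.0, 3.0]])
size, box = largest_coverage(pos, neg)
```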

Relevance:

10.00%

Publisher:

Abstract:

The propensity of wool knitwear to form entangled fiber balls, known as pills, on its surface is affected by a large number of factors. This study examines, for the first time, the application of the support vector machine (SVM) data-mining tool to pilling-propensity prediction for wool knitwear. The results indicate that, using the binary classification method and the radial basis function (RBF) kernel, the SVM is able to give high pilling-propensity prediction accuracy for wool knitwear without over-fitting the data. The study also found that the number of records available for each pill rating greatly affects the learning and prediction capability of SVM models.

Relevance:

10.00%

Publisher:

Abstract:

In this paper, a novel bipolar time-spread (TS) echo hiding based watermarking method is proposed for stereo audio signals, to overcome the low robustness of the traditional TS echo hiding method. At embedding, echo signals with opposite polarities are added to the two channels of the host audio signal. This improves the imperceptibility of the watermarking scheme, since the added watermarks have similar effects in both channels. A decoding method is then developed to improve the robustness of the watermarking scheme against common attacks. Since these novel embedding and decoding methods exploit the two channels of stereo audio signals, they significantly reduce the interference of the host signal at watermark extraction, which is the main cause of detection errors in traditional TS echo hiding based watermarking under the closed-loop attack. The effectiveness of the proposed watermarking scheme is theoretically analyzed and verified by simulations under common attacks. The proposed echo hiding method outperforms conventional TS echo hiding based watermarking when their perceptual qualities are similar.

Relevance:

10.00%

Publisher:

Abstract:

This paper proposes an effective pseudonoise (PN) sequence and the corresponding decoding function for time-spread echo-based audio watermarking. Different from the traditional PN sequence used in time-spread echo hiding, the proposed PN sequence has two features. Firstly, the echo kernel resulting from the new PN sequence has frequency characteristics with smaller magnitudes in the perceptually significant region, which leads to higher perceptual quality. Secondly, the correlation function of the new PN sequence has three times as many large peaks as that of the existing PN sequence. Based on this feature, we propose a new decoding function to improve the robustness of time-spread echo-based audio watermarking. The effectiveness of the proposed PN sequence and decoding function is illustrated by theoretical analysis, simulation examples, and a listening test.

Relevance:

10.00%

Publisher:

Abstract:

In a nonparametric setting, the functional form of the relationship between the response variable and the associated predictor variables is assumed to be unknown when data are fitted to the model. Nonparametric regression models can be used for the same types of applications as traditional regression models, such as estimation, prediction, calibration, and optimization. The main aim of nonparametric regression is to highlight important structure in the data without any assumptions about the shape of an underlying regression function; hence the nonparametric approach allows the data to speak for themselves. Applications of sequential procedures to a nonparametric regression model at a given point are considered.

The primary goal of sequential analysis is to achieve a given accuracy using the smallest possible sample size. These sequential procedures allow an experimenter to make decisions based on the smallest number of observations without compromising accuracy. In the nonparametric regression model with a random design based on independent and identically distributed pairs of observations (X, Y), where the regression function m(x) is given by m(x) = E(Y | X = x), estimation of the Nadaraya-Watson kernel estimator m_NW(x) and the local linear kernel estimator m_LL(x) for the curve m(x) is considered. In order to obtain asymptotic confidence intervals for m(x), a two-stage sequential procedure is used, under which some asymptotic properties of the Nadaraya-Watson and local linear estimators are obtained.

The proposed methodology is first tested on simulated data from linear and nonlinear functions. Encouraged by the preliminary simulation results, the proposed method is applied to estimate the nonparametric regression curve of the CAPM.
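The Nadaraya-Watson estimator m_NW(x) discussed above is a kernel-weighted average of the responses; a minimal numpy sketch (Gaussian kernel, illustrative bandwidth):

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h=0.5):
    """Nadaraya-Watson kernel estimate of m(x0) = E(Y | X = x0):
    a Gaussian-kernel weighted average of the responses Y."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return (w * Y).sum() / w.sum()

# simulated data from a nonlinear function, as in the methodology above
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 400)
Y = np.sin(X) + 0.1 * rng.standard_normal(400)
est = nadaraya_watson(1.0, X, Y, h=0.3)   # estimate m(1) = sin(1)
```

The local linear estimator m_LL(x) replaces this weighted mean with a kernel-weighted least-squares line fit at x0, which reduces boundary bias.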

Relevance:

10.00%

Publisher:

Abstract:

Distributed denial of service (DDoS) attack is a continuous critical threat to the Internet. Evolved from lower-layer attacks, new application-layer DDoS attacks that utilize legitimate HTTP requests to overwhelm victim resources are harder to detect. The case may be more serious when such attacks mimic or occur during the flash-crowd event of a popular Website. In this paper, we present the design and implementation of CALD, an architectural extension to protect Web servers against various DDoS attacks that masquerade as flash crowds. CALD provides real-time detection using mess tests, but is different from other systems that use similar methods. First, CALD uses a front-end sensor to monitor the traffic, which may contain various DDoS attacks or flash crowds. An intense pulse in the traffic indicates possible anomalies, because this is the basic property of DDoS attacks and flash crowds. Once abnormal traffic is identified, the sensor sends an ATTENTION signal to activate the attack-detection module. Second, CALD dynamically records the average frequency of each source IP and checks the total mess extent. Theoretically, the mess extent of DDoS attacks is larger than that of flash crowds. Thus, with parameters from the attack-detection module, the filter is capable of letting legitimate requests through while stopping attack traffic. Third, CALD may separate the security modules from the Web servers. As a result, it maintains maximum performance of the kernel Web services, regardless of harassment from DDoS. In the experiments, records from www.sina.com and www.taobao.com have demonstrated the value of CALD.
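The abstract does not specify the mess test itself; as an illustrative proxy only, the "mess extent" of a traffic snapshot can be measured as the Shannon entropy of its source-IP distribution, which is larger when requests are dispersed across many spoofed bot addresses than when a flash crowd is concentrated on repeat visitors. All traffic below is synthetic and hypothetical.

```python
import math
from collections import Counter

def mess_extent(requests):
    """Shannon entropy (bits) of the source-IP frequency distribution,
    used here as a stand-in for the paper's 'mess extent'."""
    counts = Counter(requests)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# synthetic snapshots: a flash crowd dominated by a few repeat IPs
# versus a botnet spraying requests from 1000 distinct addresses
flash_crowd = ["10.0.0.%d" % (i % 5) for i in range(1000)]
ddos = ["172.16.%d.%d" % (i // 256, i % 256) for i in range(1000)]
```

A threshold on this entropy (tuned from the attack-detection module's parameters) would then separate the two traffic classes.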

Relevance:

10.00%

Publisher:

Abstract:

Missing-data imputation is a key issue in learning from incomplete data. Various techniques have been developed with great success for dealing with missing values in data sets with homogeneous attributes (their independent attributes are all either continuous or discrete). This paper studies a new setting of missing-data imputation, i.e., imputing missing data in data sets with heterogeneous attributes (their independent attributes are of different types), referred to as imputing mixed-attribute data sets. Although many real applications fall in this setting, no estimator has been designed for imputing mixed-attribute data sets. This paper first proposes two consistent estimators for discrete and continuous missing target values, respectively. Then, a mixture-kernel-based iterative estimator is advocated to impute mixed-attribute data sets. The proposed method is evaluated in extensive experiments against several typical algorithms, and the results demonstrate that the proposed approach outperforms these existing imputation methods in terms of classification accuracy and root mean square error (RMSE) at different missing ratios.
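A mixture (product) kernel over mixed attributes can be sketched as follows: a Gaussian kernel on continuous columns times a match/mismatch weight on discrete columns, used to impute a missing continuous value as a kernel-weighted mean over complete rows. This is a hedged illustration; the paper's iterative estimator and its consistency results are not reproduced here.

```python
import numpy as np

def mixed_kernel(row_a, row_b, cont_idx, disc_idx, h=1.0, lam=0.5):
    """Product kernel over mixed attributes: Gaussian on continuous
    columns, a simple match/mismatch weight on discrete columns."""
    k = 1.0
    for j in cont_idx:
        k *= np.exp(-0.5 * ((row_a[j] - row_b[j]) / h) ** 2)
    for j in disc_idx:
        k *= 1.0 if row_a[j] == row_b[j] else lam
    return k

def impute_continuous(target_row, complete, col, cont_idx, disc_idx):
    """Impute column `col` as a kernel-weighted mean over complete rows."""
    w = np.array([mixed_kernel(target_row, r, cont_idx, disc_idx)
                  for r in complete])
    return float((w * complete[:, col]).sum() / w.sum())

# toy data: columns = [continuous x, discrete group g, continuous target y]
complete = np.array([[1.0, 0, 10.0],
                     [1.2, 0, 11.0],
                     [5.0, 1, 50.0]])
row = np.array([1.1, 0, np.nan])   # y is missing
y_hat = impute_continuous(row, complete, col=2, cont_idx=[0], disc_idx=[1])
```

The two similar same-group rows dominate the weights, so the imputed value lands near their targets rather than near the dissimilar row.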

Relevance:

10.00%

Publisher:

Abstract:

Feature aggregation is a critical technique in content-based image retrieval (CBIR) that combines multiple feature distances to obtain image dissimilarity. Conventional parallel feature aggregation (PFA) schemes fail to effectively filter out irrelevant images using individual visual features before ranking the images in the collection. Series feature aggregation (SFA) is a new scheme that aims to address this problem. This paper investigates three important properties of SFA that are significant for system design: they reveal the irrelevance of feature order and the convertibility of SFA and PFA, as well as the superior performance of SFA. Furthermore, based on a Gaussian kernel density estimator, the authors propose a new method to estimate the visual threshold, the key parameter of SFA. Experiments, conducted with the IAPR TC-12 benchmark image collection (ImageCLEF2006), which contains over 20,000 photographic images and defined queries, show that SFA can outperform conventional PFA schemes.
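A sketch of estimating a visual threshold from a Gaussian kernel density estimate of the feature distances of relevant images. The quantile rule below is hypothetical; only the use of a Gaussian kernel density estimator is taken from the abstract, and the sample distances are synthetic.

```python
import numpy as np

def gaussian_kde(x, samples, h=None):
    """Gaussian kernel density estimate at points x from 1-D samples,
    with Silverman's rule-of-thumb bandwidth by default."""
    samples = np.asarray(samples, float)
    if h is None:
        h = 1.06 * samples.std() * len(samples) ** (-1 / 5)
    diffs = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * diffs ** 2).sum(1) / (len(samples) * h * np.sqrt(2 * np.pi))

def visual_threshold(relevant_d, quantile=0.95):
    """Hypothetical SFA-style threshold: the distance below which most
    of the estimated density of relevant-image distances lies."""
    grid = np.linspace(relevant_d.min(), relevant_d.max(), 512)
    dens = gaussian_kde(grid, relevant_d)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return float(grid[np.searchsorted(cdf, quantile)])

# synthetic feature distances of relevant images for one visual feature
rng = np.random.default_rng(2)
d = rng.normal(0.3, 0.05, 500)
t = visual_threshold(d)
```

In an SFA pipeline, images whose distance for this feature exceeds t would be filtered out before the next feature in the series is applied.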