996 results for standard batch algorithms
Abstract:
Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. From a large number of methodologies available in the literature only a few are able to handle both single and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well using the Genia corpus (a standard life science corpus). This indicates that choice and design of corpus has a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and occupy a fairly large proportion in certain domains. As a result, algorithms that ignore single-word terms may cause problems to tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects and this means information extraction techniques need to be integrated into the term recognition process.
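The abstract does not say how the voting mechanism combines the individual algorithms. As a minimal sketch, assuming each algorithm returns a ranked list of candidate terms and that votes are weighted by reciprocal rank (both are assumptions, and the function name `vote` is hypothetical), a combined ranking could be computed like this:

```python
from collections import defaultdict

def vote(rankings):
    """Merge ranked term lists by summing reciprocal ranks (assumed scheme)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, term in enumerate(ranking, start=1):
            scores[term] += 1.0 / position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))

# Toy output of three hypothetical ATR algorithms (single- and multi-word terms).
rankings = [
    ["cell", "t cell", "gene expression"],
    ["t cell", "cell", "protein"],
    ["t cell", "gene expression", "cell"],
]
combined = vote(rankings)
print(combined)
```

Reciprocal-rank weighting is only one possible choice; a majority vote over top-k lists would fit the same interface.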
Abstract:
A formalism for modelling the dynamics of Genetic Algorithms (GAs) using methods from statistical mechanics, originally due to Prugel-Bennett and Shapiro, is reviewed, generalized and improved upon. This formalism can be used to predict the averaged trajectory of macroscopic statistics describing the GA's population. These macroscopics are chosen to average well between runs, so that fluctuations from mean behaviour can often be neglected. Where necessary, non-trivial terms are determined by assuming maximum entropy with constraints on known macroscopics. Problems of realistic size are described in compact form and finite population effects are included, often proving to be of fundamental importance. The macroscopics used here are cumulants of an appropriate quantity within the population and the mean correlation (Hamming distance) within the population. Including the correlation as an explicit macroscopic provides a significant improvement over the original formulation. The formalism is applied to a number of simple optimization problems in order to determine its predictive power and to gain insight into GA dynamics. Problems which are most amenable to analysis come from the class where alleles within the genotype contribute additively to the phenotype. This class can be treated with some generality, including problems with inhomogeneous contributions from each site, non-linear or noisy fitness measures, simple diploid representations and temporally varying fitness. The results can also be applied to a simple learning problem, generalization in a binary perceptron, and a limit is identified for which the optimal training batch size can be determined for this problem. The theory is compared to averaged results from a real GA in each case, showing excellent agreement if the maximum entropy principle holds. Some situations where this approximation breaks down are identified.
In order to fully test the formalism, an attempt is made on the strongly NP-hard problem of storing random patterns in a binary perceptron. Here, the relationship between the genotype and phenotype (training error) is strongly non-linear. Mutation is modelled under the assumption that perceptron configurations are typical of perceptrons with a given training error. Unfortunately, this assumption does not provide a good approximation in general. It is conjectured that perceptron configurations would have to be constrained by other statistics in order to accurately model mutation for this problem. Issues arising from this study are discussed in conclusion and some possible areas of further research are outlined.
Abstract:
Magnification factors specify the extent to which the area of a small patch of the latent (or 'feature') space of a topographic mapping is magnified on projection to the data space, and are of considerable interest in both neuro-biological and data analysis contexts. Previous attempts to consider magnification factors for the self-organizing map (SOM) algorithm have been hindered because the mapping is only defined at discrete points (given by the reference vectors). In this paper we consider the batch version of SOM, for which a continuous mapping can be defined, as well as the Generative Topographic Mapping (GTM) algorithm of Bishop et al. (1997) which has been introduced as a probabilistic formulation of the SOM. We show how the techniques of differential geometry can be used to determine magnification factors as continuous functions of the latent space coordinates. The results are illustrated here using a problem involving the identification of crab species from morphological data.
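As a brief illustration of the quantity involved (a sketch using standard differential geometry, with notation assumed here rather than taken from the paper): let y(x) be the continuous mapping from an L-dimensional latent space to the D-dimensional data space, and let J be its D x L Jacobian. The local area magnification factor can then be written as

```latex
J_{ki} = \frac{\partial y_k}{\partial x_i}, \qquad
\frac{dA'}{dA} = \sqrt{\det\left(J^{\mathsf{T}} J\right)}
```

where dA is an infinitesimal patch of latent space at x and dA' is the area of its image in data space. Because J depends on x, the factor is a continuous function of the latent coordinates whenever the mapping is differentiable.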
Abstract:
A theoretical model is presented which describes selection in a genetic algorithm (GA) under a stochastic fitness measure and correctly accounts for finite population effects. Although this model describes a number of selection schemes, we only consider Boltzmann selection in detail here as results for this form of selection are particularly transparent when fitness is corrupted by additive Gaussian noise. Finite population effects are shown to be of fundamental importance in this case, as the noise has no effect in the infinite population limit. In the limit of weak selection we show how the effects of any Gaussian noise can be removed by increasing the population size appropriately. The theory is tested on two closely related problems: the one-max problem corrupted by Gaussian noise and generalization in a perceptron with binary weights. The averaged dynamics can be accurately modelled for both problems using a formalism which describes the dynamics of the GA using methods from statistical mechanics. The second problem is a simple example of a learning problem and by considering this problem we show how the accurate characterization of noise in the fitness evaluation may be relevant in machine learning. The training error (negative fitness) is the number of misclassified training examples in a batch and can be considered as a noisy version of the generalization error if an independent batch is used for each evaluation. The noise is due to the finite batch size and in the limit of large problem size and weak selection we show how the effect of this noise can be removed by increasing the population size. This allows the optimal batch size to be determined, which minimizes computation time as well as the total number of training examples required.
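To make the setting concrete, here is a minimal sketch of Boltzmann selection on the one-max problem with fitness corrupted by additive Gaussian noise. It is not the paper's model: the parameter names `beta` (selection strength) and `noise_std`, and all numeric values, are assumptions chosen for illustration.

```python
import math
import random

def boltzmann_select(population, fitness, beta, noise_std, rng):
    """Resample the population with probability proportional to exp(beta * noisy fitness)."""
    # Each fitness evaluation is corrupted by independent Gaussian noise.
    noisy = [f + rng.gauss(0.0, noise_std) for f in fitness]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(noisy)
    weights = [math.exp(beta * (f - top)) for f in noisy]
    return rng.choices(population, weights=weights, k=len(population))

rng = random.Random(0)
# One-max: the fitness of a bit string is the number of ones it contains.
population = [[rng.randint(0, 1) for _ in range(20)] for _ in range(50)]
fitness = [sum(bits) for bits in population]
selected = boltzmann_select(population, fitness, beta=0.5, noise_std=1.0, rng=rng)
print(sum(fitness) / 50, sum(map(sum, selected)) / 50)
```

Small values of `beta` correspond to the weak-selection limit discussed in the abstract. Rerunning the sketch with different population sizes is one way to observe finite-population effects empirically.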
Abstract:
The literature on dextransucrase production, purification, immobilization and reactions has been reviewed, together with a brief review of the literature on general enzyme biotechnology and fermentation technology. Fed-batch fermentation of the bacterium Leuconostoc mesenteroides NRRL B512 (F) to produce dextransucrase has formed the major part of this research. Aerobic and anaerobic fermentations have been studied using a 16 litre New Brunswick fermenter which has a 3-12 litre working volume. The initial volume of broth used in the studies was 6 litres. The results of the fed-batch fermentations showed for the first time that yields of dextransucrase are much higher under anaerobic conditions than during aerobic fermentations. Dextransucrase containing 300-350 DSU/cm³ of enzyme activity has been obtained during the aerobic fermentations, while in the anaerobic fermentations, enzyme yields containing 450-500 DSU/cm³ have been obtained routinely. The type of yeast extract used in the fermentation medium has been found to have significant effects on enzyme yield. Of the different types studied, the Gistex Standard was found to be the type that favoured the highest enzyme production. Studies have also been carried out on the effect of agitation rate and antifoam on the enzyme production during the anaerobic experiments. Agitation rates of up to 600 rpm were found not to affect the enzyme yield; however, the presence of antifoam in the medium led to a significant reduction in enzyme activity (less than 300 DSU/cm³). Scale-up of the anaerobic fermentations has been performed at up to the 1000 litre level with enzyme yields containing more than 400 DSU/cm³ of activity being produced. Some of the enzyme produced at this scale was used for the first time to produce dextran on an industrial scale via the enzyme route, with up to 99% conversion of sucrose to dextran being obtained.
An attempt has been made at continuous dextransucrase production. Cell washout was observed to occur at dilution rates greater than 0.4 h⁻¹. Dextransucrase containing up to 25 DSU/cm³/h has been produced continuously.
Abstract:
Orthogonal frequency division multiplexing (OFDM) is becoming a fundamental technology in future generation wireless communications. Call admission control is an effective mechanism to guarantee resilient, efficient, and quality-of-service (QoS)-aware services in wireless mobile networks. In this paper, we present several call admission control algorithms for OFDM-based wireless multiservice networks. Call connection requests are differentiated into narrow-band calls and wide-band calls. For either class of calls, the traffic process is characterized as batch arrival since each call may request multiple subcarriers to satisfy its QoS requirement. The batch size is a random variable following a probability mass function (PMF) with a realistic maximum value. In addition, the service times for wide-band and narrow-band calls are different. Following this, we perform a tele-traffic queueing analysis for OFDM-based wireless multiservice networks. Formulae are developed for the key performance metrics, call blocking probability and bandwidth utilization. Numerical investigations are presented to demonstrate the interaction between key parameters and performance metrics. The performance tradeoff among different call admission control algorithms is discussed. Moreover, the analytical model has been validated by simulation. The methodology as well as the result provides an efficient tool for planning next-generation OFDM-based broadband wireless access systems.
Abstract:
A polyparametric intelligent information system for diagnosing the human functional state in medicine and public health has been developed. The essence of the system is a polyparametric description of the human functional state by a unified set of physiological parameters, together with a polyparametric cognitive model developed as a tool for the system analysis of large sets of data and the diagnosis of the human functional state. The model is built on general principles of geometry and symmetry using algorithms from artificial intelligence systems. The architecture of the system is presented. The model allows the analysis of traditional signs (absolute values of electrophysiological parameters) and of new signs generated by the model (relationships between those parameters). Physiological multidimensional data are classified with a transformer of the model. The results are presented to a physician as a visual graph, a pattern of the individual functional state. This graph supports clinical syndrome analysis. The level of the human functional state is defined relative to the developed standard (“ideal”) functional state. The complete formalization of the results makes it possible to accumulate physiological data and to analyze them by mathematical methods.
Abstract:
The paper addresses the problem of arranging transit points. This issue is connected with the accuracy of tariff distance calculation and is a pressing problem at present. It is shown that the standard method of determining tariff distance is not optimal. Genetic algorithms are used to solve the optimization problem. The UML class diagram of the application and the class contents are shown. Finally, an example of transit point arrangement is presented.
Abstract:
In this paper a genetic algorithm (GA) is applied to the Maximum Betweenness Problem (MBP). The maximum of the objective function is obtained by finding a permutation which satisfies a maximal number of betweenness constraints. Every permutation considered is genetically coded with an integer representation. Standard operators are used in the GA. Instances in the experimental results are randomly generated. For smaller dimensions, optimal solutions of MBP are obtained by total enumeration. For those instances, the GA reached all optimal solutions except one. The GA also obtained results for larger instances of up to 50 elements and 1000 triples. The total running time and the time needed to find optimal results are both quite short.
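The objective function and the total-enumeration baseline described above can be sketched as follows. This is an illustrative sketch, not the paper's code: a triple (a, b, c) is taken to be satisfied when b lies between a and c in the permutation, and the function names are hypothetical.

```python
from itertools import permutations

def satisfied(perm, triples):
    """Count the betweenness constraints (a, b, c) that perm satisfies."""
    pos = {element: index for index, element in enumerate(perm)}
    return sum(
        1
        for a, b, c in triples
        if pos[a] < pos[b] < pos[c] or pos[c] < pos[b] < pos[a]
    )

def best_by_enumeration(n, triples):
    """Exhaustive search over all n! permutations; feasible only for small n."""
    return max(permutations(range(n)), key=lambda p: satisfied(p, triples))

# Toy instance: the three triples below cannot all be satisfied at once.
triples = [(0, 1, 2), (1, 2, 3), (0, 3, 2)]
best = best_by_enumeration(4, triples)
print(best, satisfied(best, triples))
```

A GA would use `satisfied` as its fitness function and replace the exhaustive search with selection, crossover and mutation over encoded permutations.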
Abstract:
A job shop with one batch processing and several discrete machines is analyzed. Given a set of jobs, their process routes, processing requirements, and size, the objective is to schedule the jobs such that the makespan is minimized. The batch processing machine can process a batch of jobs as long as the machine capacity is not violated. The batch processing time is equal to the longest processing job in the batch. The problem under study can be represented as Jm:batch:Cmax. If no batches were formed, the scheduling problem under study reduces to the classical job shop scheduling problem (i.e. Jm:: Cmax), which is known to be NP-hard. This research extends the scheduling literature by combining Jm::Cmax with batch processing. The primary contributions are the mathematical formulation, a new network representation and several solution approaches. The problem under study is observed widely in metal working and other industries, but received limited or no attention due to its complexity. A novel network representation of the problem using disjunctive and conjunctive arcs, and a mathematical formulation are proposed to minimize the makespan. Besides that, several algorithms, like batch forming heuristics, dispatching rules, Modified Shifting Bottleneck, Tabu Search (TS) and Simulated Annealing (SA), were developed and implemented. An experimental study was conducted to evaluate the proposed heuristics, and the results were compared to those from a commercial solver (i.e., CPLEX). TS and SA, with the combination of MWKR-FF as the initial solution, gave the best solutions among all the heuristics proposed. Their results were close to CPLEX; and for some larger instances, with total operations greater than 225, they were competitive in terms of solution quality and runtime. For some larger problem instances, CPLEX was unable to report a feasible solution even after running for several hours. 
Between SA and TS, the experimental study indicated that SA produced a better average Cmax for all instances. The proposed solution approaches will help practitioners schedule a job shop (with both discrete and batch processing machines) more efficiently. They are easy to implement and require short run times to solve large problem instances.
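The two batching rules stated in the abstract (a batch may not exceed the machine capacity, and its processing time equals the longest job in it) can be illustrated with a simple first-fit batch-forming heuristic. First-fit is chosen here for illustration only and is not necessarily one of the paper's heuristics; the job data and capacity are made up.

```python
def first_fit_batches(jobs, capacity):
    """Place each (size, processing_time) job into the first batch with enough room."""
    batches = []  # each batch is a list of jobs
    loads = []    # total job size currently in each batch
    for job in jobs:
        size, _ = job
        for i, load in enumerate(loads):
            if load + size <= capacity:
                batches[i].append(job)
                loads[i] += size
                break
        else:
            # No existing batch can hold the job, so open a new one.
            batches.append([job])
            loads.append(size)
    return batches

def batch_time(batch):
    """A batch takes as long as its longest job."""
    return max(processing_time for _, processing_time in batch)

jobs = [(4, 10), (3, 7), (5, 12), (2, 3)]  # (size, processing time), hypothetical
batches = first_fit_batches(jobs, capacity=8)
times = [batch_time(batch) for batch in batches]
print(batches, times)
```

In a full job-shop setting these batches would still have to be sequenced together with the operations on the discrete machines, which is where the dispatching rules and metaheuristics come in.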
Abstract:
This research studies the hybrid flow shop problem which has parallel batch-processing machines in one stage and discrete-processing machines in other stages to process jobs of arbitrary sizes. The objective is to minimize the makespan for a set of jobs. The problem is denoted as: FF: batch1,sj:Cmax. The problem is formulated as a mixed-integer linear program. The commercial solver, AMPL/CPLEX, is used to solve problem instances to optimality. Experimental results show that AMPL/CPLEX requires considerable time to find the optimal solution for even a small problem, i.e., a 6-job instance requires 2 hours on average. A bottleneck-first-decomposition heuristic (BFD) is proposed in this study to overcome the computational (time) problem encountered while using the commercial solver. The proposed BFD heuristic is inspired by the shifting bottleneck heuristic. It decomposes the entire problem into three sub-problems, and schedules the sub-problems one by one. The proposed BFD heuristic consists of four major steps: formulating sub-problems, prioritizing sub-problems, solving sub-problems and re-scheduling. For solving the sub-problems, two heuristic algorithms are proposed; one for scheduling a hybrid flow shop with discrete processing machines, and the other for scheduling parallel batching machines (single stage). Both consider job arrival and delivery times. A designed experiment is conducted to evaluate the effectiveness of the proposed BFD, which is further evaluated against a set of common heuristics including a randomized greedy heuristic and five dispatching rules. The results show that the proposed BFD heuristic outperforms all these algorithms. To evaluate the quality of the heuristic solution, a procedure is developed to calculate a lower bound of makespan for the problem under study. The lower bound obtained is tighter than other bounds developed for related problems in the literature.
A meta-search approach based on the Genetic Algorithm concept is developed to evaluate the significance of further improving the solution obtained from the proposed BFD heuristic. The experiment indicates that it reduces the makespan by 1.93% on average within a negligible time when the problem size is less than 50 jobs.
Abstract:
With the popularization of GPS-enabled devices such as mobile phones, location data are becoming available at an unprecedented scale. The locations may be collected from many different sources such as vehicles moving around a city, user check-ins in social networks, and geo-tagged micro-blogging photos or messages. Besides the longitude and latitude, each location record may also have a timestamp and additional information such as the name of the location. Time-ordered sequences of these locations form trajectories, which together contain useful high-level information about people's movement patterns.
The first part of this thesis focuses on a few geometric problems motivated by the matching and clustering of trajectories. We first give a new algorithm for computing a matching between a pair of curves under existing models such as dynamic time warping (DTW). The algorithm is more efficient than standard dynamic programming algorithms both theoretically and practically. We then propose a new matching model for trajectories that avoids the drawbacks of existing models. For trajectory clustering, we present an algorithm that computes clusters of subtrajectories, which correspond to common movement patterns. We also consider trajectories of check-ins, and propose a statistical generative model, which identifies check-in clusters as well as the transition patterns between the clusters.
The second part of the thesis considers the problem of covering shortest paths in a road network, motivated by an EV charging station placement problem. More specifically, a subset of vertices in the road network are selected to place charging stations so that every shortest path contains enough charging stations and can be traveled by an EV without draining the battery. We first introduce a general technique for the geometric set cover problem. This technique leads to near-linear-time approximation algorithms, which are the state-of-the-art algorithms for this problem in either running time or approximation ratio. We then use this technique to develop a near-linear-time algorithm for this shortest-path cover problem.
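For reference, the standard dynamic programming algorithm for dynamic time warping (DTW), the baseline that the first part of the thesis improves upon, can be sketched as below. This is the textbook quadratic-time formulation, not the thesis's faster algorithm. Trajectories are assumed to be lists of 2D points, and the point-to-point cost is assumed to be Euclidean distance.

```python
import math

def dtw(p, q):
    """DTW cost between point sequences p and q in O(len(p) * len(q)) time."""
    n, m = len(p), len(q)
    # cost[i][j] is the best cost of matching the first i points of p
    # with the first j points of q.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(p[i - 1], q[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance along p only
                cost[i][j - 1],      # advance along q only
                cost[i - 1][j - 1],  # advance along both
            )
    return cost[n][m]

a = [(0, 0), (1, 0), (2, 0)]
b = [(0, 0), (2, 0)]
print(dtw(a, b))
```

The full table makes the quadratic cost explicit, which is the motivation for seeking a more efficient algorithm on long trajectories.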
Abstract:
This thesis was written within the framework of a project called 'seepage water prognosis', funded by the Federal Ministry for Education and Science (BMBF). 41 German institutions, among them university research institutes, public authorities and engineering companies, were each funded for three years. The aim was to work out the scientific basis that is needed to carry out a seepage water prognosis (Oberacker and Eberle, 2002). According to the Federal German Soil Protection Act (Federal Bulletin, 1998) a seepage water prognosis is required in order to avoid future soil impacts from the application of recycling products. The participants focused either on the development of methods to determine the source strength of the materials investigated, which is defined as the total mass flow caused by natural leaching, or on models to predict contaminant transport through the underlying soil. Annual meetings of all participants as well as separate meetings of the two subprojects were held. The Department of Geosciences in Bremen participated with two subprojects. The aim of the subproject that resulted in this thesis was the development of easily applicable, valid, and generally accepted laboratory methods for the determination of the source strength. In the scope of the second subproject my colleague Veith Becker developed a computer model for the transport prognosis with the source strength as the main input parameter.
Abstract:
There has been an increasing interest in the development of new methods using Pareto optimality to deal with multi-objective criteria (for example, accuracy and time complexity). Once one has developed an approach to a problem of interest, the problem is then how to compare it with the state of the art. In machine learning, algorithms are typically evaluated by comparing their performance on different data sets by means of statistical tests. Standard tests used for this purpose can consider neither multiple performance measures jointly nor multiple competitors at once. The aim of this paper is to resolve these issues by developing statistical procedures that are able to account for multiple competing measures at the same time and to compare multiple algorithms together. In particular, we develop two tests: a frequentist procedure based on the generalized likelihood-ratio test and a Bayesian procedure based on a multinomial-Dirichlet conjugate model. We further extend them by discovering conditional independences among measures to reduce the number of parameters of such models, since the number of cases studied in such comparisons is usually very small. Data from a comparison among general purpose classifiers is used to show a practical application of our tests.
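The conjugate update behind a multinomial-Dirichlet model can be sketched in a few lines. How the paper maps comparison outcomes to categories is not given in the abstract, so the four categories and the counts below are hypothetical, and a symmetric prior is assumed.

```python
def dirichlet_posterior_mean(counts, prior=1.0):
    """Posterior mean of category probabilities under a symmetric Dirichlet prior.

    With prior Dirichlet(prior, ..., prior) and multinomial counts n_i,
    the posterior is Dirichlet(prior + n_i) and its mean is
    (prior + n_i) / sum_j (prior + n_j).
    """
    total = sum(counts) + prior * len(counts)
    return [(count + prior) / total for count in counts]

# Hypothetical outcome counts over 10 data sets for algorithms A and B on two measures:
# [A better on both, A better on accuracy only, A better on time only, B better on both]
counts = [6, 2, 1, 1]
mean = dirichlet_posterior_mean(counts)
print(mean)
```

The prior pulls the estimates toward uniform, which matters when only a handful of data sets are available, the situation the abstract highlights.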
Abstract:
Background
It is generally acknowledged that a functional understanding of a biological system can only be obtained by understanding the collective of molecular interactions in the form of biological networks. Protein networks are a network type of special importance, because proteins form the functional base units of every biological cell. At the mesoscopic level of protein networks, modules are of significant importance because these building blocks may be the next elementary functional level above individual proteins, offering insight into fundamental organizational principles of biological cells.
Results
In this paper, we provide a comparative analysis of five popular and four novel module detection algorithms. We study these module prediction methods for simulated benchmark networks as well as 10 biological protein interaction networks (PINs). A particular focus of our analysis is placed on the biological meaning of the predicted modules by utilizing the Gene Ontology (GO) database as the gold standard for the definition of biological processes. Furthermore, we investigate the robustness of the results by perturbing the PINs, simulating in this way our incomplete knowledge of protein networks.
Conclusions
Overall, our study reveals that there is large heterogeneity among the different module prediction algorithms when one zooms in on the level of biological processes in the form of GO terms, and that all methods are severely affected by a slight perturbation of the networks. However, we also find pathways that are enriched in multiple modules, which could provide important information about the hierarchical organization of the system.