858 resultados para optimization-based similarity reasoning
Resumo:
The goal of the work reported here is to capture the commonsense knowledge of non-expert human contributors. Achieving this goal will enable more intelligent human-computer interfaces and pave the way for computers to reason about our world. In the domain of natural language processing, it will provide the world knowledge much needed for semantic processing of natural language. To acquire knowledge from contributors not trained in knowledge engineering, I take the following four steps: (i) develop a knowledge representation (KR) model for simple assertions in natural language, (ii) introduce cumulative analogy, a class of nearest-neighbor based analogical reasoning algorithms over this representation, (iii) argue that cumulative analogy is well suited for knowledge acquisition (KA) based on a theoretical analysis of effectiveness of KA with this approach, and (iv) test the KR model and the effectiveness of the cumulative analogy algorithms empirically. To investigate effectiveness of cumulative analogy for KA empirically, Learner, an open source system for KA by cumulative analogy has been implemented, deployed, and evaluated. (The site "1001 Questions," is available at http://teach-computers.org/learner.html). Learner acquires assertion-level knowledge by constructing shallow semantic analogies between a KA topic and its nearest neighbors and posing these analogies as natural language questions to human contributors. Suppose, for example, that based on the knowledge about "newspapers" already present in the knowledge base, Learner judges "newspaper" to be similar to "book" and "magazine." Further suppose that assertions "books contain information" and "magazines contain information" are also already in the knowledge base. Then Learner will use cumulative analogy from the similar topics to ask humans whether "newspapers contain information." Because similarity between topics is computed based on what is already known about them, Learner exhibits bootstrapping behavior --- the quality of its questions improves as it gathers more knowledge. By summing evidence for and against posing any given question, Learner also exhibits noise tolerance, limiting the effect of incorrect similarities. The KA power of shallow semantic analogy from nearest neighbors is one of the main findings of this thesis. I perform an analysis of commonsense knowledge collected by another research effort that did not rely on analogical reasoning and demonstrate that indeed there is sufficient amount of correlation in the knowledge base to motivate using cumulative analogy from nearest neighbors as a KA method. Empirically, evaluating the percentages of questions answered affirmatively, negatively and judged to be nonsensical in the cumulative analogy case compares favorably with the baseline, no-similarity case that relies on random objects rather than nearest neighbors. Of the questions generated by cumulative analogy, contributors answered 45% affirmatively, 28% negatively and marked 13% as nonsensical; in the control, no-similarity case 8% of questions were answered affirmatively, 60% negatively and 26% were marked as nonsensical.
Resumo:
This paper describes a novel template-based meshing approach for generating good quality quadrilateral meshes from 2D digital images. This approach builds upon an existing image-based mesh generation technique called Imeshp, which enables us to create a segmented triangle mesh from an image without the need for an image segmentation step. Our approach generates a quadrilateral mesh using an indirect scheme, which converts the segmented triangle mesh created by the initial steps of the Imesh technique into a quadrilateral one. The triangle-to-quadrilateral conversion makes use of template meshes of triangles. To ensure good element quality, the conversion step is followed by a smoothing step, which is based on a new optimization-based procedure. We show several examples of meshes generated by our approach, and present a thorough experimental evaluation of the quality of the meshes given as examples.
Resumo:
Since the beginning, some pattern recognition techniques have faced the problem of high computational burden for dataset learning. Among the most widely used techniques, we may highlight Support Vector Machines (SVM), which have obtained very promising results for data classification. However, this classifier requires an expensive training phase, which is dominated by a parameter optimization that aims to make SVM less prone to errors over the training set. In this paper, we model the problem of finding such parameters as a metaheuristic-based optimization task, which is performed through Harmony Search (HS) and some of its variants. The experimental results have showen the robustness of HS-based approaches for such task in comparison against with an exhaustive (grid) search, and also a Particle Swarm Optimization-based implementation.
Resumo:
Genome predictions based on selected genes would be a very welcome approach for taxonomic studies, including DNA-DNA similarity, G+C content and representative phylogeny of bacteria. At present, DNA-DNA hybridizations are still considered the gold standard in species descriptions. However, this method is time-consuming and troublesome, and datasets can vary significantly between experiments as well as between laboratories. For the same reasons, full matrix hybridizations are rarely performed, weakening the significance of the results obtained. The authors established a universal sequencing approach for the three genes recN, rpoA and thdF for the Pasteurellaceae, and determined if the sequences could be used for predicting DNA-DNA relatedness within the family. The sequence-based similarity values calculated using a previously published formula proved most useful for species and genus separation, indicating that this method provides better resolution and no experimental variation compared to hybridization. By this method, cross-comparisons within the family over species and genus borders easily become possible. The three genes also serve as an indicator of the genome G+C content of a species. A mean divergence of around 1 % was observed from the classical method, which in itself has poor reproducibility. Finally, the three genes can be used alone or in combination with already-established 16S rRNA, rpoB and infB gene-sequencing strategies in a multisequence-based phylogeny for the family Pasteurellaceae. It is proposed to use the three sequences as a taxonomic tool, replacing DNA-DNA hybridization.
Resumo:
The new user cold start issue represents a serious problem in recommender systems as it can lead to the loss of new users who decide to stop using the system due to the lack of accuracy in the recommenda- tions received in that first stage in which they have not yet cast a significant number of votes with which to feed the recommender system?s collaborative filtering core. For this reason it is particularly important to design new similarity metrics which provide greater precision in the results offered to users who have cast few votes. This paper presents a new similarity measure perfected using optimization based on neu- ral learning, which exceeds the best results obtained with current metrics. The metric has been tested on the Netflix and Movielens databases, obtaining important improvements in the measures of accuracy, precision and recall when applied to new user cold start situations. The paper includes the mathematical formalization describing how to obtain the main quality measures of a recommender system using leave- one-out cross validation.
Resumo:
Probabilistic modeling is the de�ning characteristic of estimation of distribution algorithms (EDAs) which determines their behavior and performance in optimization. Regularization is a well-known statistical technique used for obtaining an improved model by reducing the generalization error of estimation, especially in high-dimensional problems. `1-regularization is a type of this technique with the appealing variable selection property which results in sparse model estimations. In this thesis, we study the use of regularization techniques for model learning in EDAs. Several methods for regularized model estimation in continuous domains based on a Gaussian distribution assumption are presented, and analyzed from di�erent aspects when used for optimization in a high-dimensional setting, where the population size of EDA has a logarithmic scale with respect to the number of variables. The optimization results obtained for a number of continuous problems with an increasing number of variables show that the proposed EDA based on regularized model estimation performs a more robust optimization, and is able to achieve signi�cantly better results for larger dimensions than other Gaussian-based EDAs. We also propose a method for learning a marginally factorized Gaussian Markov random �eld model using regularization techniques and a clustering algorithm. The experimental results show notable optimization performance on continuous additively decomposable problems when using this model estimation method. Our study also covers multi-objective optimization and we propose joint probabilistic modeling of variables and objectives in EDAs based on Bayesian networks, speci�cally models inspired from multi-dimensional Bayesian network classi�ers. It is shown that with this approach to modeling, two new types of relationships are encoded in the estimated models in addition to the variable relationships captured in other EDAs: objectivevariable and objective-objective relationships. An extensive experimental study shows the e�ectiveness of this approach for multi- and many-objective optimization. With the proposed joint variable-objective modeling, in addition to the Pareto set approximation, the algorithm is also able to obtain an estimation of the multi-objective problem structure. Finally, the study of multi-objective optimization based on joint probabilistic modeling is extended to noisy domains, where the noise in objective values is represented by intervals. A new version of the Pareto dominance relation for ordering the solutions in these problems, namely �-degree Pareto dominance, is introduced and its properties are analyzed. We show that the ranking methods based on this dominance relation can result in competitive performance of EDAs with respect to the quality of the approximated Pareto sets. This dominance relation is then used together with a method for joint probabilistic modeling based on `1-regularization for multi-objective feature subset selection in classi�cation, where six di�erent measures of accuracy are considered as objectives with interval values. The individual assessment of the proposed joint probabilistic modeling and solution ranking methods on datasets with small-medium dimensionality, when using two di�erent Bayesian classi�ers, shows that comparable or better Pareto sets of feature subsets are approximated in comparison to standard methods.
Resumo:
This paper focuses on the general problem of coordinating multiple robots. More specifically, it addresses the self-election of heterogeneous specialized tasks by autonomous robots. In this paper we focus on a specifically distributed or decentralized approach as we are particularly interested on decentralized solution where the robots themselves autonomously and in an individual manner, are responsible of selecting a particular task so that all the existing tasks are optimally distributed and executed. In this regard, we have established an experimental scenario to solve the corresponding multi-tasks distribution problem and we propose a solution using two different approaches by applying Ant Colony Optimization-based deterministic algorithms as well as Learning Automata-based probabilistic algorithms. We have evaluated the robustness of the algorithm, perturbing the number of pending loads to simulate the robot’s error in estimating the real number of pending tasks and also the dynamic generation of loads through time. The paper ends with a critical discussion of experimental results.
Resumo:
Urban growth and change presents numerous challenges for planners and policy makers. Effective and appropriate strategies for managing growth and change must address issues of social, environmental and economic sustainability. Doing so in practical terms is a difficult task given the uncertainty associated with likely growth trends not to mention the uncertainty associated with how social and environmental structures will respond to such change. An optimization based approach is developed for evaluating growth and change based upon spatial restrictions and impact thresholds. The spatial optimization model is integrated with a cellular automata growth simulation process. Application results are presented and discussed with respect to possible growth scenarios in south east Queensland, Australia.
Resumo:
Short text messages a.k.a Microposts (e.g. Tweets) have proven to be an effective channel for revealing information about trends and events, ranging from those related to Disaster (e.g. hurricane Sandy) to those related to Violence (e.g. Egyptian revolution). Being informed about such events as they occur could be extremely important to authorities and emergency professionals by allowing such parties to immediately respond. In this work we study the problem of topic classification (TC) of Microposts, which aims to automatically classify short messages based on the subject(s) discussed in them. The accurate TC of Microposts however is a challenging task since the limited number of tokens in a post often implies a lack of sufficient contextual information. In order to provide contextual information to Microposts, we present and evaluate several graph structures surrounding concepts present in linked knowledge sources (KSs). Traditional TC techniques enrich the content of Microposts with features extracted only from the Microposts content. In contrast our approach relies on the generation of different weighted semantic meta-graphs extracted from linked KSs. We introduce a new semantic graph, called category meta-graph. This novel meta-graph provides a more fine grained categorisation of concepts providing a set of novel semantic features. Our findings show that such category meta-graph features effectively improve the performance of a topic classifier of Microposts. Furthermore our goal is also to understand which semantic feature contributes to the performance of a topic classifier. For this reason we propose an approach for automatic estimation of accuracy loss of a topic classifier on new, unseen Microposts. We introduce and evaluate novel topic similarity measures, which capture the similarity between the KS documents and Microposts at a conceptual level, considering the enriched representation of these documents. Extensive evaluation in the context of Emergency Response (ER) and Violence Detection (VD) revealed that our approach outperforms previous approaches using single KS without linked data and Twitter data only up to 31.4% in terms of F1 measure. Our main findings indicate that the new category graph contains useful information for TC and achieves comparable results to previously used semantic graphs. Furthermore our results also indicate that the accuracy of a topic classifier can be accurately predicted using the enhanced text representation, outperforming previous approaches considering content-based similarity measures. © 2014 Elsevier B.V. All rights reserved.
Resumo:
A High-Performance Computing job dispatcher is a critical software that assigns the finite computing resources to submitted jobs. This resource assignment over time is known as the on-line job dispatching problem in HPC systems. The fact the problem is on-line means that solutions must be computed in real-time, and their required time cannot exceed some threshold to do not affect the normal system functioning. In addition, a job dispatcher must deal with a lot of uncertainty: submission times, the number of requested resources, and duration of jobs. Heuristic-based techniques have been broadly used in HPC systems, at the cost of achieving (sub-)optimal solutions in a short time. However, the scheduling and resource allocation components are separated, thus generates a decoupled decision that may cause a performance loss. Optimization-based techniques are less used for this problem, although they can significantly improve the performance of HPC systems at the expense of higher computation time. Nowadays, HPC systems are being used for modern applications, such as big data analytics and predictive model building, that employ, in general, many short jobs. However, this information is unknown at dispatching time, and job dispatchers need to process large numbers of them quickly while ensuring high Quality-of-Service (QoS) levels. Constraint Programming (CP) has been shown to be an effective approach to tackle job dispatching problems. However, state-of-the-art CP-based job dispatchers are unable to satisfy the challenges of on-line dispatching, such as generate dispatching decisions in a brief period and integrate current and past information of the housing system. Given the previous reasons, we propose CP-based dispatchers that are more suitable for HPC systems running modern applications, generating on-line dispatching decisions in a proper time and are able to make effective use of job duration predictions to improve QoS levels, especially for workloads dominated by short jobs.
Resumo:
Trabalho Final de Mestrado para obtenção do grau de Mestre em Engenharia de Electrónica e Telecomunicações
Resumo:
To meet the increasing demands of the complex inter-organizational processes and the demand for continuous innovation and internationalization, it is evident that new forms of organisation are being adopted, fostering more intensive collaboration processes and sharing of resources, in what can be called collaborative networks (Camarinha-Matos, 2006:03). Information and knowledge are crucial resources in collaborative networks, being their management fundamental processes to optimize. Knowledge organisation and collaboration systems are thus important instruments for the success of collaborative networks of organisations having been researched in the last decade in the areas of computer science, information science, management sciences, terminology and linguistics. Nevertheless, research in this area didn’t give much attention to multilingual contexts of collaboration, which pose specific and challenging problems. It is then clear that access to and representation of knowledge will happen more and more on a multilingual setting which implies the overcoming of difficulties inherent to the presence of multiple languages, through the use of processes like localization of ontologies. Although localization, like other processes that involve multilingualism, is a rather well-developed practice and its methodologies and tools fruitfully employed by the language industry in the development and adaptation of multilingual content, it has not yet been sufficiently explored as an element of support to the development of knowledge representations - in particular ontologies - expressed in more than one language. Multilingual knowledge representation is then an open research area calling for cross-contributions from knowledge engineering, terminology, ontology engineering, cognitive sciences, computational linguistics, natural language processing, and management sciences. This workshop joined researchers interested in multilingual knowledge representation, in a multidisciplinary environment to debate the possibilities of cross-fertilization between knowledge engineering, terminology, ontology engineering, cognitive sciences, computational linguistics, natural language processing, and management sciences applied to contexts where multilingualism continuously creates new and demanding challenges to current knowledge representation methods and techniques. In this workshop six papers dealing with different approaches to multilingual knowledge representation are presented, most of them describing tools, approaches and results obtained in the development of ongoing projects. In the first case, Andrés Domínguez Burgos, Koen Kerremansa and Rita Temmerman present a software module that is part of a workbench for terminological and ontological mining, Termontospider, a wiki crawler that aims at optimally traverse Wikipedia in search of domainspecific texts for extracting terminological and ontological information. The crawler is part of a tool suite for automatically developing multilingual termontological databases, i.e. ontologicallyunderpinned multilingual terminological databases. In this paper the authors describe the basic principles behind the crawler and summarized the research setting in which the tool is currently tested. In the second paper, Fumiko Kano presents a work comparing four feature-based similarity measures derived from cognitive sciences. The purpose of the comparative analysis presented by the author is to verify the potentially most effective model that can be applied for mapping independent ontologies in a culturally influenced domain. For that, datasets based on standardized pre-defined feature dimensions and values, which are obtainable from the UNESCO Institute for Statistics (UIS) have been used for the comparative analysis of the similarity measures. The purpose of the comparison is to verify the similarity measures based on the objectively developed datasets. According to the author the results demonstrate that the Bayesian Model of Generalization provides for the most effective cognitive model for identifying the most similar corresponding concepts existing for a targeted socio-cultural community. In another presentation, Thierry Declerck, Hans-Ulrich Krieger and Dagmar Gromann present an ongoing work and propose an approach to automatic extraction of information from multilingual financial Web resources, to provide candidate terms for building ontology elements or instances of ontology concepts. The authors present a complementary approach to the direct localization/translation of ontology labels, by acquiring terminologies through the access and harvesting of multilingual Web presences of structured information providers in the field of finance, leading to both the detection of candidate terms in various multilingual sources in the financial domain that can be used not only as labels of ontology classes and properties but also for the possible generation of (multilingual) domain ontologies themselves. In the next paper, Manuel Silva, António Lucas Soares and Rute Costa claim that despite the availability of tools, resources and techniques aimed at the construction of ontological artifacts, developing a shared conceptualization of a given reality still raises questions about the principles and methods that support the initial phases of conceptualization. These questions become, according to the authors, more complex when the conceptualization occurs in a multilingual setting. To tackle these issues the authors present a collaborative platform – conceptME - where terminological and knowledge representation processes support domain experts throughout a conceptualization framework, allowing the inclusion of multilingual data as a way to promote knowledge sharing and enhance conceptualization and support a multilingual ontology specification. In another presentation Frieda Steurs and Hendrik J. Kockaert present us TermWise, a large project dealing with legal terminology and phraseology for the Belgian public services, i.e. the translation office of the ministry of justice, a project which aims at developing an advanced tool including expert knowledge in the algorithms that extract specialized language from textual data (legal documents) and whose outcome is a knowledge database including Dutch/French equivalents for legal concepts, enriched with the phraseology related to the terms under discussion. Finally, Deborah Grbac, Luca Losito, Andrea Sada and Paolo Sirito report on the preliminary results of a pilot project currently ongoing at UCSC Central Library, where they propose to adapt to subject librarians, employed in large and multilingual Academic Institutions, the model used by translators working within European Union Institutions. The authors are using User Experience (UX) Analysis in order to provide subject librarians with a visual support, by means of “ontology tables” depicting conceptual linking and connections of words with concepts presented according to their semantic and linguistic meaning. The organizers hope that the selection of papers presented here will be of interest to a broad audience, and will be a starting point for further discussion and cooperation.
Resumo:
Background: CYP2D6 is the key enzyme responsible for tamoxifen bioactivation mainly into endoxifen. This gene is highly polymorphic and breast cancer patients classified as CYP2D6 poor metabolizers (PM) or intermediate metabolizers (IM) appear to show low concentrations of endoxifen and to achieve less benefit from tamoxifen treatment. Purpose: This prospective, open-label trial aimed to assess how the increase of tamoxifen dose influences the level of endoxifen in the different genotype groups (poor-, intermediate-, and extensive-metabolizers (EM)). We examined the impact of doubling tamoxifen dose to 20mg twice daily on endoxifen plasma concentrations across these genotype groups. Patients and methods: Patients were assayed for CYP2D6 genotype and phenotype using dextromethorphan test. Tamoxifen, N-desmethyltamoxifen, 4-hydroxytamoxifen and endoxifen plasma levels were determined on 2 occasions at baseline (20mg/day of tamoxifen) and at day 30, 90 and 120 after dose increase (20 mg twice daily) using liquid chromatography-tandem-mass spectrometry. Endoxifen plasma levels were measured 6 to 24 hours after last drug intake to evaluate its accumulation before and after doubling tamoxifen dosage. ANOVA was used to evaluate endoxifen levels increase and difference between genotype groups. Results: 63 patients are available for analysis to date. Tamoxifen, N-desmethyltamoxifen, 4-hydroxytamoxifen and endoxifen plasma reached steady state at 30 day after tamoxifen dose escalation, with a significant increase compared to baseline by 1.6 to 1.8 fold : geometric mean plasma concentrations (CV %) were 140 ng/mL (45%) at baseline vs 255 (47%) at day 30 for tamoxifen (P < 0.0001); 256 (49%) vs 408 (64%) for N-desmethyltamoxifen (P < 0.0001); 2.4 (46%) vs 3.9 (51%) for 4-OH-tamoxifen (P < 0.0001); and 20 (91%) vs 33 (91%) for endoxifen (P < 0.02). On baseline, endoxifen levels tended to be lower in PM: 7 ng/mL (36%), than IM: 16 ng/mL (70%), P=0.08, and EM: 24 ng/mL (71%), P<0.001. After doubling tamoxifen dosage, endoxifen concentrations rose similarly in PM, IM and EM with respectively, 1.5 (18%), 1.5 (28%) and 1.7 (30%) fold increase from baseline, P=0.18. Conclusion: Endoxifen exposure varies widely under standard tamoxifen dosage, with CYP2D6 genotype explaining only a minor part of this variability. It increases consistently on doubling tamoxifen dose, similarly across genotypes. This would enable exposure optimization based on concentration monitoring.
Resumo:
A method for determining soil hydraulic properties of a weathered tropical soil (Oxisol) using a medium-sized column with undisturbed soil is presented. The method was used to determine fitting parameters of the water retention curve and hydraulic conductivity functions of a soil column in support of a pesticide leaching study. The soil column was extracted from a continuously-used research plot in Central Oahu (Hawaii, USA) and its internal structure was examined by computed tomography. The experiment was based on tension infiltration into the soil column with free outflow at the lower end. Water flow through the soil core was mathematically modeled using a computer code that numerically solves the one-dimensional Richards equation. Measured soil hydraulic parameters were used for direct simulation, and the retention and soil hydraulic parameters were estimated by inverse modeling. The inverse modeling produced very good agreement between model outputs and measured flux and pressure head data for the relatively homogeneous column. The moisture content at a given pressure from the retention curve measured directly in small soil samples was lower than that obtained through parameter optimization based on experiments using a medium-sized undisturbed soil column.
Resumo:
The quality of environmental data analysis and propagation of errors are heavily affected by the representativity of the initial sampling design [CRE 93, DEU 97, KAN 04a, LEN 06, MUL07]. Geostatistical methods such as kriging are related to field samples, whose spatial distribution is crucial for the correct detection of the phenomena. Literature about the design of environmental monitoring networks (MN) is widespread and several interesting books have recently been published [GRU 06, LEN 06, MUL 07] in order to clarify the basic principles of spatial sampling design (monitoring networks optimization) based on Support Vector Machines was proposed. Nonetheless, modelers often receive real data coming from environmental monitoring networks that suffer from problems of non-homogenity (clustering). Clustering can be related to the preferential sampling or to the impossibility of reaching certain regions.