957 results for Imbalanced datasets
Abstract:
Semiconductor fabrication involves several sequential processing steps with the result that critical production variables are often affected by a superposition of effects over multiple steps. In this paper a Virtual Metrology (VM) system for early stage measurement of such variables is presented; the VM system seeks to express the contribution to the output variability that is due to a defined observable part of the production line. The outputs of the proposed system may be used for process monitoring and control purposes. A second contribution of this work is the introduction of Elastic Nets, a regularization and variable selection technique for the modelling of highly-correlated datasets, as a technique for the development of VM models. Elastic Nets and the proposed VM system are illustrated using real data from a multi-stage etch process used in the fabrication of disk drive read/write heads. © 2013 IEEE.
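As a rough illustration of how an Elastic Net combines L1 and L2 penalties, the sketch below fits one by coordinate descent in plain Python. The abstract does not say which solver or parametrization the authors used; the function names, the `alpha`/`l1_ratio` parametrization and the coordinate-descent solver are assumptions made for this example only.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the L1 part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_fit(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Minimise (1/(2n))*||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||_2^2 by cyclic coordinate descent.

    X is a list of rows, y a list of targets; no intercept is fitted,
    so the data are assumed to be centred beforehand.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column cannot carry any weight
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1.0 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w
```

With two identical (perfectly correlated) input columns, the L2 part of the penalty spreads the weight evenly across them instead of picking one arbitrarily, which is the behaviour that makes the technique attractive for highly correlated process data.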
Abstract:
Many graph datasets are labelled with discrete and numeric attributes. Most frequent substructure discovery algorithms ignore numeric attributes; in this paper we show how they can be used to improve search performance and discrimination. Our thesis is that the most descriptive substructures are those which are normative both in terms of their structure and in terms of their numeric values. We explore the relationship between graph structure and the distribution of attribute values and propose an outlier-detection step, which is used as a constraint during substructure discovery. By pruning anomalous vertices and edges, more weight is given to the most descriptive substructures. Our method is applicable to multi-dimensional numeric attributes; we outline how it can be extended for high-dimensional data. We support our findings with experiments on transaction graphs and single large graphs from the domains of physical building security and digital forensics, measuring the effect on runtime, memory requirements and coverage of discovered patterns, relative to the unconstrained approach.
Abstract:
BACKGROUND: Whilst multimorbidity is more prevalent with increasing age, approximately 30% of middle-aged adults (45-64 years) are also affected. Several prescribing criteria have been developed to optimise medication use in older people (≥65 years) with little focus on potentially inappropriate prescribing (PIP) in middle-aged adults. We have developed a set of explicit prescribing criteria called PROMPT (PRescribing Optimally in Middle-aged People's Treatments) which may be applied to prescribing datasets to determine the prevalence of PIP in this age-group.
METHODS: A literature search was conducted to identify published prescribing criteria for all age groups, with the Project Steering Group (convened for this study) adding further criteria for consideration, all of which were reviewed for relevance to middle-aged adults. These criteria underwent a two-round Delphi process, using an expert panel consisting of general practitioners, pharmacists and clinical pharmacologists from the United Kingdom and Republic of Ireland. Using web-based questionnaires, 17 panellists were asked to indicate their level of agreement with each criterion via a 5-point Likert scale (1 = Strongly Disagree, 5 = Strongly Agree) to assess the applicability to middle-aged adults in the absence of clinical information. Criteria were accepted/rejected/revised dependent on the panel's level of agreement using the median response/interquartile range and additional comments.
RESULTS: Thirty-four criteria were rated in the first round of this exercise and consensus was achieved on 17 criteria which were accepted into the PROMPT criteria. Consensus was not reached on the remaining 17, and six criteria were removed following a review of the additional comments. The second round of this exercise focused on the remaining 11 criteria, some of which were revised following the first exercise. Five criteria were accepted from the second round, providing a final list of 22 criteria [gastro-intestinal system (n = 3), cardiovascular system (n = 4), respiratory system (n = 4), central nervous system (n = 6), infections (n = 1), endocrine system (n = 1), musculoskeletal system (n = 2), duplicates (n = 1)].
CONCLUSIONS: PROMPT is the first set of prescribing criteria developed for use in middle-aged adults. The utility of these criteria will be tested in future studies using prescribing datasets.
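The consensus step described in the METHODS paragraph (median response and interquartile range of 5-point Likert ratings) can be sketched as follows. The abstract does not report the cut-offs the panel used, so the thresholds `accept_median`, `reject_median` and `max_iqr`, as well as the function names, are hypothetical placeholders.

```python
from statistics import median, quantiles


def summarise_ratings(ratings):
    """Return (median, interquartile range) of a list of Likert ratings."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return median(ratings), q3 - q1


def classify_criterion(ratings, accept_median=4, reject_median=2, max_iqr=1):
    """Label a criterion from panel ratings.

    The thresholds are illustrative assumptions, not values from the study:
    a narrow IQR is read as agreement among panellists, and the median then
    decides whether that agreement is for or against the criterion.
    """
    med, iqr = summarise_ratings(ratings)
    if iqr <= max_iqr and med >= accept_median:
        return "accept"
    if iqr <= max_iqr and med <= reject_median:
        return "reject"
    return "no consensus"
```

A criterion labelled "no consensus" would, in a process like the one described, be revised or carried forward to the next Delphi round rather than discarded outright.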
Abstract:
Beta diversity describes how local communities within an area or region differ in species composition/abundance. There have been attempts to use changes in beta diversity as a biotic indicator of disturbance, but lack of theory and methodological caveats have hampered progress. We here propose that the neutral theory of biodiversity plus the definition of beta diversity as the total variance of a community matrix provide a suitable, novel, starting point for ecological applications. Observed levels of beta diversity (BD) can be compared to neutral predictions with three possible outcomes: Observed BD equals neutral prediction or is larger (divergence) or smaller (convergence) than the neutral prediction. Disturbance might lead to either divergence or convergence, depending on type and strength. We here apply these ideas to datasets collected on oribatid mites (a key, very diverse soil taxon) under several regimes of disturbances. When disturbance is expected to increase the heterogeneity of soil spatial properties or the sampling strategy encompassed a range of diverging environmental conditions, we observed diverging assemblages. On the contrary, we observed patterns consistent with neutrality when disturbance could determine homogenization of soil properties in space or the sampling strategy encompassed fairly homogeneous areas. With our method, spatial and temporal changes in beta diversity can be directly and easily monitored to detect significant changes in community dynamics, although the method itself cannot inform on underlying mechanisms. However, human-driven disturbances and the spatial scales at which they operate are usually known. In this case, our approach allows the formulation of testable predictions in terms of expected changes in beta diversity, thereby offering a promising monitoring tool.
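To make the "total variance of a community matrix" definition concrete, the sketch below computes it for a sites-by-species table and compares the result with a set of values simulated under neutrality. The Hellinger transformation, the two-sided 5% rule and all function names are assumptions for illustration; the neutral simulations themselves are not implemented here and must be supplied by the caller.

```python
import math


def hellinger_transform(community):
    """Square root of relative abundances per site (a common pre-treatment)."""
    transformed = []
    for row in community:
        total = sum(row)
        transformed.append([math.sqrt(v / total) if total else 0.0 for v in row])
    return transformed


def beta_diversity_total_variance(community):
    """Total variance of a sites-by-species matrix: SS_total / (n - 1)."""
    n, p = len(community), len(community[0])
    means = [sum(row[j] for row in community) / n for j in range(p)]
    ss_total = sum((row[j] - means[j]) ** 2 for row in community for j in range(p))
    return ss_total / (n - 1)


def compare_to_neutral(observed_bd, neutral_bds, alpha=0.05):
    """Classify observed BD against BD values simulated under neutrality."""
    above = sum(1 for v in neutral_bds if v > observed_bd) / len(neutral_bds)
    below = sum(1 for v in neutral_bds if v < observed_bd) / len(neutral_bds)
    if above < alpha / 2:
        return "divergence"  # observed BD exceeds almost all neutral values
    if below < alpha / 2:
        return "convergence"  # observed BD is below almost all neutral values
    return "consistent with neutrality"
```

Two sites that share no species give the largest value this statistic can take after the Hellinger transformation, while identical sites give zero.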
Abstract:
In this study, 137 corn distillers dried grains with solubles (DDGS) samples from a range of different geographical origins (Jilin Province of China, Heilongjiang Province of China, USA and Europe) were collected and analysed. Different near infrared spectrometers combined with different chemometric packages were used in two independent laboratories to investigate the feasibility of classifying geographical origin of DDGS. Based on the same dataset, one laboratory developed a partial least squares discriminant analysis model and another laboratory developed an orthogonal partial least squares discriminant analysis model. Results showed that both models could perfectly classify DDGS samples from different geographical origins. These promising results encourage the development of larger scale efforts to produce datasets which can be used to differentiate the geographical origin of DDGS and such efforts are required to provide higher level food security measures on a global scale.
Abstract:
Data registration refers to a series of techniques for matching or bringing similar objects or datasets together into alignment. These techniques enjoy widespread use in a diverse variety of applications, such as video coding, tracking, object and face detection and recognition, surveillance and satellite imaging, medical image analysis and structure from motion. Registration methods are as numerous as their manifold uses, from pixel level and block or feature based methods to Fourier domain methods.
This book is focused on providing algorithms and image and video techniques for registration and quality performance metrics. The authors provide various assessment metrics for measuring registration quality alongside analyses of registration techniques, introducing and explaining both familiar and state-of-the-art registration methodologies used in a variety of targeted applications.
Key features:
- Provides a state-of-the-art review of image and video registration techniques, allowing readers to develop an understanding of how well the techniques perform by using specific quality assessment criteria
- Addresses a range of applications from familiar image and video processing domains to satellite and medical imaging among others, enabling readers to discover novel methodologies with utility in their own research
- Discusses quality evaluation metrics for each application domain with an interdisciplinary approach from different research perspectives
Abstract:
Classification methods with embedded feature selection capability are very appealing for the analysis of complex processes since they allow the analysis of root causes even when the number of input variables is high. In this work, we investigate the performance of three techniques for classification within a Monte Carlo strategy with the aim of root cause analysis. We consider the naive Bayes classifier and the logistic regression model with two different implementations for controlling model complexity, namely, a LASSO-like implementation with an L1-norm regularization and a fully Bayesian implementation of the logistic model, the so-called relevance vector machine. Several challenges can arise when estimating such models, mainly linked to the characteristics of the data: a large number of input variables, high correlation among subsets of variables, the situation where the number of variables is higher than the number of available data points and the case of unbalanced datasets. Using an ecological and a semiconductor manufacturing dataset, we show advantages and drawbacks of each method, highlighting the superior performance in terms of classification accuracy for the relevance vector machine with respect to the other classifiers. Moreover, we show how the combination of the proposed techniques and the Monte Carlo approach can be used to get more robust insights into the problem under analysis when faced with challenging modelling conditions.
Abstract:
Continuous research endeavors on hard turning (HT), both on machine tools and cutting tools, have made the previously reported daunting limits easily attainable in the modern scenario. This presents an opportunity for a systematic investigation on finding the current attainable limits of hard turning using a CNC turret lathe. Accordingly, this study aims to contribute to the existing literature by providing the latest experimental results of hard turning of AISI 4340 steel (69 HRC) using a CBN cutting tool. An orthogonal array was developed using a set of judiciously chosen cutting parameters. Subsequently, the longitudinal turning trials were carried out in accordance with a well-designed full factorial-based Taguchi matrix. The speculation indeed proved correct as a mirror finished optical quality machined surface (an average surface roughness value of 45 nm) was achieved by the conventional cutting method. Furthermore, Signal-to-noise (S/N) ratio analysis, Analysis of variance (ANOVA), and Multiple regression analysis were carried out on the experimental datasets to assert the dominance of each machining variable in dictating the machined surface roughness and to optimize the machining parameters. One of the key findings was that when the feed rate during hard turning becomes very low (about 0.02 mm/rev), it alone could be the most significant (99.16%) parameter in influencing the machined surface roughness (Ra). It has, however, also been shown that a low feed rate results in high tool wear, so the selection of machining parameters for carrying out hard turning must be governed by a trade-off between the cost and quality considerations.
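For readers unfamiliar with Taguchi S/N analysis, the sketch below shows one way such a ratio is computed for a response like surface roughness, where lower is better. The abstract does not state which S/N formulation the authors applied; the smaller-the-better form, the data layout and the function names here are assumptions.

```python
import math
from collections import defaultdict


def sn_smaller_is_better(values):
    """Taguchi smaller-the-better S/N ratio in dB: -10*log10(mean(y^2))."""
    return -10.0 * math.log10(sum(v * v for v in values) / len(values))


def mean_sn_by_level(trials, factor):
    """Average the S/N ratio over all trials sharing a level of one factor.

    trials is a list of (settings, roughness_values) pairs, where settings
    is a dict such as {"feed": 0.02, "speed": 250} (hypothetical keys).
    """
    grouped = defaultdict(list)
    for settings, roughness_values in trials:
        grouped[settings[factor]].append(sn_smaller_is_better(roughness_values))
    return {level: sum(v) / len(v) for level, v in grouped.items()}
```

The level with the highest mean S/N ratio is the preferred setting for that factor, and the spread of the means across levels gives a first impression of how strongly the factor influences roughness before ANOVA quantifies it.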
Abstract:
Nematode neuropeptide systems comprise an exceptionally complex array of ~250 peptidic signaling molecules that operate within a structurally simple nervous system of ~300 neurons. A relatively complete picture of the neuropeptide complement is available for Caenorhabditis elegans, with 30 flp, 38 ins and 43 nlp genes having been documented; accumulating evidence indicates similar complexity in parasitic nematodes from clades I, III, IV and V. In contrast, the picture for parasitic platyhelminths is less clear, with the limited peptide sequence data available providing concrete evidence for only FMRFamide-like peptide (FLP) and neuropeptide F (NPF) signaling systems, each of which only comprises one or two peptides. With the completion of the Schmidtea mediterranea and Schistosoma mansoni genome projects and expressed sequence tag datasets for other flatworm parasites becoming available, the time is ripe for a detailed reanalysis of neuropeptide signaling in flatworms. Although the actual neuropeptides provide limited obvious value as targets for chemotherapeutic-based control strategies, they do highlight the signaling systems present in these helminths and provide tools for the discovery of more amenable targets such as neuropeptide receptors or neuropeptide processing enzymes. Also, they offer opportunities to evaluate the potential of their associated signaling pathways as targets through RNA interference (RNAi)-based, target validation strategies. Currently, within both helminth phyla, the flp signaling systems appear to merit further investigation as they are intrinsically linked with motor function, a proven target for successful anti-parasitics; it is clear that some nematode NLPs also play a role in motor function and could have similar appeal. At this time, it is unclear if flatworm NPF and nematode INS peptides operate in pathways that have utility for parasite control.
Clearly, RNAi-based validation could be a starting point for scoring potential target pathways within neuropeptide signaling for parasiticide discovery programs. Also, recent successes in the application of in planta-based RNAi control strategies for plant parasitic nematodes reveal a strategy whereby neuropeptide encoding genes could become targets for parasite control. The possibility of developing these approaches for the control of animal and human parasites is intriguing, but will require significant advances in the delivery of RNAi-triggers.
Abstract:
The article examines everyday life in Northern Ireland’s segregated communities and focuses on a neglected empirical dimension of ethnic and social segregation developed within the socio-spatial relations between people and their built environment. It shows how the everyday urban encounters are reproduced through negotiating differences and the ways in which living in divided communities leads to social inequality and imbalanced use of space. The article employed qualitative research methods with individuals and community groups from the Fountain estate, a small Protestant enclave in Derry/Londonderry. Their stories were replete with cases of injustice and insights into the daily struggles that have generally occurred within theories of contact and social segregation as a whole. In fact, people in the Fountain presented their own intertextual references on what was more significant for them as a matter of routine survival and belonging, which allowed them to be more constructive about themselves. While segregation has persisted for multiple decades, time is believed to be the factor most likely to change it, as it is hoped that the younger generation will provide lasting change to Northern Ireland and eventual peace between currently segregated communities.
Abstract:
Biodiversity continues to decline in the face of increasing anthropogenic pressures such as habitat destruction, exploitation, pollution and introduction of alien species. Existing global databases of species' threat status or population time series are dominated by charismatic species. The collation of datasets with broad taxonomic and biogeographic extents, and that support computation of a range of biodiversity indicators, is necessary to enable better understanding of historical declines and to project - and avert - future declines. We describe and assess a new database of more than 1.6 million samples from 78 countries representing over 28,000 species, collated from existing spatial comparisons of local-scale biodiversity exposed to different intensities and types of anthropogenic pressures, from terrestrial sites around the world. The database contains measurements taken in 208 (of 814) ecoregions, 13 (of 14) biomes, 25 (of 35) biodiversity hotspots and 16 (of 17) megadiverse countries. The database contains more than 1% of the total number of all species described, and more than 1% of the described species within many taxonomic groups - including flowering plants, gymnosperms, birds, mammals, reptiles, amphibians, beetles, lepidopterans and hymenopterans. The dataset, which is still being added to, is therefore already considerably larger and more representative than those used by previous quantitative models of biodiversity trends and responses. The database is being assembled as part of the PREDICTS project (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems - http://www.predicts.org.uk). We make site-level summary data available alongside this article. The full database will be publicly available in 2015.
Abstract:
A role for the minichromosome maintenance (MCM) proteins in cancer initiation and progression is slowly emerging. Functioning as a complex to ensure a single chromosomal replication per cell cycle, the six family members have been implicated in several neoplastic disease states, including breast cancer. Our study aimed to investigate the prognostic significance of these proteins in breast cancer. We studied the expression of MCMs in various datasets and the associations of the expression with clinicopathological parameters. When considered alone, high level MCM4 overexpression was only weakly associated with shorter survival in the combined breast cancer patient cohort (n = 1441, Hazard Ratio = 1.31; 95% Confidence Interval = 1.11-1.55; p = 0.001). On the other hand, when we studied all six components of the MCM complex, we found that overexpression of all MCMs was strongly associated with shorter survival in the same cohort (n = 1441, Hazard Ratio = 1.75; 95% Confidence Interval = 1.31-2.34; p < 0.001), suggesting these MCM proteins may cooperate to promote breast cancer progression. Indeed, their expressions were significantly correlated with each other in these cohorts. In addition, we found that an increasing number of overexpressed MCMs was associated with negative ER status as well as treatment response. Together, our findings are reproducible in seven independent breast cancer cohorts, with 1441 patients, and suggest that MCM profiling could potentially be used to predict response to treatment and prognosis in breast cancer patients.
Abstract:
Retrospective clinical datasets are often characterized by a relatively small sample size and many missing data. In this case, a common way of handling the missingness consists in discarding from the analysis patients with missing covariates, further reducing the sample size. Alternatively, if the mechanism that generated the missing data allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction of the sample size and allowing methods to deal with complete data later on. Moreover, methodologies for data imputation might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. The problem of missing data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem by using surrogate splits, that is, splitting rules that use other variables yielding similar results to the original ones. Instead, our methodology consists in modeling the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied on the completed dataset. The Bayesian network is directly learned from the incomplete data using a structural expectation–maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is due to the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation and the imputation so obtained improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).
Abstract:
This paper presents a machine learning approach to sarcasm detection on Twitter in two languages – English and Czech. Although there has been some research in sarcasm detection in languages other than English (e.g., Dutch, Italian, and Brazilian Portuguese), our work is the first attempt at sarcasm detection in the Czech language. We created a large Czech Twitter corpus consisting of 7,000 manually-labeled tweets and provide it to the community. We evaluate two classifiers with various combinations of features on both the Czech and English datasets. Furthermore, we tackle the issues of rich Czech morphology by examining different preprocessing techniques. Experiments show that our language-independent approach significantly outperforms adapted state-of-the-art methods in English (F-measure 0.947) and also represents a strong baseline for further research in Czech (F-measure 0.582).
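Since the results above are reported as F-measures, a short sketch of how that score is derived from a confusion matrix may help. The abstract does not say whether the reported values are for the sarcastic class only or averaged over classes, so the single-class form below, and the function name, are assumptions.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score for one class from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # no correct positive predictions at all
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

This is also why the F-measure is preferred over accuracy on imbalanced data: a classifier that labels every one of 95 non-sarcastic and 5 sarcastic tweets as non-sarcastic is 95% accurate, yet `f_measure(tp=0, fp=0, fn=5)` for the sarcastic class is zero.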
Abstract:
Ellerman Bombs (EBs) are thought to arise as a result of photospheric magnetic reconnection. We use data from the Swedish 1-m Solar Telescope (SST) to study EB events on the solar disk and at the limb. Both datasets show that EBs are connected to the foot-points of forming chromospheric jets. The limb observations show that a bright structure in the H$\alpha$ blue wing connects to the EB initially fuelling it, leading to the ejection of material upwards. The material moves along a loop structure where a newly formed jet is subsequently observed in the red wing of H$\alpha$. In the disk dataset, an EB initiates a jet which propagates away from the apparent reconnection site within the EB flame. The EB then splits into two, with associated brightenings in the inter-granular lanes (IGLs). Micro-jets are then observed, extending to 500 km with a lifetime of a few minutes. Observed velocities of the micro-jets are approximately 5-10 km s$^{-1}$, while their chromospheric counterparts range from 50-80 km s$^{-1}$. MURaM simulations of quiet Sun reconnection show that micro-jets with similar properties to that of the observations follow the line of reconnection in the photosphere, with associated H$\alpha$ brightening at the location of increased temperature.