126 results for Imbalanced datasets
Abstract:
The effect of varying the datasets used in the modelling of the Ni-like Gd x-ray laser (XRL) is examined through the 1.5D hydro-atomic code, EHYBRID. Two atomic datasets, including energy levels and radiative and collisional excitation rates, are used as input data for the code. It is found that the behaviour of the XRL is somewhat different from what might be expected from superficial examination of the atomic data. The similarities in the gain profiles at low densities are found to have encouraging implications in our attempts to model XRLs.
Abstract:
Spectral signal intensities, especially in 'real-world' applications with nonstandardized sample presentation due to uncontrolled variables/factors, commonly require additional spectral processing to normalize signal intensity in an effective way. In this study, we have demonstrated the complexity of choosing a normalization routine in the presence of multiple spectrally distinct constituents by probing a dataset of Raman spectra. Variation in absolute signal intensity (90.1% of total variance) of the Raman spectra of these complex biological samples swamps the variation in useful signals (9.4% of total variance), degrading their diagnostic and evaluative potential.
Abstract:
The problem of learning from imbalanced data is of critical importance in a large number of application domains and can be a bottleneck in the performance of various conventional learning methods that assume the data distribution to be balanced. The class imbalance problem arises when one class massively outnumbers the other. If imbalanced data are used directly, the imbalance between the majority and minority classes biases machine learning methods and produces unreliable outcomes. There has been increasing interest in this research area and a number of algorithms have been developed. However, independent evaluation of the algorithms is limited. This paper aims at evaluating the performance of five representative data sampling methods, namely SMOTE, ADASYN, BorderlineSMOTE, SMOTETomek and RUSBoost, that deal with class imbalance problems. A comparative study is conducted and the performance of each method is critically analysed in terms of assessment metrics. © 2013 Springer-Verlag.
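As a rough illustration of the oversampling idea behind SMOTE-style methods, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a minimal standard-library sketch, not the reference SMOTE implementation and not the evaluation setup of the paper; the function name `smote_sketch`, the toy data and the parameter defaults are assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours.
    (Hypothetical helper for illustration only.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy imbalanced data (assumed): 50 majority points, 5 minority points.
majority = [(float(x), float(y)) for x in range(10) for y in range(5)]
minority = [(20.0, 20.0), (21.0, 20.5), (20.5, 22.0), (22.0, 21.0), (21.5, 21.5)]

# Oversample the minority class until both classes have the same size.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class. Variants such as BorderlineSMOTE and ADASYN differ mainly in which minority points they choose to interpolate from.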
Abstract:
BACKGROUND: While the discovery of new drugs is a complex, lengthy and costly process, identifying new uses for existing drugs is a cost-effective approach to therapeutic discovery. Connectivity mapping integrates gene expression profiling with advanced algorithms to connect genes, diseases and small molecule compounds and has been applied in a large number of studies to identify potential drugs, particularly to facilitate drug repurposing. Colorectal cancer (CRC) is a commonly diagnosed cancer with high mortality rates, presenting a worldwide health problem. With the advancement of high throughput omics technologies, a number of large scale gene expression profiling studies have been conducted on CRCs, providing multiple datasets in gene expression data repositories. In this work, we systematically apply gene expression connectivity mapping to multiple CRC datasets to identify candidate therapeutics for this disease.
RESULTS: We developed a robust method to compile a combined gene signature for colorectal cancer across multiple datasets. Connectivity mapping analysis with this signature of 148 genes identified 10 candidate compounds, including irinotecan and etoposide, which are chemotherapy drugs currently used to treat CRCs. These results indicate that we have discovered high quality connections between the CRC disease state and the candidate compounds, and that the gene signature we created may be used as a potential therapeutic target in treating the disease. The method we proposed is highly effective in generating a quality gene signature from multiple datasets; the publication of the combined CRC gene signature and the list of candidate compounds from this work will benefit both the cancer and systems biology research communities for further development and investigations.
Abstract:
This review paper discusses the use of Tellus and Tellus Border soil and stream geochemistry data to investigate the relationship between medical data and naturally occurring background levels of potentially toxic elements (PTEs) such as heavy metals in soils and water. The research hypothesis is that long-term low level oral exposure to PTEs via soil and water may result in cumulative exposures that may act as risk factors for progressive diseases including cancer and chronic kidney disease. A number of public policy implications for regional human health risk assessments, public health policy and education are also explored alongside the argument for better integration of multiple data sets to enhance ongoing medical and social research. This work presents a partnership between the School of Geography, Archaeology and Palaeoecology, Northern Ireland Cancer Registry, Queen’s University Belfast, and the nephrology (kidney medicine) research group.
Abstract:
Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters and outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms toward the outlined goal of building interpretable and reconfigurable cluster models. They generate clusters with associated rules that are composed of conditions on word occurrences or nonoccurrences. The proposed approaches vary in the complexity of the format of the rules; RGC employs disjunctions and conjunctions in rule generation whereas RGC-D rules are simple disjunctions of conditions signifying presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easy to interpret, whereas the former leads to more accurate clustering. We show that our approaches outperform the unsupervised decision tree approach for rule-generating clustering and also an approach we provide for generating interpretable models for general clusterings, both by significant margins. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
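To make the idea of rule-defined clusters concrete, the sketch below assigns documents to clusters using simple disjunctive word-presence rules and then computes cluster purity. It does not reproduce RGC or RGC-D, which generate the rules automatically; here the rules, the documents, the labels and the helper names `cluster_by_rules` and `purity` are all hand-written assumptions for illustration.

```python
from collections import Counter


def cluster_by_rules(docs, rules):
    """Assign each document to every cluster whose rule it satisfies.
    A rule is a set of words read as a disjunction: the document matches
    if it contains at least one of them. (Hypothetical helper.)"""
    clusters = {name: [] for name in rules}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        for name, rule_words in rules.items():
            if words & rule_words:
                clusters[name].append(doc_id)
    return clusters


def purity(clusters, labels):
    """Fraction of clustered documents that carry the majority label
    of the cluster they were placed in."""
    total = sum(len(members) for members in clusters.values())
    majority_hits = sum(
        Counter(labels[d] for d in members).most_common(1)[0][1]
        for members in clusters.values()
        if members
    )
    return majority_hits / total


# Toy corpus and ground-truth labels (assumed for this example).
docs = {
    "d1": "the striker scored a late goal",
    "d2": "the goalkeeper saved a penalty",
    "d3": "parliament passed the budget vote",
    "d4": "the minister lost the vote",
    "d5": "the club budget funds a new stadium",
}
labels = {"d1": "sport", "d2": "sport", "d3": "politics",
          "d4": "politics", "d5": "sport"}

# Hand-written disjunctive rules: "goal OR penalty", "vote OR budget".
rules = {"sport": {"goal", "penalty"}, "politics": {"vote", "budget"}}

clusters = cluster_by_rules(docs, rules)
print(clusters)
print(purity(clusters, labels))
```

In this toy case the document about a club budget satisfies the "vote OR budget" rule even though it is labelled as sport, which lowers purity. This is the kind of accuracy cost that simple, readable rules can carry.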
Abstract:
Supply Chain Simulation (SCS) is applied to acquire information to support outsourcing decisions but obtaining enough detail in key parameters can often be a barrier to making well informed decisions.
One aspect of SCS that has been relatively unexplored is the impact of inaccurate data around delays within the SC. The impact of the magnitude and variability of process cycle time on typical performance indicators in a SC context is studied.
System cycle time, WIP levels and throughput are more sensitive to the magnitude of deterministic deviations in process cycle time than variable deviations. Manufacturing costs are not very sensitive to these deviations.
Future opportunities include investigating the impact of process failure or product defects, including logistics and transportation between SC members and using alternative costing methodologies.
Abstract:
Blood-brain barrier (BBB) breakdown, demonstrable in vivo by enhanced MRI, is characteristic of new and expanding inflammatory lesions in relapsing remitting and chronic progressive multiple sclerosis (MS). Subtle leakage may also occur in primary progressive MS. However, the anatomical route(s) of BBB leakage have not been demonstrated. We investigated the possible involvement of interendothelial tight junctions (TJ) by examining the expression of TJ proteins (occludin and ZO-1) in blood vessels in active MS lesions from 8 cases of MS and in normal-appearing white matter (NAWM) from 6 cases. Blood vessels (10-50 per frozen section) were scanned using confocal laser scanning microscopy to acquire datasets for analysis. TJ abnormalities, manifested as beading, interruption, absence or diffuse cytoplasmic localization of fluorescence, or separation of junctions (putative opening), were frequent (affecting 40% of vessels) in oil red-O-positive active plaques but less frequent in NAWM (15%), and in normal (
Abstract:
The definitive paper by Stuiver and Polach (1977) established the conventions for reporting of 14C data for chronological and geophysical studies based on the radioactive decay of 14C in the sample since the year of sample death or formation. Several ways of reporting 14C activity levels relative to a standard were also established, but no specific instructions were given for reporting nuclear weapons testing (post-bomb) 14C levels in samples. Because the use of post-bomb 14C is becoming more prevalent in forensics, biology, and geosciences, a convention needs to be adopted. We advocate the use of fraction modern with a new symbol F14C to prevent confusion with the previously used Fm, which may or may not have been fractionation corrected. We also discuss the calibration of post-bomb 14C samples and the available datasets and compilations, but do not give a recommendation for a particular dataset.
Abstract:
We have previously published intermediate to high resolution spectroscopic observations of approximately 80 early B-type main-sequence stars situated in 19 Galactic open clusters/associations with Galactocentric distances distributed over 6 ≤ R_g ≤ 18 kpc. This current study collates and re-analyses these equivalent-width datasets using LTE and non-LTE model atmosphere techniques, in order to determine the stellar atmospheric parameters and abundance estimates for C, N, O, Mg, Al and Si. The latter should be representative of the present-day Galactic interstellar medium. Our extensive observational dataset permits the identification of sub-samples of stars with similar atmospheric parameters and of homogeneous subsets of lines. As such, this investigation represents the most extensive and systematic study of its kind to date. We conclude that the distribution of light elements (C, O, Mg & Si) in the Galactic disk can be represented by a linear, radial gradient of -0.07 +/- 0.01 dex kpc^-1. Our results for nitrogen and oxygen, viz. -0.09 +/- 0.01 dex kpc^-1 and -0.067 +/- 0.008 dex kpc^-1, are in excellent agreement with those found from the study of HII regions. We have also examined our datasets for evidence of an abrupt discontinuity in the metallicity of the Galactic disk near a Galactocentric distance of 10 kpc (see Twarog et al. 1997). However, there is no evidence to suggest that our data would be better fitted with a two-zone model. Moreover, we observe an N/O gradient of -0.04 +/- 0.02 dex kpc^-1, which is consistent with that found for other spiral galaxies (Vila-Costas & Edmunds 1993).