29 results for Imbalanced datasets


Abstract:

The Australian Soil Resources Information System (ASRIS) database compiles the best publicly available information from Commonwealth, State, and Territory agencies into a national database of soil profile data, digital soil and land resources maps, and climate, terrain, and lithology datasets. These datasets are described in detail in this paper. Most datasets are thematic grids that cover the intensively used agricultural zones in Australia.

Abstract:

Background: A major goal in the post-genomic era is to identify and characterise disease susceptibility genes and to apply this knowledge to disease prevention and treatment. Rodents and humans have remarkably similar genomes and share closely related biochemical, physiological and pathological pathways. In this work we utilised the latest information on the mouse transcriptome as revealed by the RIKEN FANTOM2 project to identify novel human disease-related candidate genes. We define a new term, patholog, to mean a homolog of a human disease-related gene encoding a product (transcript, anti-sense or protein) potentially relevant to disease. Rather than focus only on Mendelian inheritance, we applied the analysis to all potential pathologs regardless of their inheritance pattern. Results: Bioinformatic analysis and human curation of 60,770 RIKEN full-length mouse cDNA clones produced 2,578 sequences that showed similarity (70-85% identity) to known human-disease genes. Using a newly developed biological information extraction and annotation tool (FACTS) in parallel with human expert analysis of 17,051 MEDLINE scientific abstracts, we identified 182 novel potential pathologs. Of these, 36 were identified by computational tools only, 49 by human expert analysis only and 97 by both methods. These pathologs were related to neoplastic (53%), hereditary (24%), immunological (5%), cardio-vascular (4%), or other (14%) disorders. Conclusions: Large scale genome projects continue to produce a vast amount of data with potential application to the study of human disease. For this potential to be realised we need intelligent strategies for data categorisation and the ability to link sequence data with relevant literature. This paper demonstrates the power of combining human expert annotation with FACTS, a newly developed bioinformatics tool, to identify novel pathologs from within large-scale mouse transcript datasets.

Abstract:

1. Cluster analysis of reference sites with similar biota is the initial step in creating River Invertebrate Prediction and Classification System (RIVPACS) and similar river bioassessment models such as Australian River Assessment System (AUSRIVAS). This paper describes and tests an alternative prediction method, Assessment by Nearest Neighbour Analysis (ANNA), based on the same philosophy as RIVPACS and AUSRIVAS but without the grouping step that some people view as artificial. 2. The steps in creating ANNA models are: (i) weighting the predictor variables using a multivariate approach analogous to principal axis correlations, (ii) calculating the weighted Euclidean distance from a test site to the reference sites based on the environmental predictors, (iii) predicting the faunal composition based on the nearest reference sites and (iv) calculating an observed/expected (O/E) ratio analogous to RIVPACS/AUSRIVAS. 3. The paper compares AUSRIVAS and ANNA models on 17 datasets representing a variety of habitats and seasons. First, it examines each model's regressions for observed versus expected number of taxa, including the r², intercept and slope. Second, the two models' assessments of 79 test sites in New Zealand are compared. Third, the models are compared on test and presumed reference sites along a known trace metal gradient. Fourth, ANNA models are evaluated for Western Australia, a geographically distinct region of Australia. The comparisons demonstrate that ANNA and AUSRIVAS are generally equivalent in performance, although ANNA is potentially more robust for the O versus E regressions and potentially more accurate on the trace metal gradient sites. 4. The ANNA method is recommended for use in bioassessment of rivers, at least for corroborating the results of the well established AUSRIVAS- and RIVPACS-type models, if not to replace them.
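
Steps (ii) to (iv) of the ANNA procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the predictor weights are simply supplied rather than derived by the multivariate weighting of step (i), and the neighbour count `k`, the unweighted averaging over neighbours, the 0.5 probability threshold, and all site data are assumptions made for illustration.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two vectors of environmental predictors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def taxon_probabilities(test_env, ref_envs, ref_presence, weights, k=3):
    """Estimate each taxon's occurrence probability at the test site as its
    frequency among the k nearest reference sites (assumed: unweighted mean)."""
    order = sorted(range(len(ref_envs)),
                   key=lambda i: weighted_distance(test_env, ref_envs[i], weights))
    nearest = order[:k]
    n_taxa = len(ref_presence[0])
    return [sum(ref_presence[i][t] for i in nearest) / k for t in range(n_taxa)]

def observed_over_expected(observed, probs, threshold=0.5):
    """O/E ratio restricted to taxa predicted at or above the threshold (assumed 0.5)."""
    expected = sum(p for p in probs if p >= threshold)
    seen = sum(1 for o, p in zip(observed, probs) if o and p >= threshold)
    return seen / expected

# Hypothetical data: two predictors per site, presence/absence of three taxa.
ref_envs = [(1.0, 10.0), (1.2, 11.0), (5.0, 30.0), (0.9, 9.5)]
ref_presence = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
weights = (1.0, 0.1)          # stand-in for the weights from step (i)
test_env = (1.0, 10.0)
test_observed = [1, 0, 0]

probs = taxon_probabilities(test_env, ref_envs, ref_presence, weights)
oe = observed_over_expected(test_observed, probs)
print(probs, oe)
```

An O/E value well below 1 would suggest that taxa expected from similar reference sites are missing at the test site.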

Abstract:

Phylogenetic hypotheses are presented for Pultenaea based on cpDNA (trnL-F and ndhF) and nrDNA (ITS) sequence data. Pultenaea, as it is currently circumscribed, comprises six strongly supported lineages whose relationships with each other and 18 closely related genera are weak or conflicting among datasets. The lack of resolution among the six Pultenaea clades and their relatives appears to be the result of a rapid radiation, which is evident in molecular data from both the chloroplast and nuclear genomes. The molecular data provide no support for the monophyly of Pultenaea as it currently stands. Given these results, Pultenaea could be split into many smaller genera. We prefer the taxonomically stable alternative of subsuming all 19 genera currently recognised in Pultenaea sensu lato (= the Mirbelia group) into an expanded concept of Pultenaea that would comprise approximately 470 species.

Abstract:

Formal Concept Analysis is an unsupervised machine learning technique that has successfully been applied to document organisation by considering documents as objects and keywords as attributes. The basic algorithms of Formal Concept Analysis then allow an intelligent information retrieval system to cluster documents according to keyword views. This paper investigates the scalability of this idea. In particular, we present the results of applying spatial data structures to large datasets in Formal Concept Analysis. Our experiments are motivated by the application of Formal Concept Analysis to a virtual filesystem [11,17,15], in particular the libferris [1] Semantic File System. This paper presents customizations to an index structure based on an RD-Tree Generalized Index Search Tree to better support the application of Formal Concept Analysis to large data sources.
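
To make the documents-as-objects, keywords-as-attributes idea concrete, the following Python sketch enumerates the formal concepts of a tiny context. It is a naive illustration that tries every subset of documents, so it is exponential and does not reflect the RD-Tree index the paper describes; the file names and keywords are hypothetical.

```python
from itertools import combinations

def common_keywords(docs, context):
    """Keywords shared by every document in docs (all keywords if docs is empty)."""
    shared = set().union(*context.values())
    for doc in docs:
        shared &= context[doc]
    return shared

def matching_docs(keywords, context):
    """Documents that carry every keyword in the given set."""
    return {doc for doc, kws in context.items() if keywords <= kws}

def formal_concepts(context):
    """Return all (extent, intent) pairs by closing every subset of documents."""
    concepts = set()
    docs = list(context)
    for size in range(len(docs) + 1):
        for subset in combinations(docs, size):
            intent = common_keywords(subset, context)
            extent = matching_docs(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

# Hypothetical context: three documents tagged with keywords.
context = {
    "a.txt": {"soil", "map"},
    "b.txt": {"soil", "climate"},
    "c.txt": {"soil", "map", "climate"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Each concept pairs a maximal set of documents with the full set of keywords they share, which is what lets a retrieval system present documents grouped by keyword views.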

Abstract:

We inferred the phylogeny of 33 species of ticks from the subfamilies Rhipicephalinae and Hyalomminae from analyses of nuclear and mitochondrial DNA and morphology. We used nucleotide sequences from 12S rRNA, cytochrome c oxidase I, internal transcribed spacer 2 of the nuclear rRNA, and 18S rRNA. Nucleotide sequences and morphology were analyzed separately and together in a total-evidence analysis. Analyses of the five partitions together (3303 characters) gave the best-resolved and the best-supported hypothesis so far for the phylogeny of ticks in the Rhipicephalinae and Hyalomminae, despite the fact that some partitions did not have data for some taxa. However, most of the hidden conflict (lower support in the total-evidence analyses compared to that in the individual analyses) was found in those partitions that had taxa without data. The partitions with complete taxonomic sampling had more hidden support (higher support in the total-evidence analyses compared to that in the separate-partition analyses) than hidden conflict. Mapping of geographic origins of ticks onto our phylogeny indicates an African origin for the Rhipicephalinae sensu lato (i.e., including Hyalomma spp.), the Rhipicephalus-Boophilus lineage, the Dermacentor-Anocentor lineage, and the Rhipicephalus-Boophilus-Nosomma-Hyalomma-Rhipicentor lineage. The Nosomma-Hyalomma lineage appears to have evolved in Asia. Our total-evidence phylogeny indicates that (i) the genus Rhipicephalus is paraphyletic with respect to the genus Boophilus, (ii) the genus Dermacentor is paraphyletic with respect to the genus Anocentor, and (iii) some subgenera of the genera Hyalomma and Rhipicephalus are paraphyletic with respect to other subgenera in these genera. Study of the Rhipicephalinae and Hyalomminae over the last 7 years has shown that analyses of individual datasets (e.g., one gene or morphology) seldom resolve many phylogenetic relationships, but analyses of more than one dataset can generate well-resolved phylogenies for these ticks. (C) 2001 Academic Press.

Abstract:

The 16S rRNA gene (16S rDNA) is currently the most widely used gene for estimating the evolutionary history of prokaryotes. To date, there are more than 30 000 16S rDNA sequences available from the core databases, GenBank, EMBL and DDBJ. This great number may cause a dilemma when composing datasets for phylogenetic analysis, since the choice and number of reference organisms are known to affect the resulting tree topology. A group of sequences appearing monophyletic in one dataset may not be so in another. This can be especially problematic when establishing the relationships of distantly related sequences at the division (phylum) level. In this study, a multiple-outgroup approach to resolving division-level phylogenetic relationships is suggested using 16S rDNA data. The approach is illustrated by two case studies concerning the monophyly of two recently proposed bacterial divisions, OP9 and OP10.

Abstract:

1. Schizophrenia is a chronic, disabling brain disease that affects approximately 1% of the world's population. It is characterized by delusions, hallucinations and formal thought disorder, together with a decline in socio-occupational functioning. While the causes of schizophrenia remain unknown, evidence from family, twin and adoption studies clearly demonstrates that it aggregates in families, with this clustering largely attributable to genetic rather than cultural or environmental factors. Identifying the genes involved, however, has proven to be a difficult task because schizophrenia is a complex trait characterized by an imprecise phenotype, the existence of phenocopies and the presence of low disease penetrance. 2. The current working hypothesis for schizophrenia causation is that multiple genes of small to moderate effect confer compounding risk through interactions with each other and with non-genetic risk factors. The same genes may be commonly involved in conferring risk across populations or they may vary in number and strength between different populations. To search for evidence of such genetic loci, both candidate gene and genome-wide linkage studies have been used in clinical cohorts collected from a variety of populations. Collectively, these works provide some evidence for the involvement of a number of specific genes (e.g. the 5-hydroxytryptamine (5-HT) type 2a receptor (5-HT2a) gene and the dopamine D-3 receptor gene) and as yet unidentified factors localized to specific chromosomal regions, including 6p, 6q, 8p, 13q and 22q. These data provide suggestive, but not conclusive, evidence for causative genes. 3. To enable further progress there is a need to: (i) collect fine-grained clinical datasets while searching the schizophrenia phenotype for subgroups or dimensions that may provide a more direct route to causative genes; and (ii) integrate recent refinements in molecular genetic technology, including modern composite marker maps, DNA expression assays and relevant animal models, while using the latest analytical techniques to extract maximum information in order to help distinguish a true result from a false-positive finding.

Abstract:

Molecular evolution has been considered to be essentially a stochastic process, little influenced by the pace of phenotypic change. This assumption was challenged by a study that demonstrated an association between rates of morphological and molecular change estimated for total-evidence phylogenies, a finding that led some researchers to challenge molecular date estimates of major evolutionary radiations. Here we show that Omland's (1997) result is probably due to methodological bias, particularly phylogenetic nonindependence, rather than being indicative of an underlying evolutionary phenomenon. We apply three new methods specifically designed to overcome phylogenetic bias to 13 published phylogenetic datasets for vertebrate taxa, each of which includes both morphological characters and DNA sequence data. We find no evidence of an association between rates of molecular and morphological change.

Abstract:

The isotope composition of Pb is difficult to determine accurately due to the lack of a stable normalisation ratio. Double and triple-spike addition techniques provide one solution and presently yield the most accurate measurements. A number of recent studies have claimed that improved accuracy and precision could also be achieved by multi-collector ICP-MS (MC-ICP-MS) Pb-isotope analysis using the addition of Tl of known isotope composition to Pb samples. In this paper, we verify whether the known isotope composition of Tl can be used for correction of mass discrimination of Pb with an extensive dataset for the NIST standard SRM 981, comparison of MC-ICP-MS with TIMS data, and comparison with three isochrons from different geological environments. When all our NIST SRM 981 data are normalised with one constant Tl-205/Tl-203 of 2.38869, the following averages and reproducibilities were obtained: Pb-207/Pb-206 = 0.91461+/-18; Pb-208/Pb-206 = 2.1674+/-7; and Pb-206/Pb-204 = 16.941+/-6. These two-sigma standard deviations of the mean correspond to 149, 330, and 374 ppm, respectively. Accuracies relative to triple-spike values are 149, 157, and 52 ppm, respectively, and thus well within uncertainties. The largest component of the uncertainties stems from the Pb data alone and is not caused by differential mass discrimination behaviour of Pb and Tl. In routine operation, variation of sample introduction memory and production of isobaric molecular interferences in the spectrometer's collision cell currently appear to be the ultimate limitation to better reproducibility. Comparative study of five different datasets from actual samples (bullets, international rock standards, carbonates, metamorphic minerals, and sulphide minerals) demonstrates that in most cases geological scatter of the sample exceeds the achieved analytical reproducibility. We observe good agreement between TIMS and MC-ICP-MS data for international rock standards but find that such comparison does not constitute the ultimate test for the validity of the MC-ICP-MS technique. Two attempted isochrons resulted in geological scatter (in one case small) in excess of analytical reproducibility. However, in one case (leached Great Dyke sulphides) we obtained a true isochron (MSWD = 0.63) age of 2578.3 +/- 0.9 Ma, which is identical to and more precise than a recently published U-Pb zircon age (2579 +/- 3 Ma) for a Great Dyke websterite [Earth Planet. Sci. Lett. 180 (2000) 1-12]. We regard reproducibility of this age by means of an isochron as a robust test of accuracy over a wide dynamic range. We show that reliable and accurate Pb-isotope data can be obtained by careful operation of second-generation MC-ICP magnetic sector mass spectrometers. (C) 2002 Elsevier Science B.V. All rights reserved.
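
The Tl-based correction of mass discrimination can be sketched in Python as follows. The abstract does not state which fractionation law was used, so the exponential law here is an assumption, as is the premise that Pb and Tl share one fractionation exponent. The atomic masses are approximate values entered for illustration, and the "measured" ratios are synthetic, generated by applying a made-up bias to the reference values.

```python
import math

# Approximate atomic masses in u (illustrative values, not from the abstract).
MASS = {
    "Tl203": 202.9723, "Tl205": 204.9744,
    "Pb204": 203.9730, "Pb206": 205.9745,
    "Pb207": 206.9759, "Pb208": 207.9767,
}
TL_REFERENCE = 2.38869  # Tl-205/Tl-203 normalisation value quoted in the abstract

def fractionation_exponent(tl_measured, tl_reference=TL_REFERENCE):
    """Exponent beta such that tl_reference = tl_measured * (m205/m203) ** beta."""
    return math.log(tl_reference / tl_measured) / math.log(MASS["Tl205"] / MASS["Tl203"])

def correct_ratio(measured, numerator, denominator, beta):
    """Apply the exponential law to a measured Pb ratio, assuming Pb shares Tl's beta."""
    return measured * (MASS[numerator] / MASS[denominator]) ** beta

# Synthetic run: bias the reference values with a made-up exponent, then recover them.
beta_true = -1.5
tl_measured = TL_REFERENCE / (MASS["Tl205"] / MASS["Tl203"]) ** beta_true
pb_true = 0.91461  # Pb-207/Pb-206 average quoted in the abstract
pb_measured = pb_true / (MASS["Pb207"] / MASS["Pb206"]) ** beta_true

beta = fractionation_exponent(tl_measured)
pb_corrected = correct_ratio(pb_measured, "Pb207", "Pb206", beta)
print(beta, pb_corrected)
```

Because the synthetic data obey the same law used for correction, the true ratio is recovered exactly; with real measurements, any difference in behaviour between Pb and Tl would show up as residual inaccuracy, which is the question the paper examines.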

Abstract:

This paper proposes a template for modelling complex datasets that integrates traditional statistical modelling approaches with more recent advances in statistics and modelling through an exploratory framework. Our approach builds on the well-known and long-standing idea of 'good practice in statistics' by establishing a comprehensive framework for modelling that focuses on exploration, prediction, interpretation and reliability assessment, a relatively new idea that allows individual assessment of predictions. The integrated framework we present comprises two stages. The first involves the use of exploratory methods to help visually understand the data and identify a parsimonious set of explanatory variables. The second encompasses a two-step modelling process, in which non-parametric methods such as decision trees and generalized additive models are used to identify important variables and their modelling relationship with the response before a final predictive model is considered. We focus on fitting the predictive model using parametric, non-parametric and Bayesian approaches. This paper is motivated by a medical problem where interest focuses on developing a risk stratification system for morbidity of 1,710 cardiac patients given a suite of demographic, clinical and preoperative variables. Although the methods we use are applied specifically to this case study, these methods can be applied across any field, irrespective of the type of response.

Abstract:

Signal peptides and transmembrane helices both contain a stretch of hydrophobic amino acids. This common feature makes it difficult for signal peptide and transmembrane helix predictors to correctly assign identity to stretches of hydrophobic residues near the N-terminal methionine of a protein sequence. The inability to reliably distinguish between N-terminal transmembrane helix and signal peptide is an error with serious consequences for the prediction of protein secretory status or transmembrane topology. In this study, we report a new method for differentiating protein N-terminal signal peptides and transmembrane helices. Based on the sequence features extracted from hydrophobic regions (amino acid frequency, hydrophobicity, and the start position), we set up discriminant functions and examined them on non-redundant datasets with jackknife tests. This method can incorporate other signal peptide prediction methods and achieve higher prediction accuracy. For Gram-negative bacterial proteins, 95.7% of N-terminal signal peptides and transmembrane helices can be correctly predicted (coefficient 0.90). Given a sensitivity of 90%, transmembrane helices can be identified from signal peptides with a precision of 99% (coefficient 0.92). For eukaryotic proteins, 94.2% of N-terminal signal peptides and transmembrane helices can be correctly predicted with coefficient 0.83. Given a sensitivity of 90%, transmembrane helices can be identified from signal peptides with a precision of 87% (coefficient 0.85). The method can be used to complement current transmembrane protein prediction and signal peptide prediction methods to improve their prediction accuracies. (C) 2003 Elsevier Inc. All rights reserved.
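
The evaluation protocol described above, a jackknife (leave-one-out) test scored by sensitivity, precision and a correlation coefficient, can be sketched in Python. This is not the authors' method: a nearest-centroid rule stands in for their discriminant functions, the "coefficient" is assumed to be the Matthews correlation coefficient, and the two features per sequence (mean hydrophobicity and start position of the hydrophobic region) carry invented values.

```python
import math

def nearest_centroid_predict(train, query):
    """Return the label whose feature centroid is closest to the query."""
    sums, counts = {}, {}
    for feats, label in train:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - q) ** 2 for s, q in zip(sums[label], query))

    return min(sums, key=squared_distance)

def jackknife(data):
    """Leave each sample out in turn, train on the rest, and predict the held-out one."""
    return [nearest_centroid_predict(data[:i] + data[i + 1:], feats)
            for i, (feats, _) in enumerate(data)]

def scores(labels, preds, positive):
    """Sensitivity, precision and Matthews correlation for the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for l, p in pairs if l == positive and p == positive)
    tn = sum(1 for l, p in pairs if l != positive and p != positive)
    fp = sum(1 for l, p in pairs if l != positive and p == positive)
    fn = sum(1 for l, p in pairs if l == positive and p != positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, precision, mcc

# Invented features: (mean hydrophobicity, start position); TM = helix, SP = signal peptide.
data = [((2.8, 12), "TM"), ((3.0, 15), "TM"), ((2.9, 10), "TM"),
        ((1.9, 3), "SP"), ((2.1, 4), "SP"), ((2.0, 2), "SP")]
labels = [label for _, label in data]
preds = jackknife(data)
sensitivity, precision, mcc = scores(labels, preds, positive="TM")
print(preds, sensitivity, precision, mcc)
```

Holding out one sequence at a time keeps the test sample out of training, which matters when the non-redundant datasets are small.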

Abstract:

This paper describes a process-based metapopulation dynamics and phenology model of prickly acacia, Acacia nilotica, an invasive alien species in Australia. The model, SPAnDX, describes the interactions between riparian and upland sub-populations of A. nilotica within livestock paddocks, including the effects of extrinsic factors such as temperature, soil moisture availability and atmospheric concentrations of carbon dioxide. The model includes the effects of management events such as changing the livestock species or stocking rate, applying fire, and herbicide application. The predicted population behaviour of A. nilotica was sensitive to climate. Using 35 years of daily weather datasets for five representative sites spanning the range of conditions under which A. nilotica is found in Australia, the model predicted biomass levels that closely accord with expected values at each site. SPAnDX can be used as a decision-support tool in integrated weed management, and to explore the sensitivity of cultural management practices to climate change throughout the range of A. nilotica. The cohort-based DYMEX modelling package used to build and run SPAnDX provided several advantages over more traditional population modelling approaches (e.g. an appropriate specific formalism (discrete time, cohort-based, process-oriented), user-friendly graphical environment, extensible library of reusable components, and useful and flexible input/output support framework). (C) 2003 Published by Elsevier Science B.V.

Abstract:

Translabial ultrasound is increasingly being used for the assessment of women presenting with pelvic floor dysfunction and incontinence (1,2). However, there is little information on normal values for bladder neck descent, with the two available studies disagreeing widely (3,4). No data have so far been published on mobility of the central and posterior compartment, which can now also be assessed by ultrasound (5). This study presents normal values for urethral, bladder, cervical and rectal mobility in a cohort of young, stress continent, nulliparous nonpregnant women. Methods: A total of 118 nonpregnant nulliparous Caucasian women between 18 and 23 years of age were recruited for an ongoing twin study of pelvic floor function. Translabial ultrasound assessment of pelvic organ mobility was undertaken supine and after bladder emptying (6,7). The best of at least three effective Valsalva manoeuvres was used for evaluation, with no attempts at standardization of Valsalva pressure. Parameters of anterior compartment mobility were obtained by the use of on-screen calipers; cervical and rectal descent were evaluated on printouts. All examinations were carried out under direct supervision of the first author or by personnel trained by him for at least 100 consecutive assessments. Results: The median age of participants in this study was 20 (range 18-23). Mean body mass index was 23 (range 16.9-36.7). Of 118 women, two were completely unable to perform a Valsalva manoeuvre despite repeated efforts at teaching and were excluded from analysis, as were ten women who complained of urinary stress incontinence, leaving 106 datasets. Average measurements for the parameters ‘retrovesical angle at rest’ (RVA-R) and on Valsalva (RVA-S), urethral rotation, bladder neck mobility, cystocele descent, cervical descent and descent of the rectal ampulla are given in Table 1.