847 results for Classification Methods
Abstract:
Background: International data on child maltreatment are largely derived from child protection agencies, and predominantly report only substantiated cases of child maltreatment. This approach underestimates the incidence of maltreatment and makes inter-jurisdictional comparisons difficult. There has been a growing recognition of the importance of health professionals in identifying, documenting and reporting suspected child maltreatment. This study aimed to describe the issues around case identification using coded morbidity data, outline methods for selecting and grouping relevant codes, and illustrate patterns of maltreatment identified. Methods: A comprehensive review of the ICD-10-AM classification system was undertaken, including review of index terms, a free text search of tabular volumes, and a review of coding standards pertaining to child maltreatment coding. Identified codes were further categorised into maltreatment types including physical abuse, sexual abuse, emotional or psychological abuse, and neglect. Using these code groupings, one year of Australian hospitalisation data for children under 18 years of age was examined to quantify the proportion of patients identified and to explore the characteristics of cases assigned maltreatment-related codes. Results: Fewer than 0.5% of children hospitalised in Australia between 2005 and 2006 had a maltreatment code assigned; the proportion was almost 4% among children with a principal diagnosis of a mental and behavioural disorder and over 1% among children with an injury or poisoning as the principal diagnosis. The patterns among children assigned definitive T74 codes varied by sex and age group. For males selected as having a maltreatment-related presentation, physical abuse was most commonly coded (62.6% of maltreatment cases), while for females selected as having a maltreatment-related presentation, sexual abuse was the most commonly assigned form of maltreatment (52.9% of maltreatment cases).
Conclusion: This study has demonstrated that hospital data could provide valuable information for routine monitoring and surveillance of child maltreatment, even in the absence of population-based linked data sources. With national and international calls for a public health response to child maltreatment, better understanding of, investment in and utilisation of our core national routinely collected data sources will enhance the evidence-base needed to support an appropriate response to children at risk.
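The grouping of codes into maltreatment types described in the Methods can be illustrated with a minimal sketch. It assumes the standard ICD-10 fourth-character subdivisions of T74 (T74.0 neglect or abandonment, T74.1 physical abuse, T74.2 sexual abuse, T74.3 psychological abuse); the study's full code set is broader and is not listed in the abstract, so the mapping table and the function name `maltreatment_types` are hypothetical.

```python
# Assumed mapping: standard ICD-10 T74 subcategories only. The study's
# complete ICD-10-AM code groupings are not given in the abstract.
T74_GROUPS = {
    "T740": "neglect",
    "T741": "physical abuse",
    "T742": "sexual abuse",
    "T743": "emotional or psychological abuse",
}


def maltreatment_types(codes):
    """Return the set of maltreatment types found among a record's diagnosis codes."""
    found = set()
    for code in codes:
        # Normalise "t74.1", "T741" and " T74.1 " to the same four-character key.
        key = code.replace(".", "").strip().upper()[:4]
        if key in T74_GROUPS:
            found.add(T74_GROUPS[key])
    return found


if __name__ == "__main__":
    record = ["S06.0", "T74.1"]  # hypothetical record: head injury plus a T74 code
    print(maltreatment_types(record))
```

A record with no T74 code yields an empty set, which is how the majority of hospitalisations would be classified under this sketch.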
Abstract:
Background: The Current Population Survey (CPS) and the American Time Use Survey (ATUS) use the 2002 census occupation system to classify workers into 509 separate occupations arranged into 22 major occupational categories. Methods: We describe the methods and rationale for assigning detailed MET estimates to occupations and present population estimates (comparing outputs generated by analysis of previously published summary MET estimates to the detailed MET estimates) of intensities of occupational activity using the 2003 ATUS data comprising 20,720 respondents, 5,323 (2,917 males and 2,406 females) of whom reported working 6+ hours at their primary occupation on their assigned reporting day. Results: Analysis using the summary MET estimates resulted in 4% more workers in sedentary occupations, 6% more in light, 7% fewer in moderate, and 3% fewer in vigorous compared to using the detailed MET estimates. The detailed estimates are more sensitive in identifying individuals who do any occupational activity that is moderate or vigorous in intensity, resulting in fewer workers in sedentary and light intensity occupations. Conclusions: Since CPS/ATUS regularly captures occupation data, it will be possible to track the prevalence of the different intensity levels of occupations. Updates will be required with inevitable adjustments to future occupational classification systems.
Abstract:
Workflow nets, a particular class of Petri nets, have become one of the standard ways to model and analyze workflows. Typically, they are used as an abstraction of the workflow that is used to check the so-called soundness property. This property guarantees the absence of livelocks, deadlocks, and other anomalies that can be detected without domain knowledge. Several authors have proposed alternative notions of soundness and have suggested using more expressive languages, e.g., models with cancellations or priorities. This paper provides an overview of the different notions of soundness and investigates these in the presence of different extensions of workflow nets. We will show that the eight soundness notions described in the literature are decidable for workflow nets. However, most extensions will make all of these notions undecidable. These new results show the theoretical limits of workflow verification. Moreover, we discuss some of the analysis approaches described in the literature.
Abstract:
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0–1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function—that it satisfies a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise, and show that in this case, strictly convex loss functions lead to faster rates of convergence of the risk than would be implied by standard uniform convergence arguments. Finally, we present applications of our results to the estimation of convergence rates in function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
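To make the idea of a convex surrogate concrete, here is a minimal sketch comparing the 0-1 loss with three commonly used surrogates as functions of the margin m = y·f(x). The choice of hinge, exponential and base-2 logistic losses, and the convention that a margin of exactly zero counts as an error, are assumptions made for illustration; the abstract does not list specific losses beyond naming the support vector machine and boosting.

```python
import math


def zero_one(m):
    """0-1 loss on the margin m = y * f(x); m == 0 is counted as an error (assumed convention)."""
    return 1.0 if m <= 0 else 0.0


def hinge(m):
    """Hinge loss, the surrogate associated with support vector machines."""
    return max(0.0, 1.0 - m)


def exponential(m):
    """Exponential loss, the surrogate associated with AdaBoost-style boosting."""
    return math.exp(-m)


def logistic(m):
    """Logistic loss scaled to base 2 so that it equals 1 at m == 0."""
    return math.log2(1.0 + math.exp(-m))


if __name__ == "__main__":
    print(f"{'margin':>7} {'0-1':>6} {'hinge':>6} {'exp':>8} {'logistic':>9}")
    for k in range(-4, 5):
        m = k / 2.0
        print(f"{m:7.1f} {zero_one(m):6.1f} {hinge(m):6.2f} "
              f"{exponential(m):8.3f} {logistic(m):9.3f}")
```

Each of these three surrogates is convex in m and lies on or above the 0-1 loss at every margin, which is what makes minimising them a tractable stand-in for minimising classification error.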
Abstract:
One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
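The margin definition given in this abstract can be written out directly. The sketch below uses unweighted votes, exactly as the abstract words it; the function name `voting_margin` and the example labels are hypothetical, and a weighted or normalised variant would need a small change.

```python
from collections import Counter


def voting_margin(votes, true_label):
    """Margin of one example under a voting rule.

    Number of votes for the correct label minus the largest number of
    votes received by any single incorrect label.
    """
    counts = Counter(votes)
    correct = counts.get(true_label, 0)
    # default=0 covers the case where every vote went to the correct label.
    most_wrong = max(
        (n for label, n in counts.items() if label != true_label),
        default=0,
    )
    return correct - most_wrong


if __name__ == "__main__":
    votes = ["cat", "cat", "dog", "cat", "bird"]  # five hypothetical base classifiers
    print(voting_margin(votes, "cat"))  # 3 correct votes minus 1 for the best wrong label
    print(voting_margin(votes, "dog"))  # 1 correct vote minus 3 for "cat"
```

A positive margin means the example is classified correctly by the vote, and a larger margin means it is classified correctly with more room to spare.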
Abstract:
Background: Strategies for cancer reduction and management are targeted at both individual and area levels. Area-level strategies require careful understanding of geographic differences in cancer incidence, in particular the association with factors such as socioeconomic status, ethnicity and accessibility. This study aimed to identify the complex interplay of area-level factors associated with high area-specific incidence of Australian priority cancers using a classification and regression tree (CART) approach. Methods: Area-specific smoothed standardised incidence ratios were estimated for priority-area cancers across 478 statistical local areas in Queensland, Australia (1998-2007, n=186,075). For those cancers with significant spatial variation, CART models were used to identify whether area-level accessibility, socioeconomic status and ethnicity were associated with high area-specific incidence. Results: The accessibility of a person’s residence had the most consistent association with the risk of cancer diagnosis across the specific cancers. Many cancers were likely to have high incidence in more urban areas, although male lung cancer and cervical cancer tended to have high incidence in more remote areas. The impact of socioeconomic status and ethnicity on these associations differed by type of cancer. Conclusions: These results highlight the complex interactions between accessibility, socioeconomic status and ethnicity in determining cancer incidence risk.
Abstract:
Follicle classification is an important aid to the understanding of follicular development and atresia. Some bovine primordial follicles have the classical primordial shape, but ellipsoidal shaped follicles with some cuboidal granulosa cells at the poles are far more common. Preantral follicles have one of two basal lamina phenotypes, either a single aligned layer or one with additional layers. In antral follicles <5 mm diameter, half of the healthy follicles have columnar shaped basal granulosa cells and additional layers of basal lamina, which appear as loops in cross section (‘loopy’). The remainder have aligned single-layered follicular basal laminas with rounded basal cells, and contain better quality oocytes than the loopy/columnar follicles. In sizes >5 mm, only aligned/rounded phenotypes are present. Dominant and subordinate follicles can be identified by ultrasound and/or histological examination of pairs of ovaries. Atretic follicles <5 mm are either basal atretic or antral atretic, named on the basis of the location in the membrana granulosa where cells die first. Basal atretic follicles have considerable biological differences from antral atretic follicles. In follicles >5 mm, only antral atresia is observed. The concentrations of follicular fluid steroid hormones can be used to classify atresia and distinguish some of the different types of atresia; however, this method is unlikely to identify follicles early in atresia, and hence may misclassify them as healthy. Other biochemical and histological methods can be used, but since cell death is a part of normal homoeostasis, deciding when a follicle has entered atresia remains somewhat subjective.
Abstract:
In this paper, we describe the main processes and operations in mining industries and present a comprehensive survey of operations research methodologies that have been applied over the last several decades. The literature review is classified into four main categories: mine design; mine production; mine transportation; and mine evaluation. Mine design models are further separated according to two main mining methods: open-pit and underground. Moreover, mine production models are subcategorised into two groups: ore mining and coal mining. Mine transportation models are further partitioned in accordance with fleet management, truck haulage and train scheduling. Mine evaluation models are further subdivided into four clusters in terms of mining method selection, quality control, financial risks and environmental protection. The main characteristics of four Australian commercial mining software packages are described and compared. This paper bridges the gaps in the literature and motivates researchers to develop more applicable, realistic and comprehensive operations research models and solution techniques that are directly linked with mining industries.
Abstract:
It is a big challenge to acquire correct user profiles for personalized text classification since users may be unsure when describing their interests. Traditional approaches to user profiling adopt machine learning (ML) to automatically discover classification knowledge from explicit user feedback describing personal interests. However, the accuracy of ML-based methods cannot be significantly improved in many cases due to the term independence assumption and the uncertainties associated with them. This paper presents a novel relevance feedback approach for personalized text classification. It applies data mining to discover knowledge from relevant and non-relevant text and constrains specific knowledge with reasoning rules to eliminate some conflicting information. We also developed a Dempster-Shafer (DS) approach as the means to utilise the specific knowledge to build high-quality data models for classification. Experiments conducted on Reuters Corpus Volume 1 and TREC topics show that the proposed technique achieves encouraging performance compared with state-of-the-art relevance feedback models.
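As background for the Dempster-Shafer component, the following is a minimal sketch of Dempster's rule of combination over a two-element frame {relevant, non-relevant}. It shows the general rule only; how the paper derives its mass functions from mined patterns is not stated in the abstract, so the two evidence sources and their numbers below are invented for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass, and the
    masses of one function sum to 1. Mass falling on an empty intersection
    is treated as conflict and removed by renormalisation.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


if __name__ == "__main__":
    REL = frozenset({"relevant"})
    NON = frozenset({"non-relevant"})
    EITHER = REL | NON  # mass on the whole frame expresses ignorance

    # Hypothetical evidence from two sources about one document.
    source_a = {REL: 0.6, EITHER: 0.4}
    source_b = {REL: 0.5, NON: 0.3, EITHER: 0.2}

    for hypothesis, mass in combine(source_a, source_b).items():
        print(sorted(hypothesis), round(mass, 4))
```

Keeping some mass on the whole frame lets a source say "I do not know" without committing to either label, which is the feature that distinguishes this scheme from a plain product of probabilities.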
Abstract:
Load in distribution networks is normally measured at the 11kV supply points; little or no information is known about the type of customers and their contributions to the load. This paper proposes statistical methods to decompose an unknown distribution feeder load into its customer load sector/subsector profiles. The approach used in this paper should assist electricity suppliers in economic load management, strategic planning and future network reinforcements.
Abstract:
Bridges are currently rated individually for maintenance and repair action according to the structural conditions of their elements. Dealing with thousands of bridges and the many factors that cause deterioration makes this rating process extremely complicated. The current simplified but practical methods are not accurate enough. On the other hand, the sophisticated, more accurate methods are only used for a single or particular bridge type. It is therefore necessary to develop a practical and accurate rating system for a network of bridges. The first and most important step in achieving this aim is to classify bridges based on the differences in nature and the unique characteristics of the critical factors and the relationship between them, for a network of bridges. Critical factors and vulnerable elements will be identified and placed in different categories. This classification method will be used to develop a new practical rating method for a network of railway bridges based on criticality and vulnerability analysis. This rating system will be more accurate and economical as well as improve the safety and serviceability of railway bridges.
Abstract:
The detection and correction of defects remains among the most time consuming and expensive aspects of software development. Extensive automated testing and code inspections may mitigate their effect, but some code fragments are necessarily more likely to be faulty than others, and automated identification of fault prone modules helps to focus testing and inspections, thus limiting wasted effort and potentially improving detection rates. However, software metrics data is often extremely noisy, with enormous imbalances in the size of the positive and negative classes. In this work, we present a new approach to predictive modelling of fault proneness in software modules, introducing a new feature representation to overcome some of these issues. This rank sum representation offers improved or at worst comparable performance to earlier approaches for standard data sets, and readily allows the user to choose an appropriate trade-off between precision and recall to optimise inspection effort to suit different testing environments. The method is evaluated using the NASA Metrics Data Program (MDP) data sets, and performance is compared with existing studies based on the Support Vector Machine (SVM) and Naïve Bayes (NB) Classifiers, and with our own comprehensive evaluation of these methods.
Abstract:
Spatially-explicit modelling of grassland classes is important to site-specific planning for improving grassland and environmental management over large areas. In this study, a climate-based grassland classification model, the Comprehensive and Sequential Classification System (CSCS) was integrated with spatially interpolated climate data to classify grassland in Gansu province, China. The study area is characterized by complex topographic features imposed by plateaus, high mountains, basins and deserts. To improve the quality of the interpolated climate data and the quality of the spatial classification over this complex topography, three linear regression methods, namely an analytic method based on multiple regression and residues (AMMRR), a modification of the AMMRR method through adding the effect of slope and aspect to the interpolation analysis (M-AMMRR) and a method which replaces the IDW approach for residue interpolation in M-AMMRR with an ordinary kriging approach (I-AMMRR), for interpolating climate variables were evaluated. The interpolation outcomes from the best interpolation method were then used in the CSCS model to classify the grassland in the study area. Climate variables interpolated included the annual cumulative temperature and annual total precipitation. The results indicated that the AMMRR and M-AMMRR methods generated acceptable climate surfaces but the best model fit and cross validation result were achieved by the I-AMMRR method. Twenty-six grassland classes were classified for the study area. The four grassland vegetation classes that covered more than half of the total study area were "cool temperate-arid temperate zonal semi-desert", "cool temperate-humid forest steppe and deciduous broad-leaved forest", "temperate-extra-arid temperate zonal desert", and "frigid per-humid rain tundra and alpine meadow". 
The vegetation classification map generated in this study provides spatial information on the locations and extents of the different grassland classes. This information can be used to facilitate government agencies' decision-making in land-use planning and environmental management, and for vegetation and biodiversity conservation. The information can also be used to assist land managers in the estimation of safe carrying capacities which will help to prevent overgrazing and land degradation.
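The abstract mentions inverse distance weighting (IDW) as the residue interpolation step that I-AMMRR replaces with ordinary kriging. A minimal sketch of plain IDW follows, assuming planar coordinates and a distance exponent of 2; neither the exponent nor the station data comes from the study, and the function name `idw` is hypothetical.

```python
import math


def idw(points, values, query, power=2.0):
    """Estimate a value at `query` by inverse distance weighting.

    points : list of (x, y) station coordinates
    values : value observed at each station (e.g. a regression residual)
    power  : distance exponent; 2.0 is a common default (assumed here)
    """
    numerator = 0.0
    denominator = 0.0
    for (x, y), value in zip(points, values):
        distance = math.hypot(query[0] - x, query[1] - y)
        if distance == 0.0:
            # The query coincides with a station, so return its value exactly.
            return value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    stations = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]  # hypothetical coordinates
    residuals = [1.0, 3.0, -0.5]                     # hypothetical residuals
    print(idw(stations, residuals, (1.0, 1.0)))
```

In a regression-plus-residual scheme of the kind the abstract describes, the interpolated residual would be added back to the regression prediction at each grid cell to obtain the final climate surface.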
Abstract:
Next Generation Sequencing (NGS) has revolutionised molecular biology, resulting in an explosion of data sets and an increasing role in clinical practice. Such applications necessarily require rapid identification of the organism as a prelude to annotation and further analysis. NGS data consist of a substantial number of short sequence reads, given context through downstream assembly and annotation, a process requiring reads consistent with the assumed species or species group. Highly accurate results have been obtained for restricted sets using SVM classifiers, but such methods are difficult to parallelise and success depends on careful attention to feature selection. This work examines the problem at very large scale, using a mix of synthetic and real data with a view to determining the overall structure of the problem and the effectiveness of parallel ensembles of simpler classifiers (principally random forests) in addressing the challenges of large scale genomics.