461 results for statistical techniques
Abstract:
A user’s query is considered to be an imprecise description of their information need. Automatic query expansion is the process of reformulating the original query with the goal of improving retrieval effectiveness. Many successful query expansion techniques ignore information about the dependencies that exist between words in natural language. However, more recent approaches have demonstrated that, by explicitly modeling associations between terms, significant improvements in retrieval effectiveness can be achieved over approaches that ignore these dependencies. State-of-the-art dependency-based approaches have been shown to primarily model syntagmatic associations. Syntagmatic associations capture the tendency of two terms to co-occur more often than would be expected by chance. However, structural linguistics relies on both syntagmatic and paradigmatic associations to deduce the meaning of a word. Given the success of dependency-based approaches and the reliance on word meanings in the query formulation process, we argue that modeling both syntagmatic and paradigmatic information in the query expansion process will improve retrieval effectiveness. This article develops and evaluates a new query expansion technique that is based on a formal, corpus-based model of word meaning that captures syntagmatic and paradigmatic associations. We demonstrate that when sufficient statistical information exists, as in the case of longer queries, including paradigmatic information alone provides significant improvements in retrieval effectiveness across a wide variety of data sets. More generally, when our new query expansion approach is applied to large-scale web retrieval, it demonstrates significant improvements in retrieval effectiveness over a strong baseline system based on a commercial search engine.
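The abstract's notion of a syntagmatic association rests on terms co-occurring more often than chance would predict. As a loose illustration of such a score (not the article's model), the sketch below computes document-level pointwise mutual information over a toy corpus; the data and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def syntagmatic_scores(documents):
    """Score term pairs by pointwise mutual information (PMI): pairs that
    co-occur in documents more often than chance get a positive score.
    Illustrative only; not the article's word-meaning model."""
    n_docs = len(documents)
    term_df = Counter()
    pair_df = Counter()
    for doc in documents:
        terms = set(doc.lower().split())
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))
    scores = {}
    for pair, df in pair_df.items():
        a, b = tuple(pair)
        p_ab = df / n_docs
        p_a = term_df[a] / n_docs
        p_b = term_df[b] / n_docs
        scores[pair] = math.log(p_ab / (p_a * p_b))
    return scores

docs = ["query expansion improves retrieval",
        "query reformulation improves retrieval effectiveness",
        "paradigmatic associations capture word meaning"]
for pair, s in sorted(syntagmatic_scores(docs).items(), key=lambda x: -x[1])[:3]:
    print(sorted(pair), round(s, 2))
```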
Abstract:
Introduction: Recent advances in the planning and delivery of radiotherapy treatments have resulted in improvements in the accuracy and precision with which therapeutic radiation can be administered. As the complexity of the treatments increases, it becomes more difficult to predict the dose distribution in the patient accurately. Monte Carlo (MC) methods have the potential to improve the accuracy of the dose calculations and are increasingly being recognised as the ‘gold standard’ for predicting dose deposition in the patient [1]. This project has three main aims: 1. To develop tools that enable the transfer of treatment plan information from the treatment planning system (TPS) to an MC dose calculation engine. 2. To develop tools for comparing the 3D dose distributions calculated by the TPS and the MC dose engine. 3. To investigate the radiobiological significance of any errors between the TPS patient dose distribution and the MC dose distribution in terms of Tumour Control Probability (TCP) and Normal Tissue Complication Probabilities (NTCP). The work presented here addresses the first two aims. Methods: (1a) Plan Importing: A database of commissioned accelerator models (Elekta Precise and Varian 2100CD) has been developed for treatment simulations in the MC system (EGSnrc/BEAMnrc). Beam descriptions can be exported from the TPS using the widespread DICOM framework, and the resultant files are parsed with the assistance of a software library (PixelMed Java DICOM Toolkit). The information in these files (such as the monitor units, the jaw positions and gantry orientation) is used to construct a plan-specific accelerator model which allows an accurate simulation of the patient treatment field. (1b) Dose Simulation: The calculation of a dose distribution requires patient CT images, which are prepared for the MC simulation using a tool (CTCREATE) packaged with the system. Beam simulation results are converted to absolute dose per MU using calibration factors recorded during the commissioning process and treatment simulation. These distributions are combined according to the MU meter settings stored in the exported plan to produce an accurate description of the prescribed dose to the patient. (2) Dose Comparison: TPS dose calculations can be obtained either by DICOM export or by direct retrieval of binary dose files from the file system. Dose difference, gamma evaluation and normalised dose difference algorithms [2] were employed for the comparison of the TPS dose distribution and the MC dose distribution. These implementations are independent of spatial resolution and able to interpolate for comparisons. Results and Discussion: The tools successfully produced Monte Carlo input files for a variety of plans exported from the Eclipse (Varian Medical Systems) and Pinnacle (Philips Medical Systems) planning systems, ranging in complexity from a single uniform square field to a five-field step-and-shoot IMRT treatment. The simulation of collimated beams has been verified geometrically, and validation of dose distributions in a simple body phantom (QUASAR) will follow. The developed dose comparison algorithms have also been tested with controlled dose distribution changes. Conclusion: The capability of the developed code to independently process treatment plans has been demonstrated.
A number of limitations exist: only static fields are currently supported (dynamic wedges and dynamic IMRT will require further development), and the process has not been tested for planning systems other than Eclipse and Pinnacle. The tools will be used to independently assess the accuracy of the current treatment planning system dose calculation algorithms for complex treatment deliveries such as IMRT in treatment sites where patient inhomogeneities are expected to be significant. Acknowledgements: Computational resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia. Pinnacle dose parsing made possible with the help of Paul Reich, North Coast Cancer Institute, North Coast, New South Wales.
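The dose comparison step above includes a gamma evaluation. As a rough illustration only (not the authors' implementation), here is a minimal 1D global gamma-index sketch assuming uniformly spaced dose profiles; the tolerances and toy profiles are hypothetical.

```python
import numpy as np

def gamma_1d(dose_ref, dose_eval, spacing_mm, dta_mm=3.0, dd_percent=3.0):
    """Minimal 1D global gamma index: for each reference point take the
    minimum over evaluated points of
    sqrt((distance/DTA)^2 + (dose difference/criterion)^2)."""
    positions = np.arange(len(dose_ref)) * spacing_mm
    dd_criterion = dd_percent / 100.0 * dose_ref.max()   # global normalisation
    gamma = np.empty(len(dose_ref))
    for i, (x_ref, d_ref) in enumerate(zip(positions, dose_ref)):
        dist2 = ((positions - x_ref) / dta_mm) ** 2
        dose2 = ((dose_eval - d_ref) / dd_criterion) ** 2
        gamma[i] = np.sqrt(np.min(dist2 + dose2))
    return gamma

# Toy profiles: evaluated dose shifted 1 mm relative to the reference.
x = np.linspace(-50, 50, 101)
ref = 2.0 * np.exp(-(x / 20.0) ** 2)            # Gy
ev = 2.0 * np.exp(-((x - 1.0) / 20.0) ** 2)
g = gamma_1d(ref, ev, spacing_mm=1.0)
print(f"gamma pass rate (gamma <= 1): {np.mean(g <= 1):.1%}")
```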
Abstract:
Matched case–control research designs can be useful because matching can increase power due to reduced variability between subjects. However, inappropriate statistical analysis of matched data could result in a change in the strength of association between the dependent and independent variables or a change in the significance of the findings. We sought to ascertain whether matched case–control studies published in the nursing literature utilized appropriate statistical analyses. Of 41 articles identified that met the inclusion criteria, 31 (76%) used an inappropriate statistical test for comparing data derived from case subjects and their matched controls. In response to this finding, we developed an algorithm to support decision-making regarding statistical tests for matched case–control studies.
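The article's decision algorithm is not reproduced here. As a hedged illustration of the underlying point, the sketch below contrasts an unpaired chi-square test, which ignores the matching, with McNemar's test, one common appropriate choice for a binary exposure in 1:1 matched pairs; the pair counts are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 1:1 matched pairs, binary exposure.
# Rows: case exposed / unexposed; columns: control exposed / unexposed.
pair_table = np.array([[30, 25],   # case exposed
                       [10, 35]])  # case unexposed

# Appropriate for matched pairs: McNemar's test uses only the
# discordant pairs (25 vs 10).
print("McNemar p =", round(mcnemar(pair_table, exact=True).pvalue, 4))

# Inappropriate for matched data: an unpaired chi-square test on the
# marginal exposure counts, which ignores the pairing.
cases = pair_table.sum(axis=1)      # [exposed, unexposed] among cases
controls = pair_table.sum(axis=0)   # [exposed, unexposed] among controls
chi2, p, dof, _ = chi2_contingency(np.array([cases, controls]))
print("Unpaired chi-square p =", round(p, 4))
```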
Abstract:
The increased adoption of business process management approaches, tools and practices has led organizations to accumulate large collections of business process models. These collections can easily include hundreds to thousands of models, especially in the context of multinational corporations or as a result of organizational mergers and acquisitions. A concrete problem is thus how to maintain these large repositories in such a way that their complexity does not hamper their practical usefulness as a means to describe and communicate business operations. This paper proposes a technique to automatically infer suitable names for business process models and fragments thereof. This technique is useful for model abstraction scenarios, for instance when user-specific views of a repository are required, or as part of a refactoring initiative aimed at reducing the repository’s complexity. The technique is grounded in an adaptation of the theory of meaning to the realm of business process models. We implemented the technique in a prototype tool and conducted an extensive evaluation using three process model collections from practice and a case study involving process modelers with different levels of experience.
Abstract:
When a community already torn by an event such as a prolonged war is then hit by a natural disaster, the negative impact of this subsequent disaster can be extremely devastating in the longer term. Natural disasters further damage already destabilised and demoralised communities, making it much harder for them to be resilient and recover. Communities often face enormous challenges during the immediate recovery and the subsequent long-term reconstruction periods, mainly due to the lack of a viable community involvement process. In post-war settings, affected communities, including those internally displaced, are often perceived as being completely incapacitated and are hardly ever consulted when reconstruction projects are being instigated. This lack of community involvement often leads to poor project planning, decreased community support, and unsustainable completed projects. The impact of war, coupled with the tensions created by uninhabitable and poor housing provision, often hinders affected residents from integrating permanently into their home communities. This paper outlines a number of fundamental factors that act as barriers to community participation following natural disasters in post-war settings. The paper is based on a statistical analysis of, and findings from, a questionnaire survey administered in early 2012 in Afghanistan.
Abstract:
In a classification problem we typically face two challenging issues: the diverse characteristics of negative documents, and the often large number of negative documents that are close to positive documents. It is therefore hard for a single classifier to clearly classify incoming documents into classes. This paper proposes a novel gradual problem-solving approach that creates a two-stage classifier. The first stage identifies reliable negatives (negative documents with weak positive characteristics). It concentrates on minimizing the number of false negative documents (recall-oriented). We use Rocchio, an existing recall-based classifier, for this stage. The second stage is a precision-oriented “fine tuning” that concentrates on minimizing the number of false positive documents by applying pattern (statistical phrase) mining techniques. In this stage, pattern-based scoring is followed by threshold setting (thresholding). Experiments show that our statistical-phrase-based two-stage classifier is promising.
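The paper's exact pipeline is not reproduced here. As a minimal sketch of the two-stage idea, the code below uses a Rocchio-style centroid comparison as the recall-oriented first stage and a simple score threshold standing in for the pattern-based second stage; the training documents and the threshold value are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training data: 1 = positive class, 0 = negative class.
train_docs = ["statistical phrase mining for text",
              "pattern mining improves classification",
              "cooking recipes for dinner",
              "football match results today"]
train_labels = np.array([1, 1, 0, 0])

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)

# Rocchio-style class centroids in tf-idf space.
pos_centroid = np.asarray(X[train_labels == 1].mean(axis=0))
neg_centroid = np.asarray(X[train_labels == 0].mean(axis=0))

def two_stage_predict(docs, stage2_threshold=0.15):
    Xt = vec.transform(docs)
    sim_pos = cosine_similarity(Xt, pos_centroid).ravel()
    sim_neg = cosine_similarity(Xt, neg_centroid).ravel()
    # Stage 1 (recall-oriented): keep documents at least as close to the
    # positive centroid as to the negative one, losing few true positives.
    stage1_pass = sim_pos >= sim_neg
    # Stage 2 (precision-oriented): a stricter score threshold stands in
    # for the paper's pattern-based scoring and thresholding.
    return stage1_pass & (sim_pos >= stage2_threshold)

print(two_stage_predict(["phrase mining techniques", "dinner recipes"]))
```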
Abstract:
Background: Developing sampling strategies to target biological pests such as insects in stored grain is inherently difficult owing to species biology and behavioural characteristics. The design of robust sampling programmes should be based on an underlying statistical distribution that is sufficiently flexible to capture variations in the spatial distribution of the target species. Results: Comparisons are made of the accuracy of four probability-of-detection sampling models - the negative binomial model [1], the Poisson model [1], the double logarithmic model [2] and the compound model [3] - for detection of insects over a broad range of insect densities. Although the double logarithmic and negative binomial models performed well under specific conditions, it is shown that, of the four models examined, the compound model performed best over a broad range of insect spatial distributions and densities. In particular, this model predicted well the number of samples required when insect density was high and clumped within experimental storages. Conclusions: This paper reinforces the need for effective sampling programmes designed to detect insects over a broad range of spatial distributions. The compound model is robust over a broad range of insect densities and leads to substantial improvement in detection probabilities within highly variable systems such as grain storage.
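As a worked illustration of two of the four model families (not the specific formulations cited above), the sketch below computes the probability of detecting at least one insect in n sample units under a Poisson and a negative binomial count model; the densities, sample number and aggregation parameter are hypothetical.

```python
import math

def p_detect_poisson(mean_density, n_samples):
    """Probability of at least one insect in n samples when counts per
    sample are Poisson with the given mean density."""
    return 1.0 - math.exp(-n_samples * mean_density)

def p_detect_negbin(mean_density, n_samples, k):
    """Same probability when counts follow a negative binomial with
    aggregation parameter k (small k = strongly clumped)."""
    return 1.0 - (1.0 + mean_density / k) ** (-n_samples * k)

# Clumping (small k) reduces detection probability for the same effort.
for m in (0.01, 0.1, 1.0):            # insects per sample unit
    print(f"density {m:>4}: Poisson {p_detect_poisson(m, 10):.3f}  "
          f"neg. binomial (k=0.5) {p_detect_negbin(m, 10, 0.5):.3f}")
```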
Abstract:
Operational modal analysis (OMA) is prevalent in the modal identification of civil structures. It requires response measurements of the underlying structure under ambient loads. A valid OMA method requires the excitation to be white noise in time and space. Although there are numerous applications of OMA in the literature, few have investigated the statistical distribution of a measurement and the influence of such randomness on modal identification. This research applies a modified kurtosis index to evaluate the statistical distribution of raw measurement data. In addition, a windowing strategy employing this index is proposed to select quality datasets. To demonstrate how the data selection strategy works, ambient vibration measurements of a laboratory bridge model and a real cable-stayed bridge were considered. The analysis used frequency domain decomposition (FDD) as the target OMA approach for modal identification. The modal identification results obtained using data segments with different randomness were compared. The discrepancy in the FDD spectra of the results indicates that, in order to fulfil the assumptions of an OMA method, special care must be taken in processing long vibration measurement records. The proposed data selection strategy is easy to apply and has been verified as effective in modal analysis.
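The paper's modified kurtosis index is not specified in the abstract. As a hedged stand-in, the sketch below uses plain sample excess kurtosis to rank windows of a long record by how close they are to Gaussian and keeps the best ones; the synthetic record and window length are hypothetical.

```python
import numpy as np
from scipy.stats import kurtosis

def select_windows(signal, window_len, n_keep=3):
    """Split a long ambient-vibration record into windows, rank them by how
    close their sample kurtosis is to the Gaussian value, and keep the best.
    Plain excess kurtosis stands in for the paper's modified index."""
    n_windows = len(signal) // window_len
    windows = signal[:n_windows * window_len].reshape(n_windows, window_len)
    excess = kurtosis(windows, axis=1, fisher=True)   # 0 for Gaussian data
    order = np.argsort(np.abs(excess))
    return windows[order[:n_keep]], excess[order[:n_keep]]

# Toy record: white noise with a short impulsive (non-Gaussian) burst.
rng = np.random.default_rng(0)
record = rng.standard_normal(60_000)
record[20_000:20_200] += 10 * rng.standard_normal(200)
kept, kurt = select_windows(record, window_len=6_000)
print("excess kurtosis of selected windows:", np.round(kurt, 3))
```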
Abstract:
Genomic DNA obtained from patient whole blood samples is a key element for genomic research. Advantages and disadvantages, in terms of time-efficiency, cost-effectiveness and laboratory requirements, of the procedures available to isolate nucleic acids need to be considered before choosing any particular method. These characteristics have not been fully evaluated for some laboratory techniques, such as the salting out method for DNA extraction, which has been excluded from comparison in the studies published to date. We compared three different protocols (a traditional salting out method, a modified salting out method and a commercially available kit method) to determine the most cost-effective and time-efficient method to extract DNA. We extracted genomic DNA from whole blood samples obtained from breast cancer patient volunteers and compared the products obtained in terms of quantity (concentration of DNA extracted and DNA obtained per ml of blood used) and quality (260/280 ratio and polymerase chain reaction product amplification). On average, the three methods showed no statistically significant differences in the final yield, but when the time and cost of each method were taken into account, the differences were very significant. The modified salting out method resulted in a seven- and twofold reduction in cost compared to the commercial kit and the traditional salting out method, respectively, and reduced the time from 3 days to 1 hour compared to the traditional salting out method. This highlights the modified salting out method as a suitable choice for laboratories and research centres, particularly when dealing with a large number of samples.
Abstract:
Nitrous oxide emissions from soil are known to be spatially and temporally volatile. Reliable estimation of emissions over a given time and space depends on measuring with sufficient intensity, but deciding on the number of measuring stations and the frequency of observation can be vexing. The question also arises of whether low-frequency manual observations provide results comparable to high-frequency automated sampling. Data collected from a replicated field experiment were studied intensively with the intention of giving statistically robust guidance on these issues. In the experiment, soil-to-air nitrous oxide flux was monitored within 10 m by 2.5 m plots over sixty days, by automated closed chambers at an average sampling interval of 3 h and by manual static chambers at an average sampling interval of three days. Trends in flux over time observed by the static chambers were mostly within the auto chamber bounds of experimental error. Cumulative nitrous oxide emissions as measured by each system were also within error bounds. Under the temporal response pattern in this experiment, no significant loss of information was observed after culling the data to simulate results under various low-frequency scenarios. Within the confines of this experiment, observations from the manual chambers were not spatially correlated above distances of 1 m. Statistical power was therefore found to improve with increased replicates per treatment or chambers per replicate. Careful after-action review of experimental data can deliver savings for future work.
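As a loose illustration of culling a high-frequency record to simulate low-frequency manual sampling (not the experiment's actual data or analysis), the sketch below compares cumulative emissions estimated from a 3 h series and from a 3 day subsample of it; the flux series and its units are synthetic.

```python
import numpy as np

def cumulative_flux(times_h, flux):
    """Trapezoidal integration of flux over time in hours (toy units)."""
    return np.trapz(flux, times_h)

# Hypothetical 60-day record at a 3 h interval, with one emission pulse.
rng = np.random.default_rng(1)
t = np.arange(0, 60 * 24, 3.0)                       # hours
flux = (0.5 + 4.0 * np.exp(-((t - 240) / 100) ** 2)
        + 0.1 * rng.standard_normal(t.size))

# Simulate low-frequency manual sampling: one reading every 3 days.
step = int(3 * 24 / 3)                               # every 24th observation
t_low, flux_low = t[::step], flux[::step]

print(f"auto chambers   (3 h interval): {cumulative_flux(t, flux):.0f}")
print(f"manual chambers (3 d interval): {cumulative_flux(t_low, flux_low):.0f}")
```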
Abstract:
Results of an interlaboratory comparison on size characterization of SiO2 airborne nanoparticles using on-line and off-line measurement techniques are discussed. This study was performed in the framework of Technical Working Area (TWA) 34, “Properties of Nanoparticle Populations”, of the Versailles Project on Advanced Materials and Standards (VAMAS), in project no. 3, “Techniques for characterizing size distribution of airborne nanoparticles”. Two types of nano-aerosols, consisting of (1) one population of nanoparticles with a mean diameter between 30.3 and 39.0 nm and (2) two populations of non-agglomerated nanoparticles with mean diameters between, respectively, 36.2–46.6 nm and 80.2–89.8 nm, were generated for characterization measurements. Scanning mobility particle size spectrometers (SMPS) were used for on-line measurements of the size distributions of the produced nano-aerosols. Transmission electron microscopy, scanning electron microscopy, and atomic force microscopy were used as off-line measurement techniques for nanoparticle characterization. Samples were deposited on appropriate supports such as grids, filters, and mica plates by electrostatic precipitation and a filtration technique, using SMPS-controlled generation upstream. The results for the main size distribution parameters (mean and mode diameters), obtained from several laboratories, were compared based on metrological approaches including metrological traceability, calibration, and evaluation of the measurement uncertainty. Internationally harmonized measurement procedures for airborne SiO2 nanoparticle characterization are proposed.
Abstract:
A significant amount of speech is typically required for speaker verification system development and evaluation, especially in the presence of large intersession variability. This paper introduces source- and utterance-duration-normalized linear discriminant analysis (SUN-LDA) approaches to compensate for session variability in short-utterance i-vector speaker verification systems. Two variations of SUN-LDA are proposed in which normalization techniques are used to capture source variation from both short and full-length development i-vectors, one based upon pooling (SUN-LDA-pooled) and the other upon concatenation (SUN-LDA-concat) across the duration- and source-dependent session variation. Both the SUN-LDA-pooled and SUN-LDA-concat techniques are shown to provide improvement over traditional LDA on the NIST 08 truncated 10sec-10sec evaluation conditions, with the highest improvement obtained with the SUN-LDA-concat technique, which achieves a relative improvement in EER of 8% for mismatched conditions and over 3% for matched conditions over traditional LDA approaches.
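SUN-LDA itself is not specified in enough detail in the abstract to reproduce. As a generic sketch of the underlying step, the code below estimates a conventional LDA projection from a pooled development set of toy i-vectors, maximising between-speaker relative to within-speaker scatter; the dimensions and data are hypothetical and much smaller than real i-vectors.

```python
import numpy as np

def lda_projection(ivectors, speaker_ids, n_dims):
    """Estimate an LDA projection from development i-vectors: maximise
    between-speaker scatter relative to within-speaker (session) scatter.
    Conventional LDA on a pooled set, not the paper's SUN-LDA formulation."""
    d = ivectors.shape[1]
    mean_global = ivectors.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for spk in np.unique(speaker_ids):
        Xc = ivectors[speaker_ids == spk]
        mc = Xc.mean(axis=0)
        diff = (mc - mean_global)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)
        Sw += (Xc - mc).T @ (Xc - mc)
    Sw += 1e-6 * np.eye(d)                     # guard against a singular Sw
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_dims]].real

# Hypothetical pooled development set: 20 speakers x 10 sessions of
# low-dimensional toy "i-vectors" (real i-vectors are 400-600 dimensional).
rng = np.random.default_rng(2)
speaker_ids = np.repeat(np.arange(20), 10)
ivectors = (rng.standard_normal((200, 50))
            + np.repeat(rng.standard_normal((20, 50)), 10, axis=0))
W = lda_projection(ivectors, speaker_ids, n_dims=19)
print("projected i-vector shape:", (ivectors @ W).shape)
```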
Abstract:
Reliability of the performance of biometric identity verification systems remains a significant challenge. Individual biometric samples of the same person (identity class) are not identical at each presentation, and performance degradation arises from intra-class variability and inter-class similarity. These limitations lead to false accepts and false rejects that are interdependent, so it is difficult to reduce the rate of one type of error without increasing the other. The focus of this dissertation is to investigate a method based on classifier fusion techniques to better control the trade-off between the verification errors, using text-dependent speaker verification as the test platform. A sequential classifier fusion architecture that integrates multi-instance and multi-sample fusion schemes is proposed. This fusion method enables a controlled trade-off between false alarms and false rejects. For statistically independent classifier decisions, analytical expressions for each type of verification error are derived using base classifier performances. As this assumption may not always be valid, these expressions are modified to incorporate the correlation between statistically dependent decisions from clients and impostors. The architecture is empirically evaluated by applying it to text-dependent speaker verification using Hidden Markov Model based, digit-dependent speaker models in each stage, with multiple attempts for each digit utterance. The trade-off between the verification errors is controlled using two parameters, the number of decision stages (instances) and the number of attempts at each decision stage (samples), fine-tuned on an evaluation/tuning set. The statistical validation of the derived expressions for error estimates is evaluated on test data. The performance of the sequential method is further demonstrated to depend on the order of the combination of digits (instances) and the nature of repetitive attempts (samples). The false rejection and false acceptance rates for the proposed fusion are estimated using the base classifier performances, the variance in correlation between classifier decisions and the sequence of classifiers with favourable dependence selected using the 'Sequential Error Ratio' criterion. The error rates are better estimated by incorporating user-dependent information (such as speaker-dependent thresholds and speaker-specific digit combinations) and class-dependent information (such as client-impostor dependent favourable combinations and class-error based threshold estimation). The proposed architecture is desirable in most speaker verification applications, such as remote authentication and telephone and internet shopping. The tuning of the parameters, the number of instances and samples, serves both the security and user convenience requirements of speaker-specific verification. The architecture investigated here is applicable to verification using other biometric modalities such as handwriting, fingerprints and keystrokes.
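The thesis's derived expressions are not given in the abstract. Under the stated independence assumption, a generic sketch consistent with the described structure (multiple attempts within a stage, acceptance required at every stage) is shown below; the base error rates and stage counts are hypothetical, and the real derivation additionally models correlated decisions.

```python
def sequential_fusion_errors(stage_far, stage_frr, attempts_per_stage):
    """Error rates for a sequential fusion scheme under statistically
    independent decisions: within a stage the claimant passes if ANY of the
    repeated attempts (samples) is accepted; overall acceptance requires
    passing EVERY stage (instance)."""
    overall_far, overall_tar = 1.0, 1.0
    for far, frr, m in zip(stage_far, stage_frr, attempts_per_stage):
        stage_far_m = 1.0 - (1.0 - far) ** m   # impostor accepted on some attempt
        stage_frr_m = frr ** m                 # client rejected on every attempt
        overall_far *= stage_far_m
        overall_tar *= 1.0 - stage_frr_m
    return overall_far, 1.0 - overall_tar      # (false accept, false reject)

# Hypothetical base classifier rates for three digit stages, two attempts each.
far, frr = sequential_fusion_errors([0.05, 0.05, 0.05],
                                    [0.02, 0.02, 0.02],
                                    [2, 2, 2])
print(f"FAR = {far:.5f}, FRR = {frr:.5f}")
```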
Abstract:
This thesis explored the knowledge and reasoning of young children in solving novel statistical problems, and the influence of problem context and design on their solutions. It found that young children's statistical competencies are underestimated, and that problem design and context facilitated children's application of a wide range of knowledge and reasoning skills, none of which had been taught. A qualitative design-based research method, informed by the Models and Modeling perspective (Lesh & Doerr, 2003), underpinned the study. Data modelling activities incorporating picture story books were used to contextualise the problems. Children applied real-world understanding to problem solving, including attribute identification, categorisation and classification skills. Intuitive and metarepresentational knowledge, together with inductive and probabilistic reasoning, were used to make sense of data, and a beginning awareness of statistical variation and informal inference was visible.
A methodology to develop an urban transport disadvantage framework: the case of Brisbane, Australia
Abstract:
Most individuals travel in order to participate in a network of activities which are important for attaining a good standard of living. Because such activities are commonly widely dispersed rather than located locally, regular access to a vehicle is important to avoid exclusion. However, planning transport system provision that can engage members of society in an acceptable degree of activity participation remains a great challenge. The main challenges in most cities of the world stem from significant population growth and rapid urbanisation, which produce increased demand for transport. Keeping pace with these challenges in most urban areas is difficult due to the widening gap between supply and demand for transport systems, which places the urban population at a transport disadvantage. The key element in mitigating the issue of urban transport disadvantage is to accurately identify the urban transport disadvantaged. Although wide-ranging variables and multi-dimensional methods have been used to identify this group, variables are commonly selected using ad-hoc techniques and unsound methods. This raises questions about whether the variables currently used are accurately linked with urban transport disadvantage, and about the effectiveness of current policies. To fill these gaps, the research conducted for this thesis develops an operational urban transport disadvantage framework (UTDAF) based on key statistical urban transport disadvantage variables to accurately identify the urban transport disadvantaged. The methodology combines qualitative and quantitative statistical approaches, and its reliability and applicability, rather than the accuracy of the estimations, is the prime concern. Relevant concepts that bear on the identification and measurement of urban transport disadvantage, and a wide range of urban transport disadvantage variables, were identified through a review of the existing literature. Based on this review, a conceptual urban transport disadvantage framework was developed based on causal theory. Variables identified during the literature review were selected and consolidated based on the recommendations of international and local experts during a Delphi study. Following the literature review, the conceptual urban transport disadvantage framework was statistically assessed to identify key variables. Using the statistical outputs, the key variables were weighted and aggregated to form the UTDAF. Before the variables' weights were finalised, they were adjusted based on the results of correlation analysis between the elements forming the framework to improve its accuracy. The UTDAF was then applied to three contextual conditions to determine its effectiveness in identifying urban transport disadvantage. The framework is likely to provide policy makers with a robust measure to justify infrastructure investments and to generate awareness about the issue of urban transport disadvantage.
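The UTDAF's actual variables and weights are not listed in the abstract. As a generic sketch of the weight-and-aggregate step only (not the thesis's framework), the code below min-max normalises hypothetical indicators and forms a weighted composite score per zone.

```python
import numpy as np

def composite_index(indicators, weights):
    """Min-max normalise each indicator column to [0, 1], then form a
    weighted sum per zone. Variable names and weights are hypothetical;
    this is only the generic weight-and-aggregate step, not the UTDAF."""
    X = np.asarray(indicators, dtype=float)
    X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    w = np.asarray(weights, dtype=float)
    return X_norm @ (w / w.sum())

# Hypothetical zones x indicators: % households without a car,
# distance to the nearest bus stop (km), inverted income rank.
zones = [[12.0, 0.4, 0.3],
         [35.0, 2.1, 0.8],
         [22.0, 1.0, 0.5]]
print(np.round(composite_index(zones, weights=[0.5, 0.3, 0.2]), 3))
```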