991 resultados para Machine Typed Document
Resumo:
Description of a patient's injuries is recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families such as decision tree, probabilistic, neural networks, and instance-based, ensemble-based and kernel-based linear classifiers. An extensive pre-processing is carried out to ensure the quality of data and, in hence, the quality classification outcome. The records with a null entry in injury description are removed. The misspelling correction process is carried out by finding and replacing the misspelt word with a soundlike word. Meaningful phrases have been identified and kept, instead of removing the part of phrase as a stop word. The abbreviations appearing in many forms of entry are manually identified and only one form of abbreviations is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically from about 28,000 to 5000. The medical narrative text injury dataset, under consideration, is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, Matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers with these reduced feature space have been built. In experiments, a set of tests are conducted to reflect which classification method is best for the medical text classification. The Non Negative Matrix Factorization with Support Vector Machine method can achieve 93% precision which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting which works well for long text classification is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as it affects the classification performance.
Resumo:
Heart rate variability (HRV) refers to the regulation of the sinoatrial node, the natural pacemaker of the heart by the sympathetic and parasympathetic branches of the autonomic nervous system. HRV analysis is an important tool to observe the heart’s ability to respond to normal regulatory impulses that affect its rhythm. Like many bio-signals, HRV signals are non-linear in nature. Higher order spectral analysis (HOS) is known to be a good tool for the analysis of non-linear systems and provides good noise immunity. A computer-based arrhythmia detection system of cardiac states is very useful in diagnostics and disease management. In this work, we studied the identification of the HRV signals using features derived from HOS. These features were fed to the support vector machine (SVM) for classification. Our proposed system can classify the normal and other four classes of arrhythmia with an average accuracy of more than 85%.
Resumo:
Problem addressed Wrist-worn accelerometers are associated with greater compliance. However, validated algorithms for predicting activity type from wrist-worn accelerometer data are lacking. This study compared the activity recognition rates of an activity classifier trained on acceleration signal collected on the wrist and hip. Methodology 52 children and adolescents (mean age 13.7 +/- 3.1 year) completed 12 activity trials that were categorized into 7 activity classes: lying down, sitting, standing, walking, running, basketball, and dancing. During each trial, participants wore an ActiGraph GT3X+ tri-axial accelerometer on the right hip and the non-dominant wrist. Features were extracted from 10-s windows and inputted into a regularized logistic regression model using R (Glmnet + L1). Results Classification accuracy for the hip and wrist was 91.0% +/- 3.1% and 88.4% +/- 3.0%, respectively. The hip model exhibited excellent classification accuracy for sitting (91.3%), standing (95.8%), walking (95.8%), and running (96.8%); acceptable classification accuracy for lying down (88.3%) and basketball (81.9%); and modest accuracy for dance (64.1%). The wrist model exhibited excellent classification accuracy for sitting (93.0%), standing (91.7%), and walking (95.8%); acceptable classification accuracy for basketball (86.0%); and modest accuracy for running (78.8%), lying down (74.6%) and dance (69.4%). Potential Impact Both the hip and wrist algorithms achieved acceptable classification accuracy, allowing researchers to use either placement for activity recognition.
Resumo:
The use of ‘topic’ concepts has shown improved search performance, given a query, by bringing together relevant documents which use different terms to describe a higher level concept. In this paper, we propose a method for discovering and utilizing concepts in indexing and search for a domain specific document collection being utilized in industry. This approach differs from others in that we only collect focused concepts to build the concept space and that instead of turning a user’s query into a concept based query, we experiment with different techniques of combining the original query with a concept query. We apply the proposed approach to a real-world document collection and the results show that in this scenario the use of concept knowledge at index and search can improve the relevancy of results.
Resumo:
Automated remote ultrasound detectors allow large amounts of data on bat presence and activity to be collected. Processing of such data involves identifying bat species from their echolocation calls. Automated species identification has the potential to provide more consistent, predictable, and potentially higher levels of accuracy than identification by humans. In contrast, identification by humans permits flexibility and intelligence in identification, as well as the incorporation of features and patterns that may be difficult to quantify. We compared humans with artificial neural networks (ANNs) in their ability to classify short recordings of bat echolocation calls of variable signal to noise ratios; these sequences are typical of those obtained from remote automated recording systems that are often used in large-scale ecological studies. We presented 45 recordings (1–4 calls) produced by known species of bats to ANNs and to 26 human participants with 1 month to 23 years of experience in acoustic identification of bats. Humans correctly classified 86% of recordings to genus and 56% to species; ANNs correctly identified 92% and 62%, respectively. There was no significant difference between the performance of ANNs and that of humans, but ANNs performed better than about 75% of humans. There was little relationship between the experience of the human participants and their classification rate. However, humans with <1 year of experience performed worse than others. Currently, identification of bat echolocation calls by humans is suitable for ecological research, after careful consideration of biases. However, improvements to ANNs and the data that they are trained on may in future increase their performance to beyond those demonstrated by humans.
Resumo:
Background The use of dual growing rods is a fusionless surgical approach to the treatment of early onset scoliosis (EOS), which aims of harness potential growth in order to correct spinal deformity. The purpose of this study was to compare the in-vitro biomechanical response of two different dual rod designs under axial rotation loading. Methods Six porcine spines were dissected into seven level thoracolumbar multi-segmental units. Each specimen was mounted and tested in a biaxial Instron machine, undergoing nondestructive left/right axial rotation to peak moments of 4Nm at a constant rotation rate of 8deg.s-1. A motion tracking system (Optotrak) measured 3D displacements of individual vertebrae. Each spine was tested in an un-instrumented state first and then with appropriately sized semi-constrained growing rods and ‘rigid’ rods in alternating sequence. Range of motion, neutral zone size and stiffness were calculated from the moment-rotation curves and intervertebral ranges of motion were calculated from Optotrak data. Findings Irrespective of test sequence, rigid rods showed significantly reduction of total rotation across all instrumented levels (with increased stiffness) whilst semi-constrained rods exhibited similar rotation behavior to the un-instrumented (P<0.05). An 11% and 8% increase in stiffness for left and right axial rotation respectively and 15% reduction in total range of motion was recorded with dual rigid rods compared with semi-constrained rods. Interpretation Based on these findings, the semi-constrained growing rods do not increase axial rotation stiffness compared with un-instrumented spines. This is thought to provide a more physiological environment for the growing spine compared to dual rigid rod constructs.
Resumo:
Clustering is an important technique in organising and categorising web scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution where the Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped on to those clusters. This solution is based on the assumption that the Wikipedia contains such a wide range of diverse topics that it represents a small scale web. We found that it was possible to perform the web scale document clustering and labeling process on one desktop computer under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer granularity clusters such as 10,000 or 50,000. These results were evaluated using a set of external data.
Resumo:
The commercialization of aerial image processing is highly dependent on the platforms such as UAVs (Unmanned Aerial Vehicles). However, the lack of an automated UAV forced landing site detection system has been identified as one of the main impediments to allow UAV flight over populated areas in civilian airspace. This article proposes a UAV forced landing site detection system that is based on machine learning approaches including the Gaussian Mixture Model and the Support Vector Machine. A range of learning parameters are analysed including the number of Guassian mixtures, support vector kernels including linear, radial basis function Kernel (RBF) and polynormial kernel (poly), and the order of RBF kernel and polynormial kernel. Moreover, a modified footprint operator is employed during feature extraction to better describe the geometric characteristics of the local area surrounding a pixel. The performance of the presented system is compared to a baseline UAV forced landing site detection system which uses edge features and an Artificial Neural Network (ANN) region type classifier. Experiments conducted on aerial image datasets captured over typical urban environments reveal improved landing site detection can be achieved with an SVM classifier with an RBF kernel using a combination of colour and texture features. Compared to the baseline system, the proposed system provides significant improvement in term of the chance to detect a safe landing area, and the performance is more stable than the baseline in the presence of changes to the UAV altitude.
Resumo:
For traditional information filtering (IF) models, it is often assumed that the documents in one collection are only related to one topic. However, in reality users’ interests can be diverse and the documents in the collection often involve multiple topics. Topic modelling was proposed to generate statistical models to represent multiple topics in a collection of documents, but in a topic model, topics are represented by distributions over words which are limited to distinctively represent the semantics of topics. Patterns are always thought to be more discriminative than single terms and are able to reveal the inner relations between words. This paper proposes a novel information filtering model, Significant matched Pattern-based Topic Model (SPBTM). The SPBTM represents user information needs in terms of multiple topics and each topic is represented by patterns. More importantly, the patterns are organized into groups based on their statistical and taxonomic features, from which the more representative patterns, called Significant Matched Patterns, can be identified and used to estimate the document relevance. Experiments on benchmark data sets demonstrate that the SPBTM significantly outperforms the state-of-the-art models.
Resumo:
This thesis is concerned with the detection and prediction of rain in environmental recordings using different machine learning algorithms. The results obtained in this research will help ecologists to efficiently analyse environmental data and monitor biodiversity.
Resumo:
The mining industry is highly suitable for the application of robotics and automation technology since the work is arduous, dangerous and often repetitive. This paper describes the development of an automation system for a physically large and complex field robotic system - a 3,500 tonne mining machine (a dragline). The major components of the system are discussed with a particular emphasis on the machine/operator interface. A very important aspect of this system is that it must work cooperatively with a human operator, seamlessly passing the control back and forth in order to achieve the main aim - increased productivity.
Resumo:
Neural interface devices and the melding of mind and machine, challenge the law in determining where civil liability for injury, damage or loss should lie. The ability of the human mind to instruct and control these devices means that in a negligence action against a person with a neural interface device, determining the standard of care owed by him or her will be of paramount importance. This article considers some of the factors that may influence the court’s determination of the appropriate standard of care to be applied in this situation, leading to the conclusion that a new standard of care might evolve.
Resumo:
The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes. ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine grained clustering has not been previously demonstrated. Previous approaches clustered a sample that limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than the existing algorithms. Fine grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show an improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and unfeasible.