17 resultados para automated text classification
em Digital Commons at Florida International University
Resumo:
Flow Cytometry analyzers have become trusted companions due to their ability to perform fast and accurate analyses of human blood. The aim of these analyses is to determine the possible existence of abnormalities in the blood that have been correlated with serious disease states, such as infectious mononucleosis, leukemia, and various cancers. Though these analyzers provide important feedback, it is always desired to improve the accuracy of the results. This is evidenced by the occurrences of misclassifications reported by some users of these devices. It is advantageous to provide a pattern interpretation framework that is able to provide better classification ability than is currently available. Toward this end, the purpose of this dissertation was to establish a feature extraction and pattern classification framework capable of providing improved accuracy for detecting specific hematological abnormalities in flow cytometric blood data. ^ This involved extracting a unique and powerful set of shift-invariant statistical features from the multi-dimensional flow cytometry data and then using these features as inputs to a pattern classification engine composed of an artificial neural network (ANN). The contribution of this method consisted of developing a descriptor matrix that can be used to reliably assess if a donor’s blood pattern exhibits a clinically abnormal level of variant lymphocytes, which are blood cells that are potentially indicative of disorders such as leukemia and infectious mononucleosis. ^ This study showed that the set of shift-and-rotation-invariant statistical features extracted from the eigensystem of the flow cytometric data pattern performs better than other commonly-used features in this type of disease detection, exhibiting an accuracy of 80.7%, a sensitivity of 72.3%, and a specificity of 89.2%. This performance represents a major improvement for this type of hematological classifier, which has historically been plagued by poor performance, with accuracies as low as 60% in some cases. This research ultimately shows that an improved feature space was developed that can deliver improved performance for the detection of variant lymphocytes in human blood, thus providing significant utility in the realm of suspect flagging algorithms for the detection of blood-related diseases.^
Resumo:
This study explores factors related to the prompt difficulty in Automated Essay Scoring. The sample was composed of 6,924 students. For each student, there were 1-4 essays, across 20 different writing prompts, for a total of 20,243 essays. E-rater® v.2 essay scoring engine developed by the Educational Testing Service was used to score the essays. The scoring engine employs a statistical model that incorporates 10 predictors associated with writing characteristics of which 8 were used. The Rasch partial credit analysis was applied to the scores to determine the difficulty levels of prompts. In addition, the scores were used as outcomes in the series of hierarchical linear models (HLM) in which students and prompts constituted the cross-classification levels. This methodology was used to explore the partitioning of the essay score variance.^ The results indicated significant differences in prompt difficulty levels due to genre. Descriptive prompts, as a group, were found to be more difficult than the persuasive prompts. In addition, the essay score variance was partitioned between students and prompts. The amount of the essay score variance that lies between prompts was found to be relatively small (4 to 7 percent). When the essay-level, student-level-and prompt-level predictors were included in the model, it was able to explain almost all variance that lies between prompts. Since in most high-stakes writing assessments only 1-2 prompts per students are used, the essay score variance that lies between prompts represents an undesirable or "noise" variation. Identifying factors associated with this "noise" variance may prove to be important for prompt writing and for constructing Automated Essay Scoring mechanisms for weighting prompt difficulty when assigning essay score.^
Resumo:
The need to provide computers with the ability to distinguish the affective state of their users is a major requirement for the practical implementation of affective computing concepts. This dissertation proposes the application of signal processing methods on physiological signals to extract from them features that can be processed by learning pattern recognition systems to provide cues about a person's affective state. In particular, combining physiological information sensed from a user's left hand in a non-invasive way with the pupil diameter information from an eye-tracking system may provide a computer with an awareness of its user's affective responses in the course of human-computer interactions. In this study an integrated hardware-software setup was developed to achieve automatic assessment of the affective status of a computer user. A computer-based "Paced Stroop Test" was designed as a stimulus to elicit emotional stress in the subject during the experiment. Four signals: the Galvanic Skin Response (GSR), the Blood Volume Pulse (BVP), the Skin Temperature (ST) and the Pupil Diameter (PD), were monitored and analyzed to differentiate affective states in the user. Several signal processing techniques were applied on the collected signals to extract their most relevant features. These features were analyzed with learning classification systems, to accomplish the affective state identification. Three learning algorithms: Naïve Bayes, Decision Tree and Support Vector Machine were applied to this identification process and their levels of classification accuracy were compared. The results achieved indicate that the physiological signals monitored do, in fact, have a strong correlation with the changes in the emotional states of the experimental subjects. These results also revealed that the inclusion of pupil diameter information significantly improved the performance of the emotion recognition system. ^
Resumo:
This dissertation develops a new figure of merit to measure the similarity (or dissimilarity) of Gaussian distributions through a novel concept that relates the Fisher distance to the percentage of data overlap. The derivations are expanded to provide a generalized mathematical platform for determining an optimal separating boundary of Gaussian distributions in multiple dimensions. Real-world data used for implementation and in carrying out feasibility studies were provided by Beckman-Coulter. It is noted that although the data used is flow cytometric in nature, the mathematics are general in their derivation to include other types of data as long as their statistical behavior approximate Gaussian distributions. ^ Because this new figure of merit is heavily based on the statistical nature of the data, a new filtering technique is introduced to accommodate for the accumulation process involved with histogram data. When data is accumulated into a frequency histogram, the data is inherently smoothed in a linear fashion, since an averaging effect is taking place as the histogram is generated. This new filtering scheme addresses data that is accumulated in the uneven resolution of the channels of the frequency histogram. ^ The qualitative interpretation of flow cytometric data is currently a time consuming and imprecise method for evaluating histogram data. This method offers a broader spectrum of capabilities in the analysis of histograms, since the figure of merit derived in this dissertation integrates within its mathematics both a measure of similarity and the percentage of overlap between the distributions under analysis. ^
Resumo:
This research is to establish new optimization methods for pattern recognition and classification of different white blood cells in actual patient data to enhance the process of diagnosis. Beckman-Coulter Corporation supplied flow cytometry data of numerous patients that are used as training sets to exploit the different physiological characteristics of the different samples provided. The methods of Support Vector Machines (SVM) and Artificial Neural Networks (ANN) were used as promising pattern classification techniques to identify different white blood cell samples and provide information to medical doctors in the form of diagnostic references for the specific disease states, leukemia. The obtained results prove that when a neural network classifier is well configured and trained with cross-validation, it can perform better than support vector classifiers alone for this type of data. Furthermore, a new unsupervised learning algorithm---Density based Adaptive Window Clustering algorithm (DAWC) was designed to process large volumes of data for finding location of high data cluster in real-time. It reduces the computational load to ∼O(N) number of computations, and thus making the algorithm more attractive and faster than current hierarchical algorithms.
Resumo:
This dissertation introduces a novel automated book reader as an assistive technology tool for persons with blindness. The literature shows extensive work in the area of optical character recognition, but the current methodologies available for the automated reading of books or bound volumes remain inadequate and are severely constrained during document scanning or image acquisition processes. The goal of the book reader design is to automate and simplify the task of reading a book while providing a user-friendly environment with a realistic but affordable system design. This design responds to the main concerns of (a) providing a method of image acquisition that maintains the integrity of the source (b) overcoming optical character recognition errors created by inherent imaging issues such as curvature effects and barrel distortion, and (c) determining a suitable method for accurate recognition of characters that yields an interface with the ability to read from any open book with a high reading accuracy nearing 98%. This research endeavor focuses in its initial aim on the development of an assistive technology tool to help persons with blindness in the reading of books and other bound volumes. But its secondary and broader aim is to also find in this design the perfect platform for the digitization process of bound documentation in line with the mission of the Open Content Alliance (OCA), a nonprofit Alliance at making reading materials available in digital form. The theoretical perspective of this research relates to the mathematical developments that are made in order to resolve both the inherent distortions due to the properties of the camera lens and the anticipated distortions of the changing page curvature as one leafs through the book. This is evidenced by the significant increase of the recognition rate of characters and a high accuracy read-out through text to speech processing. This reasonably priced interface with its high performance results and its compatibility to any computer or laptop through universal serial bus connectors extends greatly the prospects for universal accessibility to documentation.
Resumo:
Weakly electric fish produce a dual function electric signal that makes them ideal models for the study of sensory computation and signal evolution. This signal, the electric organ discharge (EOD), is used for communication and navigation. In some families of gymnotiform electric fish, the EOD is a dynamic signal that increases in amplitude during social interactions. Amplitude increase could facilitate communication by increasing the likelihood of being sensed by others or by impressing prospective mates or rivals. Conversely, by increasing its signal amplitude a fish might increase its sensitivity to objects by lowering its electrolocation detection threshold. To determine how EOD modulations elicited in the social context affect electrolocation, I developed an automated and fast method for measuring electroreception thresholds using a classical conditioning paradigm. This method employs a moving shelter tube, which these fish occupy at rest during the day, paired with an electrical stimulus. A custom built and programmed robotic system presents the electrical stimulus to the fish, slides the shelter tube requiring them to follow, and records video of their movements. I trained the electric fish of the genus Sternopygus was trained to respond to a resistive stimulus on this apparatus in 2 days. The motion detection algorithm correctly identifies the responses 91% of the time, with a false positive rate of only 4%. This system allows for a large number of trials, decreasing the amount of time needed to determine behavioral electroreception thresholds. This novel method enables the evaluation the evolutionary interplay between two conflicting sensory forces, social communication and navigation.
Resumo:
Physiological signals, which are controlled by the autonomic nervous system (ANS), could be used to detect the affective state of computer users and therefore find applications in medicine and engineering. The Pupil Diameter (PD) seems to provide a strong indication of the affective state, as found by previous research, but it has not been investigated fully yet. ^ In this study, new approaches based on monitoring and processing the PD signal for off-line and on-line affective assessment ("relaxation" vs. "stress") are proposed. Wavelet denoising and Kalman filtering methods are first used to remove abrupt changes in the raw Pupil Diameter (PD) signal. Then three features (PDmean, PDmax and PDWalsh) are extracted from the preprocessed PD signal for the affective state classification. In order to select more relevant and reliable physiological data for further analysis, two types of data selection methods are applied, which are based on the paired t-test and subject self-evaluation, respectively. In addition, five different kinds of the classifiers are implemented on the selected data, which achieve average accuracies up to 86.43% and 87.20%, respectively. Finally, the receiver operating characteristic (ROC) curve is utilized to investigate the discriminating potential of each individual feature by evaluation of the area under the ROC curve, which reaches values above 0.90. ^ For the on-line affective assessment, a hard threshold is implemented first in order to remove the eye blinks from the PD signal and then a moving average window is utilized to obtain the representative value PDr for every one-second time interval of PD. There are three main steps for the on-line affective assessment algorithm, which are preparation, feature-based decision voting and affective determination. The final results show that the accuracies are 72.30% and 73.55% for the data subsets, which were respectively chosen using two types of data selection methods (paired t-test and subject self-evaluation). ^ In order to further analyze the efficiency of affective recognition through the PD signal, the Galvanic Skin Response (GSR) was also monitored and processed. The highest affective assessment classification rate obtained from GSR processing is only 63.57% (based on the off-line processing algorithm). The overall results confirm that the PD signal should be considered as one of the most powerful physiological signals to involve in future automated real-time affective recognition systems, especially for detecting the "relaxation" vs. "stress" states.^
Resumo:
In the wake of the “9-11” terrorists' attacks, the U.S. Government has turned to information technology (IT) to address a lack of information sharing among law enforcement agencies. This research determined if and how information-sharing technology helps law enforcement by examining the differences in perception of the value of IT between law enforcement officers who have access to automated regional information sharing and those who do not. It also examined the effect of potential intervening variables such as user characteristics, training, and experience, on the officers' evaluation of IT. The sample was limited to 588 officers from two sheriff's offices; one of them (the study group) uses information sharing technology, the other (the comparison group) does not. Triangulated methodologies included surveys, interviews, direct observation, and a review of agency records. Data analysis involved the following statistical methods: descriptive statistics, Chi-Square, factor analysis, principal component analysis, Cronbach's Alpha, Mann-Whitney tests, analysis of variance (ANOVA), and Scheffe' post hoc analysis. ^ Results indicated a significant difference between groups: the study group perceived information sharing technology as being a greater factor in solving crime and in increasing officer productivity. The study group was more satisfied with the data available to it. As to the number of arrests made, information sharing technology did not make a difference. Analysis of the potential intervening variables revealed several remarkable results. The presence of a strong performance management imperative (in the comparison sheriff's office) appeared to be a factor in case clearances and arrests, technology notwithstanding. As to the influence of user characteristics, level of education did not influence a user's satisfaction with technology, but user-satisfaction scores differed significantly among years of experience as a law enforcement officer and the amount of computer training, suggesting a significant but weak relationship. ^ Therefore, this study finds that information sharing technology assists law enforcement officers in doing their jobs. It also suggests that other variables such as computer training, experience, and management climate should be accounted for when assessing the impact of information technology. ^
Biogeochemical Classification of South Florida’s Estuarine and Coastal Waters of Tropical Seagrasses
Resumo:
South Florida’s watersheds have endured a century of urban and agricultural development and disruption of their hydrology. Spatial characterization of South Florida’s estuarine and coastal waters is important to Everglades’ restoration programs. We applied Factor Analysis and Hierarchical Clustering of water quality data in tandem to characterize and spatially subdivide South Florida’s coastal and estuarine waters. Segmentation rendered forty-four biogeochemically distinct water bodies whose spatial distribution is closely linked to geomorphology, circulation, benthic community pattern, and to water management. This segmentation has been adopted with minor changes by federal and state environmental agencies to derive numeric nutrient criteria.
Resumo:
A comprehensive, broadly accepted vegetation classification is important for ecosystem management, particularly for planning and monitoring. South Florida vegetation classification systems that are currently in use were largely arrived at subjectively and intuitively with the involvement of experienced botanical observers and ecologists, but with little support in terms of quantitative field data. The need to develop a field data-driven classification of South Florida vegetation that builds on the ecological organization has been recognized by the National Park Service and vegetation practitioners in the region. The present work, funded by the National Park Service Inventory and Monitoring Program - South Florida/Caribbean Network (SFCN), covers the first stage of a larger project whose goal is to apply extant vegetation data to test, and revise as necessary, an existing, widely used classification (Rutchey et al. 2006). The objectives of the first phase of the project were (1) to identify useful existing datasets, (2) to collect these data and compile them into a geodatabase, (3) to conduct an initial classification analysis of marsh sites, and (4) to design a strategy for augmenting existing information from poorly represented landscapes in order to develop a more comprehensive south Florida classification.
Resumo:
The purpose of this project was to evaluate the use of remote sensing 1) to detect and map Everglades wetland plant communities at different scales; and 2) to compare map products delineated and resampled at various scales with the intent to quantify and describe the quantitative and qualitative differences between such products. We evaluated data provided by Digital Globe’s WorldView 2 (WV2) sensor with a spatial resolution of 2m and data from Landsat’s Thematic and Enhanced Thematic Mapper (TM and ETM+) sensors with a spatial resolution of 30m. We were also interested in the comparability and scalability of products derived from these data sources. The adequacy of each data set to map wetland plant communities was evaluated utilizing two metrics: 1) model-based accuracy estimates of the classification procedures; and 2) design-based post-classification accuracy estimates of derived maps.
Resumo:
Software Engineering is one of the most widely researched areas of Computer Science. The ability to reuse software, much like reuse of hardware components is one of the key issues in software development. The object-oriented programming methodology is revolutionary in that it promotes software reusability. This thesis describes the development of a tool that helps programmers to design and implement software from within the Smalltalk Environment (an Object- Oriented programming environment). The ASDN tool is part of the PEREAM (Programming Environment for the Reuse and Evolution of Abstract Models) system, which advocates incremental development of software. The Asdn tool along with the PEREAM system seeks to enhance the Smalltalk programming environment by providing facilities for structured development of abstractions (concepts). It produces a document that describes the abstractions that are developed using this tool. The features of the ASDN tool are illustrated by an example.
Resumo:
The objective in this work is to build a rapid and automated numerical design method that makes optimal design of robots possible. In this work, two classes of optimal robot design problems were specifically addressed: (1) When the objective is to optimize a pre-designed robot, and (2) when the goal is to design an optimal robot from scratch. In the first case, to reach the optimum design some of the critical dimensions or specific measures to optimize (design parameters) are varied within an established range. Then the stress is calculated as a function of the design parameter(s), the design parameter(s) that optimizes a pre-determined performance index provides the optimum design. In the second case, this work focuses on the development of an automated procedure for the optimal design of robotic systems. For this purpose, Pro/Engineer© and MatLab© software packages are integrated to draw the robot parts, optimize them, and then re-draw the optimal system parts.