19 resultados para Data Accuracy
em Digital Commons at Florida International University
Resumo:
Correct specification of the simple location quotients in regionalizing the national direct requirements table is essential to the accuracy of regional input-output multipliers. The purpose of this research is to examine the relative accuracy of these multipliers when earnings, employment, number of establishments, and payroll data specify the simple location quotients.^ For each specification type, I derive a column of total output multipliers and a column of total income multipliers. These multipliers are based on the 1987 benchmark input-output accounts of the U.S. economy and 1988-1992 state of Florida data.^ Error sign tests, and Standardized Mean Absolute Deviation (SMAD) statistics indicate that the output multiplier estimates overestimate the output multipliers published by the Department of Commerce-Bureau of Economic Analysis (BEA) for the state of Florida. In contrast, the income multiplier estimates underestimate the BEA's income multipliers. For a given multiplier type, the Spearman-rank correlation analysis shows that the multiplier estimates and the BEA multipliers have statistically different rank ordering of row elements. The above tests also find no significant different differences, both in size and ranking distributions, among the vectors of multiplier estimates. ^
Resumo:
Methods for accessing data on the Web have been the focus of active research over the past few years. In this thesis we propose a method for representing Web sites as data sources. We designed a Data Extractor data retrieval solution that allows us to define queries to Web sites and process resulting data sets. Data Extractor is being integrated into the MSemODB heterogeneous database management system. With its help database queries can be distributed over both local and Web data sources within MSemODB framework. ^ Data Extractor treats Web sites as data sources, controlling query execution and data retrieval. It works as an intermediary between the applications and the sites. Data Extractor utilizes a twofold “custom wrapper” approach for information retrieval. Wrappers for the majority of sites are easily built using a powerful and expressive scripting language, while complex cases are processed using Java-based wrappers that utilize specially designed library of data retrieval, parsing and Web access routines. In addition to wrapper development we thoroughly investigate issues associated with Web site selection, analysis and processing. ^ Data Extractor is designed to act as a data retrieval server, as well as an embedded data retrieval solution. We also use it to create mobile agents that are shipped over the Internet to the client's computer to perform data retrieval on behalf of the user. This approach allows Data Extractor to distribute and scale well. ^ This study confirms feasibility of building custom wrappers for Web sites. This approach provides accuracy of data retrieval, and power and flexibility in handling of complex cases. ^
Resumo:
This dissertation develops a new mathematical approach that overcomes the effect of a data processing phenomenon known as “histogram binning” inherent to flow cytometry data. A real-time procedure is introduced to prove the effectiveness and fast implementation of such an approach on real-world data. The histogram binning effect is a dilemma posed by two seemingly antagonistic developments: (1) flow cytometry data in its histogram form is extended in its dynamic range to improve its analysis and interpretation, and (2) the inevitable dynamic range extension introduces an unwelcome side effect, the binning effect, which skews the statistics of the data, undermining as a consequence the accuracy of the analysis and the eventual interpretation of the data. ^ Researchers in the field contended with such a dilemma for many years, resorting either to hardware approaches that are rather costly with inherent calibration and noise effects; or have developed software techniques based on filtering the binning effect but without successfully preserving the statistical content of the original data. ^ The mathematical approach introduced in this dissertation is so appealing that a patent application has been filed. The contribution of this dissertation is an incremental scientific innovation based on a mathematical framework that will allow researchers in the field of flow cytometry to improve the interpretation of data knowing that its statistical meaning has been faithfully preserved for its optimized analysis. Furthermore, with the same mathematical foundation, proof of the origin of such an inherent artifact is provided. ^ These results are unique in that new mathematical derivations are established to define and solve the critical problem of the binning effect faced at the experimental assessment level, providing a data platform that preserves its statistical content. ^ In addition, a novel method for accumulating the log-transformed data was developed. This new method uses the properties of the transformation of statistical distributions to accumulate the output histogram in a non-integer and multi-channel fashion. Although the mathematics of this new mapping technique seem intricate, the concise nature of the derivations allow for an implementation procedure that lends itself to a real-time implementation using lookup tables, a task that is also introduced in this dissertation. ^
Resumo:
Historical accuracy is only one of the components of a scholarly college textbook used to teach the history of jazz music. Textbooks in this field should include accurate ethnic representation of the most important musical figures as jazz is considered the only original American art form. As college and universities celebrate diversity, it is important that jazz history be accurate and complete. ^ The purpose of this study was to examine the content of the most commonly used jazz history textbooks currently used at American colleges and universities. This qualitative study utilized grounded and textual analysis to explore the existence of ethnic representation in these texts. The methods used were modeled after the work of Kane and Selden each of whom conducted a content analysis focused on a limited field of study. This study is focused on key jazz artists and composers whose work was created in the periods of early jazz (1915-1930), swing (1930-1945) and modern jazz (1945-1960). ^ This study considered jazz notables within the texts in terms of ethnic representation, authors' use of language, contributions to the jazz canon, and place in the standard jazz repertoire. Appropriate historical sections of the selected texts were reviewed and coded using predetermined rubrics. Data were then aggregated into categories and then analyzed according to the character assigned to the key jazz personalities noted in the text as well as the comparative standing afforded each personality. ^ The results of this study demonstrate that particular key African-American jazz artists and composers occupy a significant place in these texts while other significant individuals representing other ethnic groups are consistently overlooked. This finding suggests that while America and the world celebrates the quality of the product of American jazz as great musically and significant socially, many ethnic contributors are not mentioned with the result being a less than complete picture of the evolution of this American art form. ^
Resumo:
The nation's freeway systems are becoming increasingly congested. A major contribution to traffic congestion on freeways is due to traffic incidents. Traffic incidents are non-recurring events such as accidents or stranded vehicles that cause a temporary roadway capacity reduction, and they can account for as much as 60 percent of all traffic congestion on freeways. One major freeway incident management strategy involves diverting traffic to avoid incident locations by relaying timely information through Intelligent Transportation Systems (ITS) devices such as dynamic message signs or real-time traveler information systems. The decision to divert traffic depends foremost on the expected duration of an incident, which is difficult to predict. In addition, the duration of an incident is affected by many contributing factors. Determining and understanding these factors can help the process of identifying and developing better strategies to reduce incident durations and alleviate traffic congestion. A number of research studies have attempted to develop models to predict incident durations, yet with limited success. ^ This dissertation research attempts to improve on this previous effort by applying data mining techniques to a comprehensive incident database maintained by the District 4 ITS Office of the Florida Department of Transportation (FDOT). Two categories of incident duration prediction models were developed: "offline" models designed for use in the performance evaluation of incident management programs, and "online" models for real-time prediction of incident duration to aid in the decision making of traffic diversion in the event of an ongoing incident. Multiple data mining analysis techniques were applied and evaluated in the research. The multiple linear regression analysis and decision tree based method were applied to develop the offline models, and the rule-based method and a tree algorithm called M5P were used to develop the online models. ^ The results show that the models in general can achieve high prediction accuracy within acceptable time intervals of the actual durations. The research also identifies some new contributing factors that have not been examined in past studies. As part of the research effort, software code was developed to implement the models in the existing software system of District 4 FDOT for actual applications. ^
Resumo:
This study investigated the effects of self-monitoring on the homework completion and accuracy rates of four, fourth-grade students with disabilities in an inclusive general education classroom. A multiple baseline across subjects design was utilized to examine four dependent variables: completion of spelling homework, accuracy of spelling homework, completion of math homework, accuracy of math homework. Data were collected and analyzed during baseline, three phases of intervention, and maintenance. ^ Throughout baseline and all phases, participants followed typical classroom procedures, brought their homework to school each day and gave it to the general education teacher. During Phase I of the intervention, participants self-monitored with a daily sheet at home and on the computer at school in the morning using KidTools (Fitzgerald & Koury, 2003); a student friendly, self-monitoring program. They also participated in brief daily conferences to review their self-monitoring sheets with the investigator, their special education teacher. Phase II followed the same steps except conferencing was reduced to two days a week, which were randomly selected by the researcher and Phase III conferencing was one random day a week. Maintenance data were taken over a two-to-three week period subsequent to the end of the intervention. ^ Results of this study demonstrated self-monitoring substantially improved spelling and math homework completion and accuracy rates of students with disabilities in an inclusive, general education classroom. On average, completion and accuracy rates were highest over baseline in Phase III. Self-monitoring led to higher percentages of completion and accuracy during each phase of the intervention compared to baseline, group percentages also rose slightly during maintenance. Therefore, results suggest self-monitoring leads to short-term maintenance in spelling and math homework completion and accuracy. ^ This study adds to the existing literature by investigating the effects of self-monitoring of homework for students with disabilities included in general education classrooms. Future research should consider selecting participants with other demographic characteristics, using peers for conferencing instead of the teacher, and the use of self-monitoring with other academic subjects (e.g., science, history). Additionally, future research could investigate the effects of each of the two self-monitoring components used alone, with or without the conferencing.^
Resumo:
Flow Cytometry analyzers have become trusted companions due to their ability to perform fast and accurate analyses of human blood. The aim of these analyses is to determine the possible existence of abnormalities in the blood that have been correlated with serious disease states, such as infectious mononucleosis, leukemia, and various cancers. Though these analyzers provide important feedback, it is always desired to improve the accuracy of the results. This is evidenced by the occurrences of misclassifications reported by some users of these devices. It is advantageous to provide a pattern interpretation framework that is able to provide better classification ability than is currently available. Toward this end, the purpose of this dissertation was to establish a feature extraction and pattern classification framework capable of providing improved accuracy for detecting specific hematological abnormalities in flow cytometric blood data. ^ This involved extracting a unique and powerful set of shift-invariant statistical features from the multi-dimensional flow cytometry data and then using these features as inputs to a pattern classification engine composed of an artificial neural network (ANN). The contribution of this method consisted of developing a descriptor matrix that can be used to reliably assess if a donor’s blood pattern exhibits a clinically abnormal level of variant lymphocytes, which are blood cells that are potentially indicative of disorders such as leukemia and infectious mononucleosis. ^ This study showed that the set of shift-and-rotation-invariant statistical features extracted from the eigensystem of the flow cytometric data pattern performs better than other commonly-used features in this type of disease detection, exhibiting an accuracy of 80.7%, a sensitivity of 72.3%, and a specificity of 89.2%. This performance represents a major improvement for this type of hematological classifier, which has historically been plagued by poor performance, with accuracies as low as 60% in some cases. This research ultimately shows that an improved feature space was developed that can deliver improved performance for the detection of variant lymphocytes in human blood, thus providing significant utility in the realm of suspect flagging algorithms for the detection of blood-related diseases.^
Resumo:
The accurate and reliable estimation of travel time based on point detector data is needed to support Intelligent Transportation System (ITS) applications. It has been found that the quality of travel time estimation is a function of the method used in the estimation and varies for different traffic conditions. In this study, two hybrid on-line travel time estimation models, and their corresponding off-line methods, were developed to achieve better estimation performance under various traffic conditions, including recurrent congestion and incidents. The first model combines the Mid-Point method, which is a speed-based method, with a traffic flow-based method. The second model integrates two speed-based methods: the Mid-Point method and the Minimum Speed method. In both models, the switch between travel time estimation methods is based on the congestion level and queue status automatically identified by clustering analysis. During incident conditions with rapidly changing queue lengths, shock wave analysis-based refinements are applied for on-line estimation to capture the fast queue propagation and recovery. Travel time estimates obtained from existing speed-based methods, traffic flow-based methods, and the models developed were tested using both simulation and real-world data. The results indicate that all tested methods performed at an acceptable level during periods of low congestion. However, their performances vary with an increase in congestion. Comparisons with other estimation methods also show that the developed hybrid models perform well in all cases. Further comparisons between the on-line and off-line travel time estimation methods reveal that off-line methods perform significantly better only during fast-changing congested conditions, such as during incidents. The impacts of major influential factors on the performance of travel time estimation, including data preprocessing procedures, detector errors, detector spacing, frequency of travel time updates to traveler information devices, travel time link length, and posted travel time range, were investigated in this study. The results show that these factors have more significant impacts on the estimation accuracy and reliability under congested conditions than during uncongested conditions. For the incident conditions, the estimation quality improves with the use of a short rolling period for data smoothing, more accurate detector data, and frequent travel time updates.
Resumo:
The rapid growth of virtualized data centers and cloud hosting services is making the management of physical resources such as CPU, memory, and I/O bandwidth in data center servers increasingly important. Server management now involves dealing with multiple dissimilar applications with varying Service-Level-Agreements (SLAs) and multiple resource dimensions. The multiplicity and diversity of resources and applications are rendering administrative tasks more complex and challenging. This thesis aimed to develop a framework and techniques that would help substantially reduce data center management complexity.^ We specifically addressed two crucial data center operations. First, we precisely estimated capacity requirements of client virtual machines (VMs) while renting server space in cloud environment. Second, we proposed a systematic process to efficiently allocate physical resources to hosted VMs in a data center. To realize these dual objectives, accurately capturing the effects of resource allocations on application performance is vital. The benefits of accurate application performance modeling are multifold. Cloud users can size their VMs appropriately and pay only for the resources that they need; service providers can also offer a new charging model based on the VMs performance instead of their configured sizes. As a result, clients will pay exactly for the performance they are actually experiencing; on the other hand, administrators will be able to maximize their total revenue by utilizing application performance models and SLAs. ^ This thesis made the following contributions. First, we identified resource control parameters crucial for distributing physical resources and characterizing contention for virtualized applications in a shared hosting environment. Second, we explored several modeling techniques and confirmed the suitability of two machine learning tools, Artificial Neural Network and Support Vector Machine, to accurately model the performance of virtualized applications. Moreover, we suggested and evaluated modeling optimizations necessary to improve prediction accuracy when using these modeling tools. Third, we presented an approach to optimal VM sizing by employing the performance models we created. Finally, we proposed a revenue-driven resource allocation algorithm which maximizes the SLA-generated revenue for a data center.^
Resumo:
Ensemble Stream Modeling and Data-cleaning are sensor information processing systems have different training and testing methods by which their goals are cross-validated. This research examines a mechanism, which seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events to eliminate the noises that are uncorrelated, and choose the most likely model without over fitting thus obtaining higher model confidence. Higher quality streams can be realized by combining many short streams into an ensemble which has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction such as a bush or natural forest-fire event we make an assumption of the burnt area (BA*), sensed ground truth as our target variable obtained from logs. Even though this is an obvious model choice the results are disappointing. The reasons for this are two: One, the histogram of fire activity is highly skewed. Two, the measured sensor parameters are highly correlated. Since using non descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use F-measure score, which combines precision and accuracy to determine the false alarm rate of fire events. The multi-target data-cleaning trees use information purity of the target leaf-nodes to learn higher order features. A sensitive variance measure such as ƒ-test is performed during each node's split to select the best attribute. Ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier. The ensemble framework for data-cleaning and the enhancements to quantify quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensor led to the formation of streams for sensor-enabled applications. Which further motivates the novelty of stream quality labeling and its importance in solving vast amounts of real-time mobile streams generated today.
Resumo:
The promise of Wireless Sensor Networks (WSNs) is the autonomous collaboration of a collection of sensors to accomplish some specific goals which a single sensor cannot offer. Basically, sensor networking serves a range of applications by providing the raw data as fundamentals for further analyses and actions. The imprecision of the collected data could tremendously mislead the decision-making process of sensor-based applications, resulting in an ineffectiveness or failure of the application objectives. Due to inherent WSN characteristics normally spoiling the raw sensor readings, many research efforts attempt to improve the accuracy of the corrupted or "dirty" sensor data. The dirty data need to be cleaned or corrected. However, the developed data cleaning solutions restrict themselves to the scope of static WSNs where deployed sensors would rarely move during the operation. Nowadays, many emerging applications relying on WSNs need the sensor mobility to enhance the application efficiency and usage flexibility. The location of deployed sensors needs to be dynamic. Also, each sensor would independently function and contribute its resources. Sensors equipped with vehicles for monitoring the traffic condition could be depicted as one of the prospective examples. The sensor mobility causes a transient in network topology and correlation among sensor streams. Based on static relationships among sensors, the existing methods for cleaning sensor data in static WSNs are invalid in such mobile scenarios. Therefore, a solution of data cleaning that considers the sensor movements is actively needed. This dissertation aims to improve the quality of sensor data by considering the consequences of various trajectory relationships of autonomous mobile sensors in the system. First of all, we address the dynamic network topology due to sensor mobility. The concept of virtual sensor is presented and used for spatio-temporal selection of neighboring sensors to help in cleaning sensor data streams. This method is one of the first methods to clean data in mobile sensor environments. We also study the mobility pattern of moving sensors relative to boundaries of sub-areas of interest. We developed a belief-based analysis to determine the reliable sets of neighboring sensors to improve the cleaning performance, especially when node density is relatively low. Finally, we design a novel sketch-based technique to clean data from internal sensors where spatio-temporal relationships among sensors cannot lead to the data correlations among sensor streams.
Resumo:
The purpose of this project was to evaluate the use of remote sensing 1) to detect and map Everglades wetland plant communities at different scales; and 2) to compare map products delineated and resampled at various scales with the intent to quantify and describe the quantitative and qualitative differences between such products. We evaluated data provided by Digital Globe’s WorldView 2 (WV2) sensor with a spatial resolution of 2m and data from Landsat’s Thematic and Enhanced Thematic Mapper (TM and ETM+) sensors with a spatial resolution of 30m. We were also interested in the comparability and scalability of products derived from these data sources. The adequacy of each data set to map wetland plant communities was evaluated utilizing two metrics: 1) model-based accuracy estimates of the classification procedures; and 2) design-based post-classification accuracy estimates of derived maps.
Resumo:
The purpose of this study was to determine which of the two methods is more appropriate to teach pitch discrimination to Grade 6 choral students to improve sight-singing note accuracy. This study consisted of three phases: pre-testing, instruction and post-testing. During the four week study, the experimental group received training using the Kodaly method while the control group received training using the traditional method. The pre and post tests were evaluated by three trained musicians. The analysis of the data utilized an independent t-test and a paired t-test with the methods of teaching (experimental and control) as a factor. Quantitative results suggest that the experimental subjects, those receiving Kodaly instruction at post-treatment showed a significant improvement in the pitch accuracy than the control group. The specific change resulted in the Kodaly method to be more effective in producing accurate pitch in sight-singing.
Resumo:
This dissertation develops a new mathematical approach that overcomes the effect of a data processing phenomenon known as "histogram binning" inherent to flow cytometry data. A real-time procedure is introduced to prove the effectiveness and fast implementation of such an approach on real-world data. The histogram binning effect is a dilemma posed by two seemingly antagonistic developments: (1) flow cytometry data in its histogram form is extended in its dynamic range to improve its analysis and interpretation, and (2) the inevitable dynamic range extension introduces an unwelcome side effect, the binning effect, which skews the statistics of the data, undermining as a consequence the accuracy of the analysis and the eventual interpretation of the data. Researchers in the field contended with such a dilemma for many years, resorting either to hardware approaches that are rather costly with inherent calibration and noise effects; or have developed software techniques based on filtering the binning effect but without successfully preserving the statistical content of the original data. The mathematical approach introduced in this dissertation is so appealing that a patent application has been filed. The contribution of this dissertation is an incremental scientific innovation based on a mathematical framework that will allow researchers in the field of flow cytometry to improve the interpretation of data knowing that its statistical meaning has been faithfully preserved for its optimized analysis. Furthermore, with the same mathematical foundation, proof of the origin of such an inherent artifact is provided. These results are unique in that new mathematical derivations are established to define and solve the critical problem of the binning effect faced at the experimental assessment level, providing a data platform that preserves its statistical content. In addition, a novel method for accumulating the log-transformed data was developed. This new method uses the properties of the transformation of statistical distributions to accumulate the output histogram in a non-integer and multi-channel fashion. Although the mathematics of this new mapping technique seem intricate, the concise nature of the derivations allow for an implementation procedure that lends itself to a real-time implementation using lookup tables, a task that is also introduced in this dissertation.
Resumo:
Correct specification of the simple location quotients in regionalizing the national direct requirements table is essential to the accuracy of regional input-output multipliers. The purpose of this research is to examine the relative accuracy of these multipliers when earnings, employment, number of establishments, and payroll data specify the simple location quotients. For each specification type, I derive a column of total output multipliers and a column of total income multipliers. These multipliers are based on the 1987 benchmark input-output accounts of the U.S. economy and 1988-1992 state of Florida data. Error sign tests, and Standardized Mean Absolute Deviation (SMAD) statistics indicate that the output multiplier estimates overestimate the output multipliers published by the Department of Commerce-Bureau of Economic Analysis (BEA) for the state of Florida. In contrast, the income multiplier estimates underestimate the BEA's income multipliers. For a given multiplier type, the Spearman-rank correlation analysis shows that the multiplier estimates and the BEA multipliers have statistically different rank ordering of row elements. The above tests also find no significant different differences, both in size and ranking distributions, among the vectors of multiplier estimates.