871 resultados para Extended Support Vector Machine
Resumo:
Empirical studies of education programs and systems, by nature, rely upon use of student outcomes that are measurable. Often, these come in the form of test scores. However, in light of growing evidence about the long-run importance of other student skills and behaviors, the time has come for a broader approach to evaluating education. This dissertation undertakes experimental, quasi-experimental, and descriptive analyses to examine social, behavioral, and health-related mechanisms of the educational process. My overarching research question is simply, which inside- and outside-the-classroom features of schools and educational interventions are most beneficial to students in the long term? Furthermore, how can we apply this evidence toward informing policy that could effectively reduce stark social, educational, and economic inequalities?
The first study of three assesses mechanisms by which the Fast Track project, a randomized intervention in the early 1990s for high-risk children in four communities (Durham, NC; Nashville, TN; rural PA; and Seattle, WA), reduced delinquency, arrests, and health and mental health service utilization in adolescence through young adulthood (ages 12-20). A decomposition of treatment effects indicates that about a third of Fast Track’s impact on later crime outcomes can be accounted for by improvements in social and self-regulation skills during childhood (ages 6-11), such as prosocial behavior, emotion regulation and problem solving. These skills proved less valuable for the prevention of mental and physical health problems.
The second study contributes new evidence on how non-instructional investments – such as increased spending on school social workers, guidance counselors, and health services – affect multiple aspects of student performance and well-being. Merging several administrative data sources spanning the 1996-2013 school years in North Carolina, I use an instrumental variables approach to estimate the extent to which local expenditure shifts affect students’ academic and behavioral outcomes. My findings indicate that exogenous increases in spending on non-instructional services not only reduce student absenteeism and disciplinary problems (important predictors of long-term outcomes) but also significantly raise student achievement, in similar magnitude to corresponding increases in instructional spending. Furthermore, subgroup analyses suggest that investments in student support personnel such as social workers, health services, and guidance counselors, in schools with concentrated low-income student populations could go a long way toward closing socioeconomic achievement gaps.
The third study examines individual pathways that lead to high school graduation or dropout. It employs a variety of machine learning techniques, including decision trees, random forests with bagging and boosting, and support vector machines, to predict student dropout using longitudinal administrative data from North Carolina. I consider a large set of predictor measures from grades three through eight including academic achievement, behavioral indicators, and background characteristics. My findings indicate that the most important predictors include eighth grade absences, math scores, and age-for-grade as well as early reading scores. Support vector classification (with a high cost parameter and low gamma parameter) predicts high school dropout with the highest overall validity in the testing dataset at 90.1 percent followed by decision trees with boosting and interaction terms at 89.5 percent.
Resumo:
Brain injury due to lack of oxygen or impaired blood flow around the time of birth, may cause long term neurological dysfunction or death in severe cases. The treatments need to be initiated as soon as possible and tailored according to the nature of the injury to achieve best outcomes. The Electroencephalogram (EEG) currently provides the best insight into neurological activities. However, its interpretation presents formidable challenge for the neurophsiologists. Moreover, such expertise is not widely available particularly around the clock in a typical busy Neonatal Intensive Care Unit (NICU). Therefore, an automated computerized system for detecting and grading the severity of brain injuries could be of great help for medical staff to diagnose and then initiate on-time treatments. In this study, automated systems for detection of neonatal seizures and grading the severity of Hypoxic-Ischemic Encephalopathy (HIE) using EEG and Heart Rate (HR) signals are presented. It is well known that there is a lot of contextual and temporal information present in the EEG and HR signals if examined at longer time scale. The systems developed in the past, exploited this information either at very early stage of the system without any intelligent block or at very later stage where presence of such information is much reduced. This work has particularly focused on the development of a system that can incorporate the contextual information at the middle (classifier) level. This is achieved by using dynamic classifiers that are able to process the sequences of feature vectors rather than only one feature vector at a time.
Resumo:
Hypertrophic cardiomyopathy (HCM) is a cardiovascular disease where the heart muscle is partially thickened and blood flow is - potentially fatally - obstructed. It is one of the leading causes of sudden cardiac death in young people. Electrocardiography (ECG) and Echocardiography (Echo) are the standard tests for identifying HCM and other cardiac abnormalities. The American Heart Association has recommended using a pre-participation questionnaire for young athletes instead of ECG or Echo tests due to considerations of cost and time involved in interpreting the results of these tests by an expert cardiologist. Initially we set out to develop a classifier for automated prediction of young athletes’ heart conditions based on the answers to the questionnaire. Classification results and further in-depth analysis using computational and statistical methods indicated significant shortcomings of the questionnaire in predicting cardiac abnormalities. Automated methods for analyzing ECG signals can help reduce cost and save time in the pre-participation screening process by detecting HCM and other cardiac abnormalities. Therefore, the main goal of this dissertation work is to identify HCM through computational analysis of 12-lead ECG. ECG signals recorded on one or two leads have been analyzed in the past for classifying individual heartbeats into different types of arrhythmia as annotated primarily in the MIT-BIH database. In contrast, we classify complete sequences of 12-lead ECGs to assign patients into two groups: HCM vs. non-HCM. The challenges and issues we address include missing ECG waves in one or more leads and the dimensionality of a large feature-set. We address these by proposing imputation and feature-selection methods. We develop heartbeat-classifiers by employing Random Forests and Support Vector Machines, and propose a method to classify full 12-lead ECGs based on the proportion of heartbeats classified as HCM. The results from our experiments show that the classifiers developed using our methods perform well in identifying HCM. Thus the two contributions of this thesis are the utilization of computational and statistical methods for discovering shortcomings in a current screening procedure and the development of methods to identify HCM through computational analysis of 12-lead ECG signals.
Resumo:
The continuous technology evaluation is benefiting our lives to a great extent. The evolution of Internet of things and deployment of wireless sensor networks is making it possible to have more connectivity between people and devices used extensively in our daily lives. Almost every discipline of daily life including health sector, transportation, agriculture etc. is benefiting from these technologies. There is a great potential of research and refinement of health sector as the current system is very often dependent on manual evaluations conducted by the clinicians. There is no automatic system for patient health monitoring and assessment which results to incomplete and less reliable heath information. Internet of things has a great potential to benefit health care applications by automated and remote assessment, monitoring and identification of diseases. Acute pain is the main cause of people visiting to hospitals. An automatic pain detection system based on internet of things with wireless devices can make the assessment and redemption significantly more efficient. The contribution of this research work is proposing pain assessment method based on physiological parameters. The physiological parameters chosen for this study are heart rate, electrocardiography, breathing rate and galvanic skin response. As a first step, the relation between these physiological parameters and acute pain experienced by the test persons is evaluated. The electrocardiography data collected from the test persons is analyzed to extract interbeat intervals. This evaluation clearly demonstrates specific patterns and trends in these parameters as a consequence of pain. This parametric behavior is then used to assess and identify the pain intensity by implementing machine learning algorithms. Support vector machines are used for classifying these parameters influenced by different pain intensities and classification results are achieved. The classification results with good accuracy rates between two and three levels of pain intensities shows clear indication of pain and the feasibility of this pain assessment method. An improved approach on the basis of this research work can be implemented by using both physiological parameters and electromyography data of facial muscles for classification.
Resumo:
SQL Injection Attack (SQLIA) remains a technique used by a computer network intruder to pilfer an organisation’s confidential data. This is done by an intruder re-crafting web form’s input and query strings used in web requests with malicious intent to compromise the security of an organisation’s confidential data stored at the back-end database. The database is the most valuable data source, and thus, intruders are unrelenting in constantly evolving new techniques to bypass the signature’s solutions currently provided in Web Application Firewalls (WAF) to mitigate SQLIA. There is therefore a need for an automated scalable methodology in the pre-processing of SQLIA features fit for a supervised learning model. However, obtaining a ready-made scalable dataset that is feature engineered with numerical attributes dataset items to train Artificial Neural Network (ANN) and Machine Leaning (ML) models is a known issue in applying artificial intelligence to effectively address ever evolving novel SQLIA signatures. This proposed approach applies numerical attributes encoding ontology to encode features (both legitimate web requests and SQLIA) to numerical data items as to extract scalable dataset for input to a supervised learning model in moving towards a ML SQLIA detection and prevention model. In numerical attributes encoding of features, the proposed model explores a hybrid of static and dynamic pattern matching by implementing a Non-Deterministic Finite Automaton (NFA). This combined with proxy and SQL parser Application Programming Interface (API) to intercept and parse web requests in transition to the back-end database. In developing a solution to address SQLIA, this model allows processed web requests at the proxy deemed to contain injected query string to be excluded from reaching the target back-end database. This paper is intended for evaluating the performance metrics of a dataset obtained by numerical encoding of features ontology in Microsoft Azure Machine Learning (MAML) studio using Two-Class Support Vector Machines (TCSVM) binary classifier. This methodology then forms the subject of the empirical evaluation.
Resumo:
Las organizaciones y sus entornos son sistemas complejos. Tales sistemas son difíciles de comprender y predecir. Pese a ello, la predicción es una tarea fundamental para la gestión empresarial y para la toma de decisiones que implica siempre un riesgo. Los métodos clásicos de predicción (entre los cuales están: la regresión lineal, la Autoregresive Moving Average y el exponential smoothing) establecen supuestos como la linealidad, la estabilidad para ser matemática y computacionalmente tratables. Por diferentes medios, sin embargo, se han demostrado las limitaciones de tales métodos. Pues bien, en las últimas décadas nuevos métodos de predicción han surgido con el fin de abarcar la complejidad de los sistemas organizacionales y sus entornos, antes que evitarla. Entre ellos, los más promisorios son los métodos de predicción bio-inspirados (ej. redes neuronales, algoritmos genéticos /evolutivos y sistemas inmunes artificiales). Este artículo pretende establecer un estado situacional de las aplicaciones actuales y potenciales de los métodos bio-inspirados de predicción en la administración.
Resumo:
Recently, we have built a classification model that is capable of assigning a given sesquiterpene lactone (STL) into exactly one tribe of the plant family Asteraceae from which the STL has been isolated. Although many plant species are able to biosynthesize a set of peculiar compounds, the occurrence of the same secondary metabolites in more than one tribe of Asteraceae is frequent. Building on our previous work, in this paper, we explore the possibility of assigning an STL to more than one tribe (class) simultaneously. When an object may belong to more than one class simultaneously, it is called multilabeled. In this work, we present a general overview of the techniques available to examine multilabeled data. The problem of evaluating the performance of a multilabeled classifier is discussed. Two particular multilabeled classification methods-cross-training with support vector machines (ct-SVM) and multilabeled k-nearest neighbors (M-L-kNN)were applied to the classification of the STLs into seven tribes from the plant family Asteraceae. The results are compared to a single-label classification and are analyzed from a chemotaxonomic point of view. The multilabeled approach allowed us to (1) model the reality as closely as possible, (2) improve our understanding of the relationship between the secondary metabolite profiles of different Asteraceae tribes, and (3) significantly decrease the number of plant sources to be considered for finding a certain STL. The presented classification models are useful for the targeted collection of plants with the objective of finding plant sources of natural compounds that are biologically active or possess other specific properties of interest.
Resumo:
In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called. 632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.
Resumo:
Trabalho de Projeto para obtenção do grau de Mestre em Engenharia Informática e de Computadores
Resumo:
Introdução Actualmente, as mensagens electrónicas são consideradas um importante meio de comunicação. As mensagens electrónicas – vulgarmente conhecidas como emails – são utilizadas fácil e frequentemente para enviar e receber o mais variado tipo de informação. O seu uso tem diversos fins gerando diariamente um grande número de mensagens e, consequentemente um enorme volume de informação. Este grande volume de informação requer uma constante manipulação das mensagens de forma a manter o conjunto organizado. Tipicamente esta manipulação consiste em organizar as mensagens numa taxonomia. A taxonomia adoptada reflecte os interesses e as preferências particulares do utilizador. Motivação A organização manual de emails é uma actividade morosa e que consome tempo. A optimização deste processo através da implementação de um método automático, tende a melhorar a satisfação do utilizador. Cada vez mais existe a necessidade de encontrar novas soluções para a manipulação de conteúdo digital poupando esforços e custos ao utilizador; esta necessidade, concretamente no âmbito da manipulação de emails, motivou a realização deste trabalho. Hipótese O objectivo principal deste projecto consiste em permitir a organização ad-hoc de emails com um esforço reduzido por parte do utilizador. A metodologia proposta visa organizar os emails num conjunto de categorias, disjuntas, que reflectem as preferências do utilizador. A principal finalidade deste processo é produzir uma organização onde as mensagens sejam classificadas em classes apropriadas requerendo o mínimo número esforço possível por parte do utilizador. Para alcançar os objectivos estipulados, este projecto recorre a técnicas de mineração de texto, em especial categorização automática de texto, e aprendizagem activa. Para reduzir a necessidade de inquirir o utilizador – para etiquetar exemplos de acordo com as categorias desejadas – foi utilizado o algoritmo d-confidence. Processo de organização automática de emails O processo de organizar automaticamente emails é desenvolvido em três fases distintas: indexação, classificação e avaliação. Na primeira fase, fase de indexação, os emails passam por um processo transformativo de limpeza que visa essencialmente gerar uma representação dos emails adequada ao processamento automático. A segunda fase é a fase de classificação. Esta fase recorre ao conjunto de dados resultantes da fase anterior para produzir um modelo de classificação, aplicando-o posteriormente a novos emails. Partindo de uma matriz onde são representados emails, termos e os seus respectivos pesos, e um conjunto de exemplos classificados manualmente, um classificador é gerado a partir de um processo de aprendizagem. O classificador obtido é então aplicado ao conjunto de emails e a classificação de todos os emails é alcançada. O processo de classificação é feito com base num classificador de máquinas de vectores de suporte recorrendo ao algoritmo de aprendizagem activa d-confidence. O algoritmo d-confidence tem como objectivo propor ao utilizador os exemplos mais significativos para etiquetagem. Ao identificar os emails com informação mais relevante para o processo de aprendizagem, diminui-se o número de iterações e consequentemente o esforço exigido por parte dos utilizadores. A terceira e última fase é a fase de avaliação. Nesta fase a performance do processo de classificação e a eficiência do algoritmo d-confidence são avaliadas. O método de avaliação adoptado é o método de validação cruzada denominado 10-fold cross validation. Conclusões O processo de organização automática de emails foi desenvolvido com sucesso, a performance do classificador gerado e do algoritmo d-confidence foi relativamente boa. Em média as categorias apresentam taxas de erro relativamente baixas, a não ser as classes mais genéricas. O esforço exigido pelo utilizador foi reduzido, já que com a utilização do algoritmo d-confidence obteve-se uma taxa de erro próxima do valor final, mesmo com um número de casos etiquetados abaixo daquele que é requerido por um método supervisionado. É importante salientar, que além do processo automático de organização de emails, este projecto foi uma excelente oportunidade para adquirir conhecimento consistente sobre mineração de texto e sobre os processos de classificação automática e recuperação de informação. O estudo de áreas tão interessantes despertou novos interesses que consistem em verdadeiros desafios futuros.
Resumo:
In research on Silent Speech Interfaces (SSI), different sources of information (modalities) have been combined, aiming at obtaining better performance than the individual modalities. However, when combining these modalities, the dimensionality of the feature space rapidly increases, yielding the well-known "curse of dimensionality". As a consequence, in order to extract useful information from this data, one has to resort to feature selection (FS) techniques to lower the dimensionality of the learning space. In this paper, we assess the impact of FS techniques for silent speech data, in a dataset with 4 non-invasive and promising modalities, namely: video, depth, ultrasonic Doppler sensing, and surface electromyography. We consider two supervised (mutual information and Fisher's ratio) and two unsupervised (meanmedian and arithmetic mean geometric mean) FS filters. The evaluation was made by assessing the classification accuracy (word recognition error) of three well-known classifiers (knearest neighbors, support vector machines, and dynamic time warping). The key results of this study show that both unsupervised and supervised FS techniques improve on the classification accuracy on both individual and combined modalities. For instance, on the video component, we attain relative performance gains of 36.2% in error rates. FS is also useful as pre-processing for feature fusion. Copyright © 2014 ISCA.
Resumo:
Dissertação apresentada para a obtenção do Grau de Doutor em Química Especialidade de Química Orgânica Pela Universidade Nova de Lisboa Faculdade de Ciências e Tecnologia
Resumo:
Many learning problems require handling high dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.
Resumo:
Na atualidade, está a emergir um novo paradigma de interação, designado por Natural User Interface (NUI) para reconhecimento de gestos produzidos com o corpo do utilizador. O dispositivo de interação Microsoft Kinect foi inicialmente concebido para controlo de videojogos, para a consola Xbox360. Este dispositivo demonstra ser uma aposta viável para explorar outras áreas, como a do apoio ao processo de ensino e de aprendizagem para crianças do ensino básico. O protótipo desenvolvido visa definir um modo de interação baseado no desenho de letras no ar, e realizar a interpretação dos símbolos desenhados, usando os reconhecedores de padrões Kernel Discriminant Analysis (KDA), Support Vector Machines (SVM) e $N. O desenvolvimento deste projeto baseou-se no estudo dos diferentes dispositivos NUI disponíveis no mercado, bibliotecas de desenvolvimento NUI para este tipo de dispositivos e algoritmos de reconhecimento de padrões. Com base nos dois elementos iniciais, foi possível obter uma visão mais concreta de qual o hardware e software disponíveis indicados à persecução do objetivo pretendido. O reconhecimento de padrões constitui um tema bastante extenso e complexo, de modo que foi necessária a seleção de um conjunto limitado deste tipo de algoritmos, realizando os respetivos testes por forma a determinar qual o que melhor se adequava ao objetivo pretendido. Aplicando as mesmas condições aos três algoritmos de reconhecimento de padrões permitiu avaliar as suas capacidades e determinar o $N como o que apresentou maior eficácia no reconhecimento. Por último, tentou-se averiguar a viabilidade do protótipo desenvolvido, tendo sido testado num universo de elementos de duas faixas etárias para determinar a capacidade de adaptação e aprendizagem destes dois grupos. Neste estudo, constatou-se um melhor desempenho inicial ao modo de interação do grupo de idade mais avançada. Contudo, o grupo mais jovem foi revelando uma evolutiva capacidade de adaptação a este modo de interação melhorando progressivamente os resultados.
Resumo:
Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade No Lisboa para obtenção de grau de Mestre em Engenharia de Informática