820 resultados para Data-Mining Techniques
Resumo:
The origins for this work arise in response to the increasing need for biologists and doctors to obtain tools for visual analysis of data. When dealing with multidimensional data, such as medical data, the traditional data mining techniques can be a tedious and complex task, even to some medical experts. Therefore, it is necessary to develop useful visualization techniques that can complement the expert’s criterion, and at the same time visually stimulate and make easier the process of obtaining knowledge from a dataset. Thus, the process of interpretation and understanding of the data can be greatly enriched. Multidimensionality is inherent to any medical data, requiring a time-consuming effort to get a clinical useful outcome. Unfortunately, both clinicians and biologists are not trained in managing more than four dimensions. Specifically, we were aimed to design a 3D visual interface for gene profile analysis easy in order to be used both by medical and biologist experts. In this way, a new analysis method is proposed: MedVir. This is a simple and intuitive analysis mechanism based on the visualization of any multidimensional medical data in a three dimensional space that allows interaction with experts in order to collaborate and enrich this representation. In other words, MedVir makes a powerful reduction in data dimensionality in order to represent the original information into a three dimensional environment. The experts can interact with the data and draw conclusions in a visual and quickly way.
Resumo:
The use of data mining techniques for the gene profile discovery of diseases, such as cancer, is becoming usual in many researches. These techniques do not usually analyze the relationships between genes in depth, depending on the different variety of manifestations of the disease (related to patients). This kind of analysis takes a considerable amount of time and is not always the focus of the research. However, it is crucial in order to generate personalized treatments to fight the disease. Thus, this research focuses on finding a mechanism for gene profile analysis to be used by the medical and biologist experts. Results: In this research, the MedVir framework is proposed. It is an intuitive mechanism based on the visualization of medical data such as gene profiles, patients, clinical data, etc. MedVir, which is based on an Evolutionary Optimization technique, is a Dimensionality Reduction (DR) approach that presents the data in a three dimensional space. Furthermore, thanks to Virtual Reality technology, MedVir allows the expert to interact with the data in order to tailor it to the experience and knowledge of the expert.
Resumo:
Objective The main purpose of this research is the novel use of artificial metaplasticity on multilayer perceptron (AMMLP) as a data mining tool for prediction the outcome of patients with acquired brain injury (ABI) after cognitive rehabilitation. The final goal aims at increasing knowledge in the field of rehabilitation theory based on cognitive affectation. Methods and materials The data set used in this study contains records belonging to 123 ABI patients with moderate to severe cognitive affectation (according to Glasgow Coma Scale) that underwent rehabilitation at Institut Guttmann Neurorehabilitation Hospital (IG) using the tele-rehabilitation platform PREVIRNEC©. The variables included in the analysis comprise the neuropsychological initial evaluation of the patient (cognitive affectation profile), the results of the rehabilitation tasks performed by the patient in PREVIRNEC© and the outcome of the patient after a 3–5 months treatment. To achieve the treatment outcome prediction, we apply and compare three different data mining techniques: the AMMLP model, a backpropagation neural network (BPNN) and a C4.5 decision tree. Results The prediction performance of the models was measured by ten-fold cross validation and several architectures were tested. The results obtained by the AMMLP model are clearly superior, with an average predictive performance of 91.56%. BPNN and C4.5 models have a prediction average accuracy of 80.18% and 89.91% respectively. The best single AMMLP model provided a specificity of 92.38%, a sensitivity of 91.76% and a prediction accuracy of 92.07%. Conclusions The proposed prediction model presented in this study allows to increase the knowledge about the contributing factors of an ABI patient recovery and to estimate treatment efficacy in individual patients. The ability to predict treatment outcomes may provide new insights toward improving effectiveness and creating personalized therapeutic interventions based on clinical evidence.
Resumo:
El abandono universitario es un problema que siempre ha afectado a las distintas universidades públicas. La Universidad Politécnica de Madrid, en concreto la Escuela Técnica Superior de Ingenieros Informáticos, pretende reducir dicho abandono aplicando técnicas de Data Mining para detectar aquellos alumnos que abandonan antes de que lo hagan. Este proyecto, a través de la metodología CRISP-DM (CRoss Industry Standard Process for Data Mining) y con los datos obtenidos del SIIU (Sistema Integrado de Información Universitaria), tiene como fin identificar a los alumnos que abandonan el primer añoo para poder aplicar distintas políticas en ellos y así evitarlo. ---ABSTRACT---Early university abandonment is a problem which has affected different public universities. The Universidad Polit´ecnica de Madrid (Polytechnic University of Madrid), inparticular the Escuela Técnica Superior de Ingenieros Informáticos (Computer Science Engineering School), wants to reduce this abandonment rate using Data Mining techniques to find students who are likely to leave before they do. The purpose of this project, through CRISP-DM (CRoss Industry Standard Process for Data Mining) methodology and using the data from SIIU (Sistema Integrado de Informaci´on Universitaria), is to identify those students who leave the first year in order to apply different policies and avoid it.
Resumo:
Durante los últimos años ha aumentado la presencia de personas pertenecientes al mundo de la política en la red debido a la proliferación de las redes sociales, siendo Twitter la que mayor repercusión mediática tiene en este ámbito. El estudio del comportamiento de los políticos en Twitter y de la acogida que tienen entre los ciudadanos proporciona información muy valiosa a la hora de analizar las campañas electorales. De esta forma, se puede estudiar la repercusión real que tienen sus mensajes en los resultados electorales, así como distinguir aquellos comportamientos que tienen una mayor aceptación por parte de la la ciudadaná. Gracias a los avances desarrollados en el campo de la minería de textos, se poseen las herramientas necesarias para analizar un gran volumen de textos y extraer de ellos información de utilidad. Este proyecto tiene como finalidad recopilar una muestra significativa de mensajes de Twitter pertenecientes a los candidatos de los principales partidos políticos que se presentan a las elecciones autonómicas de Madrid en 2015. Estos mensajes, junto con las respuestas de otros usuarios, se han analizado usando algoritmos de aprendizaje automático y aplicando las técnicas de minería de textos más oportunas. Los resultados obtenidos para cada político se han examinado en profundidad y se han presentado mediante tablas y gráficas para facilitar su comprensión.---ABSTRACT---During the past few years the presence on the Internet of people related with politics has increased, due to the proliferation of social networks. Among all existing social networks, Twitter is the one which has the greatest media impact in this field. Therefore, an analysis of the behaviour of politicians in this social network, along with the response from the citizens, gives us very valuable information when analysing electoral campaigns. This way it is possible to know their messages impact in the election results. Moreover, it can be inferred which behaviours have better acceptance among the citizenship. Thanks to the advances achieved in the text mining field, its tools can be used to analyse a great amount of texts and extract from them useful information. The present project aims to collect a significant sample of Twitter messages from the candidates of the principal political parties for the 2015 autonomic elections in Madrid. These messages, as well as the answers received by the other users, have been analysed using machine learning algorithms and applying the most suitable data mining techniques. The results obtained for each politician have been examined in depth and have been presented using tables and graphs to make its understanding easier.
Resumo:
Poder clasificar de manera precisa la aplicación o programa del que provienen los flujos que conforman el tráfico de uso de Internet dentro de una red permite tanto a empresas como a organismos una útil herramienta de gestión de los recursos de sus redes, así como la posibilidad de establecer políticas de prohibición o priorización de tráfico específico. La proliferación de nuevas aplicaciones y de nuevas técnicas han dificultado el uso de valores conocidos (well-known) en puertos de aplicaciones proporcionados por la IANA (Internet Assigned Numbers Authority) para la detección de dichas aplicaciones. Las redes P2P (Peer to Peer), el uso de puertos no conocidos o aleatorios, y el enmascaramiento de tráfico de muchas aplicaciones en tráfico HTTP y HTTPS con el fin de atravesar firewalls y NATs (Network Address Translation), entre otros, crea la necesidad de nuevos métodos de detección de tráfico. El objetivo de este estudio es desarrollar una serie de prácticas que permitan realizar dicha tarea a través de técnicas que están más allá de la observación de puertos y otros valores conocidos. Existen una serie de metodologías como Deep Packet Inspection (DPI) que se basa en la búsqueda de firmas, signatures, en base a patrones creados por el contenido de los paquetes, incluido el payload, que caracterizan cada aplicación. Otras basadas en el aprendizaje automático de parámetros de los flujos, Machine Learning, que permite determinar mediante análisis estadísticos a qué aplicación pueden pertenecer dichos flujos y, por último, técnicas de carácter más heurístico basadas en la intuición o el conocimiento propio sobre tráfico de red. En concreto, se propone el uso de alguna de las técnicas anteriormente comentadas en conjunto con técnicas de minería de datos como son el Análisis de Componentes Principales (PCA por sus siglas en inglés) y Clustering de estadísticos extraídos de los flujos procedentes de ficheros de tráfico de red. Esto implicará la configuración de diversos parámetros que precisarán de un proceso iterativo de prueba y error que permita dar con una clasificación del tráfico fiable. El resultado ideal sería aquel en el que se pudiera identificar cada aplicación presente en el tráfico en un clúster distinto, o en clusters que agrupen grupos de aplicaciones de similar naturaleza. Para ello, se crearán capturas de tráfico dentro de un entorno controlado e identificando cada tráfico con su aplicación correspondiente, a continuación se extraerán los flujos de dichas capturas. Tras esto, parámetros determinados de los paquetes pertenecientes a dichos flujos serán obtenidos, como por ejemplo la fecha y hora de llagada o la longitud en octetos del paquete IP. Estos parámetros serán cargados en una base de datos MySQL y serán usados para obtener estadísticos que ayuden, en un siguiente paso, a realizar una clasificación de los flujos mediante minería de datos. Concretamente, se usarán las técnicas de PCA y clustering haciendo uso del software RapidMiner. Por último, los resultados obtenidos serán plasmados en una matriz de confusión que nos permitirá que sean valorados correctamente. ABSTRACT. Being able to classify the applications that generate the traffic flows in an Internet network allows companies and organisms to implement efficient resource management policies such as prohibition of specific applications or prioritization of certain application traffic, looking for an optimization of the available bandwidth. The proliferation of new applications and new technics in the last years has made it more difficult to use well-known values assigned by the IANA (Internet Assigned Numbers Authority), like UDP and TCP ports, to identify the traffic. Also, P2P networks and data encapsulation over HTTP and HTTPS traffic has increased the necessity to improve these traffic analysis technics. The aim of this project is to develop a number of techniques that make us able to classify the traffic with more than the simple observation of the well-known ports. There are some proposals that have been created to cover this necessity; Deep Packet Inspection (DPI) tries to find signatures in the packets reading the information contained in them, the payload, looking for patterns that can be used to characterize the applications to which that traffic belongs; Machine Learning procedures work with statistical analysis of the flows, trying to generate an automatic process that learns from those statistical parameters and calculate the likelihood of a flow pertaining to a certain application; Heuristic Techniques, finally, are based in the intuition or the knowledge of the researcher himself about the traffic being analyzed that can help him to characterize the traffic. Specifically, the use of some of the techniques previously mentioned in combination with data mining technics such as Principal Component Analysis (PCA) and Clustering (grouping) of the flows extracted from network traffic captures are proposed. An iterative process based in success and failure will be needed to configure these data mining techniques looking for a reliable traffic classification. The perfect result would be the one in which the traffic flows of each application is grouped correctly in each cluster or in clusters that contain group of applications of similar nature. To do this, network traffic captures will be created in a controlled environment in which every capture is classified and known to pertain to a specific application. Then, for each capture, all the flows will be extracted. These flows will be used to extract from them information such as date and arrival time or the IP length of the packets inside them. This information will be then loaded to a MySQL database where all the packets defining a flow will be classified and also, each flow will be assigned to its specific application. All the information obtained from the packets will be used to generate statistical parameters in order to describe each flow in the best possible way. After that, data mining techniques previously mentioned (PCA and Clustering) will be used on these parameters making use of the software RapidMiner. Finally, the results obtained from the data mining will be compared with the real classification of the flows that can be obtained from the database. A Confusion Matrix will be used for the comparison, letting us measure the veracity of the developed classification process.
Estudio de patrones de interacción entre los estudiantes y la Plataforma de Tele-Enseñanza en la UPM
Resumo:
Vivimos en una sociedad en la que la información ha adquirido una vital importancia. El uso de Internet y el desarrollo de nuevos sistemas de la información han generado un ferviente interés tanto de empresas como de instituciones en la búsqueda de nuevos patrones que les proporcione la clave del éxito. La Analítica de Negocio reúne un conjunto de herramientas, estrategias y técnicas orientadas a la explotación de la información con el objetivo de crear conocimiento útil dentro de un marco de trabajo y facilitar la optimización de los recursos tanto de empresas como de instituciones. El presente proyecto se enmarca en lo que se conoce como Gestión Educativa. Se aplicará una arquitectura y modelo de trabajo similar a lo que se ha venido haciendo en los últimos años en el entorno empresarial con la Inteligencia de Negocio. Con esta variante, se pretende mejorar la calidad de la enseñanza, agilizar las decisiones dentro de la institución académica, fortalecer las capacidades del cuerpo docente y en definitiva favorecer el aprendizaje del alumnado. Para lograr el objetivo se ha decidido seguir las etapas del Knowledge Discovery in Databases (KDD), una de las metodologías más conocidas dentro de la Inteligencia de Negocio, que describe el procedimiento que va desde la selección de la información y su carga en sistemas de almacenamiento, hasta la aplicación de técnicas de minería de datos para la obtención nuevo conocimiento. Los estudios se realizan a partir de la información de la activad de los usuarios dentro la plataforma de Tele-Enseñanza de la Universidad Politécnica de Madrid (Moodle). Se desarrollan trabajos de extracción y preprocesado de la base de datos en crudo y se aplican técnicas de minería de datos. En la aplicación de técnicas de minería de datos, uno de los factores más importantes a tener en cuenta es el tipo de información que se va a tratar. Por este motivo, se trabaja con la Minería de Datos Educativa, en inglés, Educational Data Mining (EDM) que consiste en la aplicación de técnicas de minería optimizadas para la información que se genera en entornos educativos. Dentro de las posibilidades que ofrece el EDM, se ha decidido centrar los estudios en lo que se conoce como analítica predictiva. El objetivo fundamental es conocer la influencia que tienen las interacciones alumno-plataforma en las calificaciones finales y descubrir nuevas reglas que describan comportamientos que faciliten al profesorado discriminar si un estudiante va a aprobar o suspender la asignatura, de tal forma que se puedan tomar medidas que mejoren su rendimiento. Toda la información tratada en el presente proyecto ha sido previamente anonimizada para evitar cualquier tipo de intromisión que atente contra la privacidad de los elementos participantes en el estudio. ABSTRACT. We live in a society dominated by data. The use of the Internet accompanied by developments in information systems has generated a sustained interest among companies and institutions to discover new patterns to succeed in their business ventures. Business Analytics (BA) combines tools, strategies and techniques focused on exploiting the available information, to optimize resources and create useful insight. The current project is framed under Educational Management. A Business Intelligence (BI) architecture and business models taught up to date will be applied with the aim to accelerate the decision-making in academic institutions, strengthen teacher´s skills and ultimately improve the quality of teaching and learning. The best way to achieve this is to follow the Knowledge Discovery in Databases (KDD), one of the best-known methodologies in B.I. This process describes data preparation, selection, and cleansing through to the application of purely Data Mining Techniques in order to incorporate prior knowledge on data sets and interpret accurate solutions from the observed results. The studies will be performed using the information extracted from the Universidad Politécnica de Madrid Learning Management System (LMS), Moodle. The stored data is based on the user-platform interaction. The raw data will be extracted and pre-processed and afterwards, Data Mining Techniques will be applied. One of the crucial factors in the application of Data Mining Techniques is the kind of information that will be processed. For this reason, a new Data Mining perspective will be taken, called Educational Data Mining (EDM). EDM consists of the application of Data Mining Techniques but optimized for the raw data generated by the educational environment. Within EDM, we have decided to drive our research on what is called Predictive Analysis. The main purpose is to understand the influence of the user-platform interactions in the final grades of students and discover new patterns that explain their behaviours. This could allow teachers to intervene ahead of a student passing or failing, in such a way an action could be taken to improve the student performance. All the information processed has been previously anonymized to avoid the invasion of privacy.
Resumo:
Esta tesis presenta el diseño y la aplicación de una metodología que permite la determinación de los parámetros para la planificación de nodos e infraestructuras logísticas en un territorio, considerando además el impacto de estas en los diferentes componentes territoriales, así como en el desarrollo poblacional, el desarrollo económico y el medio ambiente, presentando así un avance en la planificación integral del territorio. La Metodología propuesta está basada en Minería de Datos, que permite el descubrimiento de patrones detrás de grandes volúmenes de datos previamente procesados. Las características propias de los datos sobre el territorio y los componentes que lo conforman hacen de los estudios territoriales un campo ideal para la aplicación de algunas de las técnicas de Minería de Datos, tales como los ´arboles decisión y las redes bayesianas. Los árboles de decisión permiten representar y categorizar de forma esquemática una serie de variables de predicción que ayudan al análisis de una variable objetivo. Las redes bayesianas representan en un grafo acíclico dirigido, un modelo probabilístico de variables distribuidas en padres e hijos, y la inferencia estadística que permite determinar la probabilidad de certeza de una hipótesis planteada, es decir, permiten construir modelos de probabilidad conjunta que presentan de manera gráfica las dependencias relevantes en un conjunto de datos. Al igual que con los árboles de decisión, la división del territorio en diferentes unidades administrativas hace de las redes bayesianas una herramienta potencial para definir las características físicas de alguna tipología especifica de infraestructura logística tomando en consideración las características territoriales, poblacionales y económicas del área donde se plantea su desarrollo y las posibles sinergias que se puedan presentar sobre otros nodos e infraestructuras logísticas. El caso de estudio seleccionado para la aplicación de la metodología ha sido la República de Panamá, considerando que este país presenta algunas características singulares, entra las que destacan su alta concentración de población en la Ciudad de Panamá; que a su vez a concentrado la actividad económica del país; su alto porcentaje de zonas protegidas, lo que ha limitado la vertebración del territorio; y el Canal de Panamá y los puertos de contenedores adyacentes al mismo. La metodología se divide en tres fases principales: Fase 1: Determinación del escenario de trabajo 1. Revisión del estado del arte. 2. Determinación y obtención de las variables de estudio. Fase 2: Desarrollo del modelo de inteligencia artificial 3. Construcción de los ´arboles de decisión. 4. Construcción de las redes bayesianas. Fase 3: Conclusiones 5. Determinación de las conclusiones. Con relación al modelo de planificación aplicado al caso de estudio, una vez aplicada la metodología, se estableció un modelo compuesto por 47 variables que definen la planificación logística de Panamá, el resto de variables se definen a partir de estas, es decir, conocidas estas, el resto se definen a través de ellas. Este modelo de planificación establecido a través de la red bayesiana considera los aspectos de una planificación sostenible: económica, social y ambiental; que crean sinergia con la planificación de nodos e infraestructuras logísticas. The thesis presents the design and application of a methodology that allows the determination of parameters for the planning of nodes and logistics infrastructure in a territory, besides considering the impact of these different territorial components, as well as the population growth, economic and environmental development. The proposed methodology is based on Data Mining, which allows the discovery of patterns behind large volumes of previously processed data. The own characteristics of the territorial data makes of territorial studies an ideal field of knowledge for the implementation of some of the Data Mining techniques, such as Decision Trees and Bayesian Networks. Decision trees categorize schematically a series of predictor variables of an analyzed objective variable. Bayesian Networks represent a directed acyclic graph, a probabilistic model of variables divided in fathers and sons, and statistical inference that allow determine the probability of certainty in a hypothesis. The case of study for the application of the methodology is the Republic of Panama. This country has some unique features: a high population density in the Panama City, a concentration of economic activity, a high percentage of protected areas, and the Panama Canal. The methodology is divided into three main phases: Phase 1: definition of the work stage. 1. Review of the State of the art. 2. Determination of the variables. Phase 2: Development of artificial intelligence model 3. Construction of decision trees. 4. Construction of Bayesian Networks. Phase 3: conclusions 5. Determination of the conclusions. The application of the methodology to the case study established a model composed of 47 variables that define the logistics planning for Panama. This model of planning established through the Bayesian network considers aspects of sustainable planning and simulates the synergies between the nodes and logistical infrastructure planning.
Resumo:
Geographic knowledge discovery (GKD) is the process of extracting information and knowledge from massive georeferenced databases. Usually the process is accomplished by two different systems, the Geographic Information Systems (GIS) and the data mining engines. However, the development of those systems is a complex task due to it does not follow a systematic, integrated and standard methodology. To overcome these pitfalls, in this paper, we propose a modeling framework that addresses the development of the different parts of a multilayer GKD process. The main advantages of our framework are that: (i) it reduces the design effort, (ii) it improves quality systems obtained, (iii) it is independent of platforms, (iv) it facilitates the use of data mining techniques on geo-referenced data, and finally, (v) it ameliorates the communication between different users.
Resumo:
Tese de mestrado, Bioinformática e Biologia Computacional (Bioinformática), Universidade de Lisboa, Faculdade de Ciências, 2016
Resumo:
A major task of traditional temporal event sequence mining is to find all frequent event patterns from a long temporal sequence. In many real applications, however, events are often grouped into different types, and not all types are of equal importance. In this paper, we consider the problem of efficient mining of temporal event sequences which lead to an instance of a specific type of event. Temporal constraints are used to ensure sensibility of the mining results. We will first generalise and formalise the problem of event-oriented temporal sequence data mining. After discussing some unique issues in this new problem, we give a set of criteria, which are adapted from traditional data mining techniques, to measure the quality of patterns to be discovered. Finally we present an algorithm to discover potentially interesting patterns.
Resumo:
Customer Base Analysis is perhaps the first stage of analysis in customer value, aiming to predict purchase frequency and customer lifecycle. An important part of the customer purchase frequency and its retention has to do with the service upgrade. Many models have tried to predict purchase frequency as well as upgrading. The comparison of these models seems important to provide academics with a picture of the current situation. The purpose of this research is to evaluate how models can predict service upgrade among a customer database of an online DVD rental company and suggest an alternative based on data mining techniques and data on historical transactions.
Resumo:
With advances in science and technology, computing and business intelligence (BI) systems are steadily becoming more complex with an increasing variety of heterogeneous software and hardware components. They are thus becoming progressively more difficult to monitor, manage and maintain. Traditional approaches to system management have largely relied on domain experts through a knowledge acquisition process that translates domain knowledge into operating rules and policies. It is widely acknowledged as a cumbersome, labor intensive, and error prone process, besides being difficult to keep up with the rapidly changing environments. In addition, many traditional business systems deliver primarily pre-defined historic metrics for a long-term strategic or mid-term tactical analysis, and lack the necessary flexibility to support evolving metrics or data collection for real-time operational analysis. There is thus a pressing need for automatic and efficient approaches to monitor and manage complex computing and BI systems. To realize the goal of autonomic management and enable self-management capabilities, we propose to mine system historical log data generated by computing and BI systems, and automatically extract actionable patterns from this data. This dissertation focuses on the development of different data mining techniques to extract actionable patterns from various types of log data in computing and BI systems. Four key problems—Log data categorization and event summarization, Leading indicator identification , Pattern prioritization by exploring the link structures , and Tensor model for three-way log data are studied. Case studies and comprehensive experiments on real application scenarios and datasets are conducted to show the effectiveness of our proposed approaches.
Resumo:
The rapid growth of the Internet and the advancements of the Web technologies have made it possible for users to have access to large amounts of on-line music data, including music acoustic signals, lyrics, style/mood labels, and user-assigned tags. The progress has made music listening more fun, but has raised an issue of how to organize this data, and more generally, how computer programs can assist users in their music experience. An important subject in computer-aided music listening is music retrieval, i.e., the issue of efficiently helping users in locating the music they are looking for. Traditionally, songs were organized in a hierarchical structure such as genre->artist->album->track, to facilitate the users’ navigation. However, the intentions of the users are often hard to be captured in such a simply organized structure. The users may want to listen to music of a particular mood, style or topic; and/or any songs similar to some given music samples. This motivated us to work on user-centric music retrieval system to improve users’ satisfaction with the system. The traditional music information retrieval research was mainly concerned with classification, clustering, identification, and similarity search of acoustic data of music by way of feature extraction algorithms and machine learning techniques. More recently the music information retrieval research has focused on utilizing other types of data, such as lyrics, user-access patterns, and user-defined tags, and on targeting non-genre categories for classification, such as mood labels and styles. This dissertation focused on investigating and developing effective data mining techniques for (1) organizing and annotating music data with styles, moods and user-assigned tags; (2) performing effective analysis of music data with features from diverse information sources; and (3) recommending music songs to the users utilizing both content features and user access patterns.
Resumo:
Soft skills and teamwork practices were identi ed as the main de ciencies of recent graduates in computer courses. This issue led to a realization of a qualitative research aimed at investigating the challenges faced by professors of those courses in conducting, monitoring and assessing collaborative software development projects. Di erent challenges were reported by teachers, including di culties in the assessment of students both in the collective and individual levels. In this context, a quantitative research was conducted with the aim to map soft skill of students to a set of indicators that can be extracted from software repositories using data mining techniques. These indicators are aimed at measuring soft skills, such as teamwork, leadership, problem solving and the pace of communication. Then, a peer assessment approach was applied in a collaborative software development course of the software engineering major at the Federal University of Rio Grande do Norte (UFRN). This research presents a correlation study between the students' soft skills scores and indicators based on mining software repositories. This study contributes: (i) in the presentation of professors' perception of the di culties and opportunities for improving management and monitoring practices in collaborative software development projects; (ii) in investigating relationships between soft skills and activities performed by students using software repositories; (iii) in encouraging the development of soft skills and the use of software repositories among software engineering students; (iv) in contributing to the state of the art of three important areas of software engineering, namely software engineering education, educational data mining and human aspects of software engineering.