905 results for Web Mining, Data Mining, User Topic Model, Web User Profiles
Abstract:
Recent advances in machine learning increasingly enable the automatic construction of computer-assisted methods that have been difficult or laborious for human experts to program. Such tools are needed in many areas, here especially in bioinformatics and natural language processing. Machine learning methods may not work satisfactorily unless they are appropriately tailored to the task in question. However, their learning performance can often be improved by exploiting deeper insight into the application domain or the learning problem at hand. This thesis develops kernel-based learning algorithms that incorporate this kind of prior knowledge of the task in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In kernel-based learning methods, prior knowledge is often incorporated by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take into account positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suited to information retrieval and to more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of kernel-based learning algorithms, such as text categorization and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions. We also design a fast cross-validation algorithm for regularized least-squares type learning algorithms. Further, we propose an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks. In summary, we demonstrate that incorporating prior knowledge is possible and beneficial, and that novel advanced kernels and cost functions can be used efficiently in algorithms.
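The abstract does not describe the thesis's own cross-validation algorithm, so the following is only a minimal sketch of a well-known shortcut for regularized least-squares: leave-one-out residuals can be obtained from a single fit by dividing each training residual by one minus the corresponding hat-matrix diagonal. The single-feature, no-intercept setting and the names `fit_rls`, `loo_residuals_fast`, and `loo_residuals_naive` are assumptions made for illustration.

```python
def fit_rls(xs, ys, lam):
    """Regularized least-squares weight for one feature, no intercept."""
    s = sum(x * x for x in xs) + lam
    return sum(x * y for x, y in zip(xs, ys)) / s


def loo_residuals_fast(xs, ys, lam):
    """Leave-one-out residuals from ONE fit, using the hat-matrix diagonal."""
    s = sum(x * x for x in xs) + lam
    w = fit_rls(xs, ys, lam)
    # h_i = x_i^2 / s is the i-th diagonal entry of the hat matrix.
    return [(y - w * x) / (1.0 - x * x / s) for x, y in zip(xs, ys)]


def loo_residuals_naive(xs, ys, lam):
    """Leave-one-out residuals by refitting n times (reference only)."""
    out = []
    for i in range(len(xs)):
        xr = xs[:i] + xs[i + 1:]
        yr = ys[:i] + ys[i + 1:]
        out.append(ys[i] - fit_rls(xr, yr, lam) * xs[i])
    return out


# Hypothetical toy data, not taken from the thesis.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
fast = loo_residuals_fast(xs, ys, lam=0.5)
naive = loo_residuals_naive(xs, ys, lam=0.5)
```

The two lists agree up to rounding, but the shortcut needs one fit instead of one fit per held-out example.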
Abstract:
The article discusses the development of WEBDATANET, established in 2011, which aims to create a multidisciplinary network of web-based data collection experts in Europe. Topics include the presence of 190 experts in 30 European countries and abroad, the establishment of web-based teaching and discussion platforms, and working groups and task forces. Also discussed is the scope of the research carried out by WEBDATANET. In light of the growing importance of web-based data in the social and behavioral sciences, WEBDATANET was established in 2011 as a COST Action (IS 1004) to create a multidisciplinary network of web-based data collection experts: (web) survey methodologists, psychologists, sociologists, linguists, economists, Internet scientists, and media and public opinion researchers. The aim was to accumulate and synthesize knowledge regarding methodological issues of web-based data collection (surveys, experiments, tests, non-reactive data, and mobile Internet research), and to foster its scientific usage in a broader community.
Abstract:
Occupational hygiene practitioners typically assess the risk posed by occupational exposure by comparing exposure measurements to regulatory occupational exposure limits (OELs). In most jurisdictions, OELs are only available for exposure by the inhalation pathway. Skin notations are used to indicate substances for which dermal exposure may lead to health effects. However, these notations are either present or absent and provide no indication of acceptable levels of exposure. Furthermore, the methodology and framework for assigning skin notation differ widely across jurisdictions resulting in inconsistencies in the substances that carry notations. The UPERCUT tool was developed in response to these limitations. It helps occupational health stakeholders to assess the hazard associated with dermal exposure to chemicals. UPERCUT integrates dermal quantitative structure-activity relationships (QSARs) and toxicological data to provide users with a skin hazard index called the dermal hazard ratio (DHR) for the substance and scenario of interest. The DHR is the ratio between the estimated 'received' dose and the 'acceptable' dose. The 'received' dose is estimated using physico-chemical data and information on the exposure scenario provided by the user (body parts exposure and exposure duration), and the 'acceptable' dose is estimated using inhalation OELs and toxicological data. The uncertainty surrounding the DHR is estimated with Monte Carlo simulation. Additional information on the selected substances includes intrinsic skin permeation potential of the substance and the existence of skin notations. UPERCUT is the only available tool that estimates the absorbed dose and compares this to an acceptable dose. In the absence of dermal OELs it provides a systematic and simple approach for screening dermal exposure scenarios for 1686 substances.
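As a rough illustration of the ratio described above, the sketch below computes a DHR as received dose divided by acceptable dose and propagates uncertainty with a simple Monte Carlo loop. It is not UPERCUT's actual model: the lognormal spread, the parameter names, and the numbers are all hypothetical assumptions.

```python
import math
import random


def dermal_hazard_ratio(received_dose, acceptable_dose):
    """DHR = estimated 'received' dose / 'acceptable' dose (same units)."""
    if acceptable_dose <= 0:
        raise ValueError("acceptable dose must be positive")
    return received_dose / acceptable_dose


def simulate_dhr(received_median, acceptable_median, log_sd, n=10000, seed=0):
    """Monte Carlo sample of the DHR, assuming (hypothetically) that both
    doses are lognormal around their medians with the same log-scale spread."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        received = received_median * math.exp(rng.gauss(0.0, log_sd))
        acceptable = acceptable_median * math.exp(rng.gauss(0.0, log_sd))
        samples.append(dermal_hazard_ratio(received, acceptable))
    samples.sort()
    return samples


def percentile(sorted_samples, q):
    """Empirical quantile of an already sorted sample, q in [0, 1]."""
    idx = min(len(sorted_samples) - 1, int(q * len(sorted_samples)))
    return sorted_samples[idx]


# Hypothetical doses in arbitrary but matching units.
samples = simulate_dhr(received_median=2.0, acceptable_median=4.0, log_sd=0.5)
low, mid, high = (percentile(samples, q) for q in (0.05, 0.5, 0.95))
```

Reporting an interval such as `low` to `high` alongside the median is one way to show the uncertainty around a screening-level ratio.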
Abstract:
Objective: To construct a Portuguese-language index of information on the practice of diagnostic radiology in order to improve the standardization of medical language and terminology. Materials and Methods: A total of 61,461 definitive reports were collected from the database of the Radiology Information System at Hospital das Clínicas – Faculdade de Medicina de Ribeirão Preto (RIS/HCFMRP) as follows: 30,000 chest x-ray reports; 27,000 mammography reports; and 4,461 thyroid ultrasonography reports. Text mining was applied to select terms, and the ANSI/NISO Z39.19-2005 standard was used to construct the index based on a thesaurus structure. The system was created in HTML. Results: The text mining yielded a set of 358,236 words (100%). Out of this total, 76,347 terms (21%) were selected to form the index. These terms refer to anatomical pathology description, imaging techniques, equipment, type of study, and some other composite terms. The index system was developed with 78,538 HTML web pages. Conclusion: Applying text mining to a database of radiological reports allowed the construction of a lexical system in the Portuguese language that is consistent with clinical practice in radiology.
Abstract:
In this thesis we study the field of opinion mining by giving a comprehensive review of the research available on this topic. Using this knowledge, we also present a case study of a multilevel opinion mining system for a student organization's sales management system. We describe the field of opinion mining by discussing its historical roots, its motivations and applications, and the different scientific approaches that have been used to solve the challenging problem of mining opinions. To deal with this large subfield of natural language processing, we first give an abstraction of the problem of opinion mining and describe the theoretical frameworks available for dealing with appraisal language. Then we discuss the relation between opinion mining and computational linguistics, which provides a pre-processing step that is crucial for the accuracy of the subsequent steps of opinion mining. The second part of the thesis deals with the semantics of opinions. Here we describe the different ways of collecting lists of opinion words, as well as the methods and techniques available for extracting knowledge from opinions present in unstructured textual data. For collecting lists of opinion words, we describe manual, semi-manual, and automatic approaches and review the available lists that are used as gold standards in opinion mining research. For the methods and techniques of opinion mining, we divide the task into three levels: the document, sentence, and feature levels. The techniques presented at the document and sentence levels are divided into supervised and unsupervised approaches, which are used to determine the subjectivity and polarity of texts and sentences at these levels of analysis. At the feature level we describe the techniques available for finding the opinion targets, the polarity of the opinions about these targets, and the opinion holders. Also at the feature level, we discuss the various ways to summarize and visualize the results of this level of analysis. In the third part of the thesis we present a case study of a sales management system that uses free-form text and that can benefit from an opinion mining system. Using the knowledge gathered in the review of the field, we propose a theoretical multilevel opinion mining system (MLOM) that can perform most of the tasks required of an opinion mining system. Based on previous research, we suggest that such a system could take over many of the laborious market research tasks now done by the sales force that uses this sales management system, improving the sales force's insight into its partners and thereby the quality of its sales services and its overall results.
Abstract:
Marketing scholars have suggested a need for more empirical research on consumer response to malls, in order to better understand the variables that explain consumer behavior. The CHAID (chi-square automatic interaction detection) segmentation methodology was used to identify the profiles of consumers with regard to their activities at malls, on the basis of socio-demographic variables and behavioral variables (how and with whom they go to the malls). A sample of 790 subjects answered an online questionnaire. Among the variables analyzed, the transport used to go shopping and the frequency of visits to malls are the main predictors of behavior in malls. The results provide guidelines for developing effective strategies to attract consumers to malls and retain them there.
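CHAID-style segmentation relies on chi-square tests of independence between each candidate predictor and the outcome. The sketch below shows only that building block, the Pearson chi-square statistic for a contingency table, and uses it to compare two hypothetical predictors with tables of the same shape. Full CHAID also merges categories and compares adjusted p-values, which is omitted here, and the counts are invented for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are predictor categories, columns are two
# mall activities. These numbers are NOT from the study.
candidates = {
    "transport": [[10, 20], [20, 10]],
    "gender": [[15, 15], [15, 15]],
}
scores = {name: chi_square(table) for name, table in candidates.items()}
# With equal table shapes, a larger statistic means a stronger association.
best = max(scores, key=scores.get)
```

Here `best` is `"transport"`, because the second table shows no association at all.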
Abstract:
In the previous issue of IJEMR, we introduced the general framework and the main ideas justifying this special editorial project. To avoid repetition of the background themes to the current issue, the reader should consult the previous edition. Here, we present the second part of contributions selected for publication.
Abstract:
With this editorial project, our aim is to promote a key field of research in today's marketing, namely the evolution of the e-marketing mindset toward the new social web model.
Abstract:
Raw measurement data does not always immediately convey useful information, but applying statistical analysis tools to measurement data can improve the situation. Data analysis offers benefits such as gaining meaningful insight from the dataset, basing critical decisions on the findings, and ruling out human bias through proper statistical treatment. In this thesis we analyze data from an industrial mineral processing plant with the aim of studying whether the quality of the final product, given by one variable, can be forecast with a model based on the other variables. For the study, tools such as Qlucore Omics Explorer (QOE) and sparse Bayesian regression (SB) are used. Linear regression is then used to build a model based on the subset of variables that appear to have the most significant weights in the SB model. The results obtained from QOE show that the variable representing the desired final product does not correlate with the other variables. The SB and linear regression models built on 1-day averaged data seriously underestimate the variance of the true data, whereas the two models built on 1-month averaged data are reliable and explain a larger proportion of the variability in the available data, making them suitable for prediction. However, it is concluded that no single model fits the whole available dataset well. For future work it is therefore proposed either to build piecewise nonlinear regression models if the same dataset is used, or for the plant to provide another dataset, collected in a more systematic fashion than the present data, for further analysis.
Abstract:
App Engine is short for the English terms application and engine. It is a commercial service implemented by Google, Inc. that follows the principles of cloud computing and enables customers to develop their own applications. A service of one's own design can be programmed into the system for use over the Internet, either privately or publicly. It is thus a distributed server system that provides an application platform which adapts dynamically to load and in which the customer does not rent virtual machines. The storage capacity offered by the system is likewise available flexibly. The bachelor's thesis itself examines in more detail how an application is implemented on the service, along with the service's limitations and suitability. It begins by reviewing the concept of the cloud, of which many computer users have an unclear understanding. Different configurations can be built in a great many ways, of which we restrict the discussion to feasible general solutions.
Abstract:
Browsing the web has become one of the most important features of high-end mobile phones, and in the future more and more people will use mobile phones for web browsing. Large touchscreens improve the browsing experience, but many web sites are designed to be used with a mouse. A touchscreen differs substantially from a mouse as a pointing device, and therefore mouse emulation logic is required in browsers to make more web sites usable. This Master's thesis lists the most significant cases where the differences between a mouse and a touchscreen affect web browsing. Five touchscreen mobile phones and their web browsers were evaluated to find out if and how they handle these cases. Also as part of this thesis, a simple QtWebKit-based mobile web browser with an advanced mouse emulation model was implemented, aiming to solve all the problematic cases. The conclusion of this work is that it is feasible to emulate a mouse with a touchscreen and thus deliver a good user experience in mobile web browsing. However, current high-end touchscreen mobile phones have relatively underdeveloped mouse emulation in their web browsers, and there is a lot to improve.
Abstract:
This work is devoted to the analysis of signal variation in Cross-Direction and Machine-Direction measurements from a paper web. The data come from a real paper machine. The goal of the work is to reconstruct the basis weight structure of the paper and to predict its future behaviour. The resulting synthetic data are needed for simulating the paper web. The main tool used for describing the basis weight variation in the Cross-Direction is the Empirical Orthogonal Functions (EOF) algorithm, which is closely related to Principal Component Analysis (PCA). Signal forecasting in time is based on time-series analysis. The two principal mathematical procedures used in the work are Autoregressive-Moving Average (ARMA) modelling and the Ornstein–Uhlenbeck (OU) process.
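To make the OU process mentioned above concrete, here is a minimal simulation sketch using the exact one-step transition of a mean-reverting OU process. The parameter values and the function name `simulate_ou` are illustrative assumptions. The abstract does not say how the OU model was fitted or applied to the paper-web signals.

```python
import math
import random


def simulate_ou(x0, theta, mu, sigma, dt, n_steps, seed=0):
    """Simulate dX = theta * (mu - X) dt + sigma dW using the exact
    Gaussian transition over a step of length dt (requires theta > 0)."""
    rng = random.Random(seed)
    decay = math.exp(-theta * dt)
    # Standard deviation of the noise accumulated over one step.
    noise_sd = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * theta))
    path = [x0]
    for _ in range(n_steps):
        previous = path[-1]
        path.append(mu + (previous - mu) * decay + noise_sd * rng.gauss(0.0, 1.0))
    return path


# Hypothetical parameters: a signal that starts away from its long-run
# mean and is pulled back toward it while being perturbed by noise.
noisy_path = simulate_ou(x0=5.0, theta=0.8, mu=1.0, sigma=0.3, dt=0.1, n_steps=200)
# With sigma = 0 the path decays deterministically toward mu.
smooth_path = simulate_ou(x0=5.0, theta=0.8, mu=1.0, sigma=0.0, dt=0.1, n_steps=200)
```

Sampled at equal time steps, this recursion has the form of a first-order autoregressive model, which is one way the OU process connects to ARMA modelling.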
Abstract:
This study aimed to identify different conditions of coffee plants after the harvesting period, using data mining and spectral behavior profiles from the Hyperion/EO1 sensor. The Hyperion image, with a spatial resolution of 30 m, was acquired on August 28, 2008, at the end of the coffee harvest season in the studied area. For image pre-processing, atmospheric and signal/noise effect corrections were carried out using the FLAASH and MNF (Minimum Noise Fraction Transform) algorithms, respectively. Spectral behavior profiles (38) of different coffee varieties were generated from 150 Hyperion bands. The spectral behavior profiles were analyzed with the Expectation-Maximization (EM) algorithm considering 2, 3, 4, and 5 clusters. A t-test at the 5% significance level was used to verify the similarity among the wavelength cluster means. The results demonstrated that it is possible to separate five different clusters, which corresponded to different coffee crop conditions, making it possible to improve future intervention actions.
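The EM clustering step can be illustrated with a deliberately small sketch: a one-dimensional Gaussian mixture fitted by alternating E- and M-steps. The study clustered multi-band spectral profiles, so the single feature, the toy values, and the initialization used here are assumptions for illustration only.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n, k = len(data), len(initial_means)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = list(initial_means)
    variances = [max(overall_var, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsed component
    return weights, means, variances


# Hypothetical reflectance-like values forming two groups.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances = em_gmm_1d(data, initial_means=[min(data), max(data)])
```

Running the same routine with different numbers of initial means mirrors the idea of comparing solutions with 2 to 5 clusters.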
Abstract:
Locomotor problems prevent birds from moving freely, jeopardizing welfare and productivity, besides causing injuries to the legs of chickens. The objective of this study was to evaluate, using data mining, the influence of age, vitamin D use, limb asymmetry, and gait score on the degree of leg injuries in broilers. The analysis was performed on a data set obtained from a field experiment that used two groups of 30 birds each: a control group and a group treated with vitamin D. The gait score, the asymmetry between the right and left toes, and the degree of leg injuries were evaluated. The Weka® software was used for data mining. In particular, the C4.5 algorithm (known as J48 in the Weka environment) was used to generate a decision tree. The results showed that age is the factor that most influences the degree of leg injuries and that the gait score assessments were not reliable for estimating leg weakness in broilers.
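C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below computes that criterion for categorical attributes only. The records are hypothetical, and the code is not Weka's J48 implementation, which also handles numeric thresholds, missing values, and pruning.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Information gain of splitting on the attribute, divided by split info."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels that remains after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical records, NOT data from the experiment.
injury = ["mild", "mild", "severe", "severe"]
age_group = ["young", "young", "old", "old"]
vitamin_d = ["yes", "no", "yes", "no"]
ratios = {
    "age_group": gain_ratio(age_group, injury),
    "vitamin_d": gain_ratio(vitamin_d, injury),
}
root_attribute = max(ratios, key=ratios.get)
```

In this toy example `age_group` separates the injury classes perfectly and would be chosen as the root split, while `vitamin_d` carries no information about them.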