838 results for text and data mining


Relevance:

100.00%

Abstract:

The main objective of this Master's thesis is to learn more about Girona’s image as a tourism destination from the perspective of different agents and to study how their promotion or opinions differ. To meet this objective, three components of Girona’s destination image are studied: the attribute-based component, the holistic component, and the affective component. Much research has been done on tourism destination image, but far less on Girona as a destination. Some studies have already focused on Girona as a tourist destination, but they used a different type of sample and different methodological steps. This study is new among destination studies in that it is based only on textual online data and follows a methodology based on text mining. Text mining is a methodology for extracting relevant information from texts. Once this information has been extracted, multivariate statistical analyses are carried out with the aim of discovering more about Girona’s tourism image.

Relevance:

100.00%

Abstract:

Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. More specifically, the work described here aims to extract information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach combines full parsing (syntactic analysis of the entire structure of sentences) with machine learning, aiming to develop reliable methods that can be generalized to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unifying diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.

Relevance:

100.00%

Abstract:

Abstract: This seminar is a research discussion around a very interesting problem, which may be a good basis for a WAISfest theme. A little over a year ago Professor Alan Dix came to tell us of his plans for a magnificent adventure: to walk all of the way round Wales, 1000 miles, as 'Alan Walks Wales'. The walk was a personal journey, but also a technological and community one, exploring the needs of the walker and the people along the way. Whilst walking he recorded his thoughts in an audio diary, took lots of photos, wrote a blog and collected data from the tech instruments he was wearing. As a result Alan has extensive quantitative data (bio-sensing and location) and qualitative data (text, images and some audio). There are challenges in analysing individual kinds of data, including merging similar data streams, entity identification, time-series and textual data mining, dealing with provenance, and ontologies for paths and journeys. There are also challenges for author and third-party annotation, linking the data-sets and visualising the merged narrative or facets of it.

Relevance:

100.00%

Abstract:

Title: Data-Driven Text Generation using Neural Networks Speaker: Pavlos Vougiouklis, University of Southampton Abstract: Recent work on neural networks shows their great potential for tackling a wide variety of Natural Language Processing (NLP) tasks. This talk will focus on the Natural Language Generation (NLG) problem and, more specifically, on the extent to which neural network language models could be employed for context-sensitive and data-driven text generation. In addition, a neural network architecture for response generation in social media will be discussed, along with the training methods that enable it to capture contextual information and effectively participate in public conversations. Speaker Bio: Pavlos Vougiouklis obtained his 5-year Diploma in Electrical and Computer Engineering from the Aristotle University of Thessaloniki in 2013. He was awarded an MSc degree in Software Engineering from the University of Southampton in 2014. In 2015, he joined the Web and Internet Science (WAIS) research group of the University of Southampton, where he is currently working towards a PhD in the field of Neural Network Approaches for Natural Language Processing.
Title: Provenance is Complicated and Boring - Is there a solution? Speaker: Darren Richardson, University of Southampton Abstract: Paper trails, auditing, and accountability: arguably not the sexiest terms in computer science. But then you discover that you've possibly been eating horse-meat, and the importance of provenance becomes almost palpable. Having accepted that we should be creating provenance-enabled systems, the challenge of then communicating that provenance to casual users is not trivial: users should not have to have a detailed working knowledge of your system, and they certainly shouldn't be expected to understand the data model. So how, then, do you give users an insight into the provenance, without having to build a bespoke system for each and every different provenance installation? Speaker Bio: Darren is a final year Computer Science PhD student. He completed his undergraduate degree in Electronic Engineering at Southampton in 2012.

Relevance:

100.00%

Abstract:

OBJECTIVES: The prediction of protein structure and the precise understanding of protein folding and unfolding processes remain among the greatest challenges in structural biology and bioinformatics. Computer simulations based on molecular dynamics (MD) are at the forefront of the effort to gain a deeper understanding of these complex processes. Currently, these MD simulations are usually on the order of tens of nanoseconds, generate a large amount of conformational data and are computationally expensive. More and more groups run such simulations and generate a myriad of data, which raises new challenges in managing and analyzing these data. Because of the vast range of proteins researchers want to study and simulate, the computational effort needed to generate data, the large data volumes involved, and the different types of analyses scientists need to perform, it is desirable to provide a public repository allowing researchers to pool and share protein unfolding data. METHODS: To adequately organize, manage, and analyze the data generated by unfolding simulation studies, we designed a data warehouse system that is embedded in a grid environment to facilitate the seamless sharing of available computer resources and thus enable many groups to share complex molecular dynamics simulations on a more regular basis. RESULTS: To gain insight into the conformational fluctuations and stability of the monomeric forms of the amyloidogenic protein transthyretin (TTR), molecular dynamics unfolding simulations of the monomer of human TTR have been conducted. Trajectory data and meta-data of the wild-type (WT) protein and the highly amyloidogenic variant L55P-TTR represent the test case for the data warehouse. CONCLUSIONS: Web and grid services, especially pre-defined data mining services that can run on or 'near' the data repository of the data warehouse, are likely to play a pivotal role in the analysis of molecular dynamics unfolding data.

Relevance:

100.00%

Abstract:

In this article, we review the state-of-the-art techniques in mining data streams for mobile and ubiquitous environments. We start the review with a concise background of data stream processing, presenting the building blocks for mining data streams. In a wide range of applications, data streams must be processed on small ubiquitous devices such as smartphones and sensor devices. Mobile and ubiquitous data mining targets these applications with tailored techniques and approaches that address scarcity of resources and mobility issues. Two categories can be identified for mobile and ubiquitous mining of streaming data: single-node and distributed. This survey covers both categories. Mining mobile and ubiquitous data requires algorithms that can monitor their working conditions and adapt them to the available computational resources. We identify the key characteristics of these algorithms and present illustrative applications. Distributed data stream mining in the mobile environment is then discussed, presenting the Pocket Data Mining framework. Mobility of users stimulates the adoption of context-awareness in this area of research. Context-awareness and collaboration are discussed in the context of Collaborative Data Stream Mining, where agents share knowledge to learn adaptive, accurate models.

Relevance:

100.00%

Abstract:

The main purpose of this thesis project is to predict symptom severity and cause from test-battery data of Parkinson’s disease patients, using data mining. The data were collected with a test battery on a hand-held computer. We use the Chi-Square method to check which variables are important and which are not. We then normalize the data, apply different data mining techniques, and check which technique gives good results. The implementation is in WEKA. The methods used are Naïve Bayes, CART and KNN. We use Bland-Altman plots and Spearman’s correlation to check the final results and predictions. The Bland-Altman plot shows what percentage of the data falls within our confidence level, and Spearman’s correlation indicates that the relationship is strong. On the basis of the results and analysis, all three methods give nearly the same results. However, CART (J48 Decision Tree) gives good results, with under-predicted and over-predicted values lying between -2 and +2. The correlation between the actual and predicted values is 0.794 for CART. Cause gives a better classification percentage than disability because it uses two classes.
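As a rough illustration of the two checks named in this abstract, the following Python sketch computes Spearman's rank correlation and Bland-Altman limits of agreement between actual and predicted scores. It is not the thesis's WEKA workflow: the function names and the 1.96 standard-deviation limits are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def bland_altman(actual, predicted):
    """Return (bias, lower limit, upper limit) of agreement.

    Limits are bias +/- 1.96 sample standard deviations of the
    differences (an assumed, conventional choice).
    """
    diffs = [p - a for a, p in zip(actual, predicted)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread
```

With hypothetical actual and predicted severity scores, `spearman(actual, predicted)` gives the rank correlation, and `bland_altman(actual, predicted)` gives the mean over- or under-prediction together with the band in which most differences are expected to fall.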

Relevance:

100.00%

Abstract:

An Automatic Vehicle Location (AVL) system is a computer-based vehicle tracking system that is capable of determining a vehicle's location in real time. As a major technology of the Advanced Public Transportation System (APTS), AVL systems have been widely deployed by transit agencies for purposes such as real-time operation monitoring, computer-aided dispatching, and arrival time prediction. AVL systems make available a large amount of transit performance data that are valuable for transit performance management and planning purposes. However, the difficulties of extracting useful information from the huge spatial-temporal database have hindered off-line applications of the AVL data. In this study, a data mining process, including data integration, cluster analysis, and multiple regression, is proposed. The AVL-generated data are first integrated into a Geographic Information System (GIS) platform. The model-based cluster method is employed to investigate the spatial and temporal patterns of transit travel speeds, which may be easily translated into travel time. The transit speed variations along the route segments are identified. Transit service periods such as morning peak, mid-day, afternoon peak, and evening periods are determined based on analyses of transit travel speed variations for different times of day. The seasonal patterns of transit performance are investigated by using the analysis of variance (ANOVA). Travel speed models based on the clustered time-of-day intervals are developed using important factors identified as having significant effects on speed for different time-of-day periods. It was found that transit performance varied across seasons and time-of-day periods. The geographic location of a transit route segment also plays a role in the variation of the transit performance.
The results of this research indicate that advanced data mining techniques have good potential to provide automated means of assisting transit agencies in service planning, scheduling, and operations control.
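To make the ANOVA step mentioned above concrete, here is a minimal Python sketch of the one-way ANOVA F statistic, which compares variation between group means with variation inside groups. The function name and the idea of grouping travel speeds by season are illustrative assumptions; the study's actual data and software are not described in the abstract.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total observations
    grand_mean = mean(v for g in groups for v in g)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

Called with one list of hypothetical travel speeds per season, a large F value suggests that mean speed differs between seasons by more than the within-season scatter would explain; a significance test would additionally need the F distribution with (k - 1, N - k) degrees of freedom.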

Relevance:

100.00%

Abstract:

With the proliferation of multimedia data and ever-growing requests for multimedia applications, there is an increasing need for efficient and effective indexing, storage and retrieval of multimedia data, such as graphics, images, animation, video, audio and text. Due to the special characteristics of multimedia data, Multimedia Database Management Systems (MMDBMSs) have emerged and attracted great research attention in recent years. Though much research effort has been devoted to this area, it is still far from mature and many open issues remain. This dissertation focuses on three of the essential challenges in developing an MMDBMS, namely the semantic gap, perception subjectivity and data organization, and proposes a systematic and integrated framework, with a video database and an image database serving as the testbed. In particular, the framework addresses these challenges separately yet coherently from three main aspects of an MMDBMS: multimedia data representation, indexing and retrieval. In terms of multimedia data representation, the key to addressing the semantic gap is to intelligently and automatically model the mid-level representation and/or semi-semantic descriptors in addition to extracting the low-level media features. The data organization challenge is mainly addressed through media indexing, where various levels of indexing are required to support diverse query requirements. In particular, the focus of this study is to facilitate high-level video indexing by proposing a multimodal event mining framework associated with temporal knowledge discovery approaches. With respect to the perception subjectivity issue, advanced techniques are proposed to support users' interaction and to effectively model users' perception from feedback at both the image level and the object level.

Relevance:

100.00%

Abstract:

In the last decade, large numbers of social media services have emerged and been widely used in people's daily lives as important tools for sharing and acquiring information. With a substantial amount of user-contributed text data on social media, it has become necessary to develop methods and tools for analysing this emerging data in order to deliver meaningful information to users. Previous work on text analytics over the last several decades has mainly focused on traditional types of text such as emails, news and academic literature, and several critical issues for text data on social media have not been well explored: 1) how to detect sentiment from text on social media; 2) how to make use of social media's real-time nature; 3) how to address information overload for flexible information needs. In this dissertation, we focus on these three problems. First, to detect the sentiment of text on social media, we propose a dual active supervision method based on non-negative matrix tri-factorization (tri-NMF) to minimize human labeling effort for the new type of data. Second, to make use of social media's real-time nature, we propose approaches to detect events from text streams on social media. Third, to address information overload for flexible information needs, we propose two summarization frameworks: a dominating set based summarization framework and a learning-to-rank based summarization framework. The dominating set based framework can be applied to different types of summarization problems, while the learning-to-rank based framework helps utilize existing training data to guide new summarization tasks. In addition, we integrate these techniques in an application study of event summarization for sports games as an example of how to better utilize social media data.
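The idea behind a dominating set based summary can be sketched in a few lines of Python: treat sentences as graph nodes, connect sentences that are similar, and pick a small set of sentences such that every sentence is either selected or similar to a selected one. The sketch below is a generic greedy approximation, not the dissertation's method; the Jaccard word-overlap similarity, the `threshold` value and the function names are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dominating_set_summary(sentences, threshold=0.3):
    """Greedily pick sentences until every sentence is covered.

    A sentence is covered when it is selected or is similar (at or above
    `threshold`) to a selected sentence.
    """
    n = len(sentences)
    # Each sentence covers itself and every sufficiently similar sentence.
    neighbours = [
        {j for j in range(n)
         if j == i or jaccard(sentences[i], sentences[j]) >= threshold}
        for i in range(n)
    ]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Take the sentence that covers the most still-uncovered sentences.
        best = max(range(n), key=lambda i: len(neighbours[i] & uncovered))
        chosen.append(best)
        uncovered -= neighbours[best]
    # Return the summary in original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Given three hypothetical posts, two of which report the same goal and one of which reports a rain delay, the function keeps one goal sentence and the rain sentence. Raising `threshold` makes the graph sparser and the summary longer.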

Relevance:

100.00%

Abstract:

The Buchans ore bodies of central Newfoundland represent some of the highest grade VMS deposits ever mined. These Kuroko-type deposits are also known for the well developed and preserved nature of the mechanically transported deposits. The deposits are hosted in Cambro-Ordovician, dominantly calc-alkaline, bimodal volcanic and epiclastic sequences of the Notre Dame Subzone, Newfoundland Appalachians. Stratigraphic relationships in this zone are complicated by extensively developed, brittle-dominated Silurian thrust faulting. Hydrothermal alteration of host rocks is a common feature of nearly all VMS deposits, and the recognition of these zones has been a key exploration tool. Alteration of host rocks has long been described as spatially associated with the Buchans ore bodies, most notably with the larger in-situ deposits. This report represents a base-line study in which a complete documentation of the geochemical variance, in terms of both primary (igneous) and alteration effects, is presented from altered volcanic rocks in the vicinity of the Lucky Strike deposit (LSZ), the largest in-situ deposit in the Buchans camp. Packages of altered rocks also occur away from the immediate mining areas and constitute new targets for exploration. These zones, identified mostly by recent and previous drilling, represent untested targets and include the Powerhouse (PHZ), Woodmans Brook (WBZ) and Airport (APZ) alteration zones, as well as the Middle Branch alteration zone (MBZ), which represents a more distal alteration facies related to Buchans ore-formation. Data from each of these zones were compared to those from the LSZ in order to evaluate their relative prospectivity. Derived lithogeochemical data served two functions: to define (i) primary (igneous) trends and (ii) secondary alteration trends. Primary trends were established using immobile, or conservative, elements (i.e., HFSE, REE, Th, TiO₂, Al₂O₃, P₂O₅).
From these, altered volcanic rocks were interpreted in terms of composition (e.g., basalt - rhyodacite) and magmatic affinity (e.g., calc-alkaline vs. tholeiitic). The information suggests that bimodality is a common feature of all zones, with most rocks plotting as either basalt/andesite or dacite (or rhyodacite); andesitic sensu stricto compositions are rare. Magmatic affinities are more varied and complex, but indicate that all units are arc volcanic sequences. Rocks from the LSZ/MBZ represent a transitional to calc-alkalic sequence; however, a slight shift in key geochemical discriminants occurs from the foot-wall to the hanging-wall. Specifically, mafic and felsic lavas of the foot-wall are of transitional (or mildly calc-alkaline) affinity, whereas the hanging-wall rocks are relatively more strongly calc-alkaline, as indicated by enriched LREE/HREE and higher ZrN, NbN and other ratios in the latter. The geochemical variations also serve as a means to separate the units (at least the felsic rocks) into hanging-wall and foot-wall sequences, therefore providing a valuable exploration tool. Volcanic rocks from the WBZ/PHZ (and probably the APZ) are more typical of tholeiitic to transitional suites, yielding flatter mantle-normalized REE patterns and lower ZrN ratios. Thus, the relationships between the immediate mining area (represented by LSZ/MBZ) and the Buchans East (PHZ/WBZ) and the APZ are uncertain. Host rocks for all zones consist of mafic to felsic volcanic rocks, though the proportion of pyroclastic and epiclastic rocks is greatest at the LSZ. Phenocryst assemblages and textures are common in all zones, with minor exceptions, and are not useful for discrimination purposes. Felsic rocks from all zones are dominated by sericite-clay +/- silica alteration, whereas mafic rocks are dominated by chlorite-quartz-sericite alteration. Pyrite is ubiquitous in all moderately altered rocks and minor associated base metal sulphides occur locally.
The exception is at Lucky Strike, where stockwork quartz veining contains abundant base-metal mineralization and barite. Rocks composed entirely of chlorite (chloritite) also occur in the LSZ foot-wall. In addition, K-feldspar alteration occurs in felsic volcanic rocks at the MBZ associated with Zn-Pb-Ba and, notably, without chlorite. This zone represents a peripheral, but proximal, zone of alteration induced by lower temperature hydrothermal fluids, presumably with little influence from seawater. Alteration geochemistry was interpreted from raw data as well as from mass balanced (recalculated) data derived from immobile element pairs. The data from the LSZ/MBZ indicate a range in the degree of alteration from only minor to severe modification of precursor compositions. Ba tends to show a strong positive correlation with K₂O, although most Ba occurs as barite. With respect to mass changes, Al₂O₃, TiO₂ and P₂O₅ were shown to be immobile. Nearly all rocks display mass loss of Na₂O, CaO, and Sr, reflecting feldspar destruction. These trends are usually mirrored by K₂O-Rb and MgO addition, indicating sericitic and chloritic alteration, respectively. More substantial gains of K₂O often occur in rocks with K-feldspar alteration, whereas a few samples also displayed excessive MgO enrichment and represent chloritites. Fe₂O₃ indicates both chlorite and sulphide formation. SiO₂ addition is almost always the case for the altered mafic rocks, as silica often infills amygdules and replaces the finer tuffaceous material. The felsic rocks display more variability in SiO₂. Silicic, sericitic and chloritic alteration trends were observed from the other zones, but not K-feldspar, chloritite, or barite.
Microprobe analysis of chlorites, sericites and carbonates indicates: (i) sericites from all zones are defined as muscovite and are not phengitic; (ii) at the LSZ, chlorites ranged from Fe-Mg chlorites (pycnochlorite) to Mg-rich chlorite (penninite), with the latter occurring in the stockwork zone and more proximal alteration facies; (iii) chlorites from the WBZ were typical of those from the more distal alteration facies of the LSZ, plotting as ripidolite to pycnochlorite; (iv) conversely, chlorites from the PHZ plot with Mg-Al-rich compositions (clinochlore to penninite); and (v) carbonate species from each zone are also varied, with calcite occurring in each zone, in addition to dolomite and ankerite in the PHZ and WBZ, respectively. Lead isotope ratios for galena separates from the various zones, when combined with data from older studies, tend to cluster into four distinctive fields. Overall, the data plot on a broad mixing line and indicate evolution in a relatively low-μ environment. Data from sulphide stringers in altered MBZ rocks, as well as from clastic sulphides (Sandfill prospect), plot in the Buchans ore field, as do the data for galena from altered rocks in the APZ. Samples from the Buchans East area are even more primitive than the Buchans ores, with lead from the PHZ plotting with the Connel Option prospect and data from the WBZ matching that of the Skidder prospect. A sample from a newly discovered debris flow-type sulphide occurrence (Middle Branch East) yields lead isotope ratios that are slightly more radiogenic than Buchans and plot with the Mary March alteration zone. Data within each cluster are interpreted to represent derivation from individual hydrothermal systems in which metals were derived from a common source.

Relevance:

100.00%

Abstract:

Peer reviewed
