108 resultados para Text mining, Classificazione, Stemming, Text categorization


Relevância:

50.00% 50.00%

Publicador:

Resumo:

Description of a patient's injuries is recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families such as decision tree, probabilistic, neural networks, and instance-based, ensemble-based and kernel-based linear classifiers. An extensive pre-processing is carried out to ensure the quality of data and, in hence, the quality classification outcome. The records with a null entry in injury description are removed. The misspelling correction process is carried out by finding and replacing the misspelt word with a soundlike word. Meaningful phrases have been identified and kept, instead of removing the part of phrase as a stop word. The abbreviations appearing in many forms of entry are manually identified and only one form of abbreviations is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically from about 28,000 to 5000. The medical narrative text injury dataset, under consideration, is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, Matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers with these reduced feature space have been built. In experiments, a set of tests are conducted to reflect which classification method is best for the medical text classification. The Non Negative Matrix Factorization with Support Vector Machine method can achieve 93% precision which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting which works well for long text classification is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as it affects the classification performance.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

This thesis presents a promising boundary setting method for solving challenging issues in text classification to produce an effective text classifier. A classifier must identify boundary between classes optimally. However, after the features are selected, the boundary is still unclear with regard to mixed positive and negative documents. A classifier combination method to boost effectiveness of the classification model is also presented. The experiments carried out in the study demonstrate that the proposed classifier is promising.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Traditional text classification technology based on machine learning and data mining techniques has made a big progress. However, it is still a big problem on how to draw an exact decision boundary between relevant and irrelevant objects in binary classification due to much uncertainty produced in the process of the traditional algorithms. The proposed model CTTC (Centroid Training for Text Classification) aims to build an uncertainty boundary to absorb as many indeterminate objects as possible so as to elevate the certainty of the relevant and irrelevant groups through the centroid clustering and training process. The clustering starts from the two training subsets labelled as relevant or irrelevant respectively to create two principal centroid vectors by which all the training samples are further separated into three groups: POS, NEG and BND, with all the indeterminate objects absorbed into the uncertain decision boundary BND. Two pairs of centroid vectors are proposed to be trained and optimized through the subsequent iterative multi-learning process, all of which are proposed to collaboratively help predict the polarities of the incoming objects thereafter. For the assessment of the proposed model, F1 and Accuracy have been chosen as the key evaluation measures. We stress the F1 measure because it can display the overall performance improvement of the final classifier better than Accuracy. A large number of experiments have been completed using the proposed model on the Reuters Corpus Volume 1 (RCV1) which is important standard dataset in the field. The experiment results show that the proposed model has significantly improved the binary text classification performance in both F1 and Accuracy compared with three other influential baseline models.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In this article, we take a close look at the literacy demands of one task from the ‘Marvellous Micro-organisms Stage 3 Life and Living’ Primary Connections unit (Australian Academy of Science, 2005). One lesson from the unit, ‘Exploring Bread’, (pp 4-8) asks students to ‘use bread labels to locate ingredient information and synthesise understanding of bread ingredients’. We draw upon a framework offered by the New London Group (2000), that of linguistic, visual and spatial design, to consider in more detail three bread wrappers and from there the complex literacies that students need to interrelate to undertake the required task. Our findings are that although bread wrappers are an example of an everyday science text, their linguistic, visual and spatial designs and their interrelationship are not trivial. We conclude by reinforcing the need for teachers of science to also consider how the complex design elements of everyday science texts and their interrelated literacies are made visible through instructional practice.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the advent of Service Oriented Architecture, Web Services have gained tremendous popularity. Due to the availability of a large number of Web services, finding an appropriate Web service according to the requirement of the user is a challenge. This warrants the need to establish an effective and reliable process of Web service discovery. A considerable body of research has emerged to develop methods to improve the accuracy of Web service discovery to match the best service. The process of Web service discovery results in suggesting many individual services that partially fulfil the user’s interest. By considering the semantic relationships of words used in describing the services as well as the use of input and output parameters can lead to accurate Web service discovery. Appropriate linking of individual matched services should fully satisfy the requirements which the user is looking for. This research proposes to integrate a semantic model and a data mining technique to enhance the accuracy of Web service discovery. A novel three-phase Web service discovery methodology has been proposed. The first phase performs match-making to find semantically similar Web services for a user query. In order to perform semantic analysis on the content present in the Web service description language document, the support-based latent semantic kernel is constructed using an innovative concept of binning and merging on the large quantity of text documents covering diverse areas of domain of knowledge. The use of a generic latent semantic kernel constructed with a large number of terms helps to find the hidden meaning of the query terms which otherwise could not be found. Sometimes a single Web service is unable to fully satisfy the requirement of the user. In such cases, a composition of multiple inter-related Web services is presented to the user. The task of checking the possibility of linking multiple Web services is done in the second phase. Once the feasibility of linking Web services is checked, the objective is to provide the user with the best composition of Web services. In the link analysis phase, the Web services are modelled as nodes of a graph and an allpair shortest-path algorithm is applied to find the optimum path at the minimum cost for traversal. The third phase which is the system integration, integrates the results from the preceding two phases by using an original fusion algorithm in the fusion engine. Finally, the recommendation engine which is an integral part of the system integration phase makes the final recommendations including individual and composite Web services to the user. In order to evaluate the performance of the proposed method, extensive experimentation has been performed. Results of the proposed support-based semantic kernel method of Web service discovery are compared with the results of the standard keyword-based information-retrieval method and a clustering-based machine-learning method of Web service discovery. The proposed method outperforms both information-retrieval and machine-learning based methods. Experimental results and statistical analysis also show that the best Web services compositions are obtained by considering 10 to 15 Web services that are found in phase-I for linking. Empirical results also ascertain that the fusion engine boosts the accuracy of Web service discovery by combining the inputs from both the semantic analysis (phase-I) and the link analysis (phase-II) in a systematic fashion. Overall, the accuracy of Web service discovery with the proposed method shows a significant improvement over traditional discovery methods.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The recent focus on literacy in Social Studies has been on linguistic design, particularly that related to the grammar of written and spoken text. When students are expected to produce complex hybridized genres such as timelines, a focus on the teaching and learning of linguistic design is necessary but not sufficient to complete the task. Theorizations of new literacies identify five interrelated meaning making designs for text deconstruction and reproduction: linguistic, spatial, visual, gestural, and audio design. Honing in on the complexity of timelines, this paper casts a lens on the linguistic, visual, spatial, and gestural designs of three pairs of primary school aged Social Studies learners. Drawing on a functional metalanguage, we analyze the linguistic, visual, spatial, and gestural designs of their work. We also offer suggestions of their effect, and from there consider the importance of explicit instruction in text design choices for this Social Studies task. We conclude the analysis by suggesting the foci of explicit instruction for future lessons.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

It is a big challenge to clearly identify the boundary between positive and negative streams. Several attempts have used negative feedback to solve this challenge; however, there are two issues for using negative relevance feedback to improve the effectiveness of information filtering. The first one is how to select constructive negative samples in order to reduce the space of negative documents. The second issue is how to decide noisy extracted features that should be updated based on the selected negative samples. This paper proposes a pattern mining based approach to select some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on RCV1, and substantial experiments show that the proposed approach achieves encouraging performance.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. To learn to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With the document selection criteria better defined based on the users’ needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness. Term-based approaches are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing sematic information and have shown encouraging results for improving the effectiveness of the IF system. On the other hand, pattern discovery from large data streams is not computationally efficient. Also, these approaches had to deal with low frequency pattern issues. The measures used by the data mining technique (for example, “support” and “confidences”) to learn the profile have turned out to be not suitable for filtering. They can lead to a mismatch problem. This thesis uses the rough set-based reasoning (term-based) and pattern mining approach as a unified framework for information filtering to overcome the aforementioned problems. This system consists of two stages - topic filtering and pattern mining stages. The topic filtering stage is intended to minimize information overloading by filtering out the most likely irrelevant information based on the user profiles. A novel user-profiles learning method and a theoretical model of the threshold setting have been developed by using rough set decision theory. The second stage (pattern mining) aims at solving the problem of the information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy. The most likely relevant documents were assigned higher scores by the ranking function. Because there is a relatively small amount of documents left after the first stage, the computational cost is markedly reduced; at the same time, pattern discoveries yield more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on the well-known IR bench-marking processes, using the latest version of the Reuters dataset, namely, the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both the term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system outperforms significantly the other IF systems, such as the traditional Rocchio IF model, the state-of-the-art term-based models, including the BM25, Support Vector Machines (SVM), and Pattern Taxonomy Model (PTM).

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The wide range of contributing factors and circumstances surrounding crashes on road curves suggest that no single intervention can prevent these crashes. This paper presents a novel methodology, based on data mining techniques, to identify contributing factors and the relationship between them. It identifies contributing factors that influence the risk of a crash. Incident records, described using free text, from a large insurance company were analysed with rough set theory. Rough set theory was used to discover dependencies among data, and reasons using the vague, uncertain and imprecise information that characterised the insurance dataset. The results show that male drivers, who are between 50 and 59 years old, driving during evening peak hours are involved with a collision, had a lowest crash risk. Drivers between 25 and 29 years old, driving from around midnight to 6 am and in a new car has the highest risk. The analysis of the most significant contributing factors on curves suggests that drivers with driving experience of 25 to 42 years, who are driving a new vehicle have the highest crash cost risk, characterised by the vehicle running off the road and hitting a tree. This research complements existing statistically based tools approach to analyse road crashes. Our data mining approach is supported with proven theory and will allow road safety practitioners to effectively understand the dependencies between contributing factors and the crash type with the view to designing tailored countermeasures.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Over the last decade, the rapid growth and adoption of the World Wide Web has further exacerbated user needs for e±cient mechanisms for information and knowledge location, selection, and retrieval. How to gather useful and meaningful information from the Web becomes challenging to users. The capture of user information needs is key to delivering users' desired information, and user pro¯les can help to capture information needs. However, e®ectively acquiring user pro¯les is di±cult. It is argued that if user background knowledge can be speci¯ed by ontolo- gies, more accurate user pro¯les can be acquired and thus information needs can be captured e®ectively. Web users implicitly possess concept models that are obtained from their experience and education, and use the concept models in information gathering. Prior to this work, much research has attempted to use ontologies to specify user background knowledge and user concept models. However, these works have a drawback in that they cannot move beyond the subsumption of super - and sub-class structure to emphasising the speci¯c se- mantic relations in a single computational model. This has also been a challenge for years in the knowledge engineering community. Thus, using ontologies to represent user concept models and to acquire user pro¯les remains an unsolved problem in personalised Web information gathering and knowledge engineering. In this thesis, an ontology learning and mining model is proposed to acquire user pro¯les for personalised Web information gathering. The proposed compu- tational model emphasises the speci¯c is-a and part-of semantic relations in one computational model. The world knowledge and users' Local Instance Reposito- ries are used to attempt to discover and specify user background knowledge. From a world knowledge base, personalised ontologies are constructed by adopting au- tomatic or semi-automatic techniques to extract user interest concepts, focusing on user information needs. A multidimensional ontology mining method, Speci- ¯city and Exhaustivity, is also introduced in this thesis for analysing the user background knowledge discovered and speci¯ed in user personalised ontologies. The ontology learning and mining model is evaluated by comparing with human- based and state-of-the-art computational models in experiments, using a large, standard data set. The experimental results are promising for evaluation. The proposed ontology learning and mining model in this thesis helps to develop a better understanding of user pro¯le acquisition, thus providing better design of personalised Web information gathering systems. The contributions are increasingly signi¯cant, given both the rapid explosion of Web information in recent years and today's accessibility to the Internet and the full text world.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Globalisation and societal change suggest the language and literacy skills needed to make meaning in our lives are increasing and changing radically. Multiliteracies are influencing the future of literacy teaching. One aspect of the pedagogy of multiliteracies is recruiting learners’ previous and current experiences as an integral part of the learning experience. This paper examines the implications of results from a project that examined student responses to a postmodern picture book, in particular, ways teachers might develop students’ self-knowledge about reading. It draws on Freebody and Luke’s Four Resources Model of Reading and recently developed models for teaching multiliteracies.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Road curves are an important feature of road infrastructure and many serious crashes occur on road curves. In Queensland, the number of fatalities is twice as many on curves as that on straight roads. Therefore, there is a need to reduce drivers’ exposure to crash risk on road curves. Road crashes in Australia and in the Organisation for Economic Co-operation and Development(OECD) have plateaued in the last five years (2004 to 2008) and the road safety community is desperately seeking innovative interventions to reduce the number of crashes. However, designing an innovative and effective intervention may prove to be difficult as it relies on providing theoretical foundation, coherence, understanding, and structure to both the design and validation of the efficiency of the new intervention. Researchers from multiple disciplines have developed various models to determine the contributing factors for crashes on road curves with a view towards reducing the crash rate. However, most of the existing methods are based on statistical analysis of contributing factors described in government crash reports. In order to further explore the contributing factors related to crashes on road curves, this thesis designs a novel method to analyse and validate these contributing factors. The use of crash claim reports from an insurance company is proposed for analysis using data mining techniques. To the best of our knowledge, this is the first attempt to use data mining techniques to analyse crashes on road curves. Text mining technique is employed as the reports consist of thousands of textual descriptions and hence, text mining is able to identify the contributing factors. Besides identifying the contributing factors, limited studies to date have investigated the relationships between these factors, especially for crashes on road curves. Thus, this study proposed the use of the rough set analysis technique to determine these relationships. The results from this analysis are used to assess the effect of these contributing factors on crash severity. The findings obtained through the use of data mining techniques presented in this thesis, have been found to be consistent with existing identified contributing factors. Furthermore, this thesis has identified new contributing factors towards crashes and the relationships between them. A significant pattern related with crash severity is the time of the day where severe road crashes occur more frequently in the evening or night time. Tree collision is another common pattern where crashes that occur in the morning and involves hitting a tree are likely to have a higher crash severity. Another factor that influences crash severity is the age of the driver. Most age groups face a high crash severity except for drivers between 60 and 100 years old, who have the lowest crash severity. The significant relationship identified between contributing factors consists of the time of the crash, the manufactured year of the vehicle, the age of the driver and hitting a tree. Having identified new contributing factors and relationships, a validation process is carried out using a traffic simulator in order to determine their accuracy. The validation process indicates that the results are accurate. This demonstrates that data mining techniques are a powerful tool in road safety research, and can be usefully applied within the Intelligent Transport System (ITS) domain. The research presented in this thesis provides an insight into the complexity of crashes on road curves. The findings of this research have important implications for both practitioners and academics. For road safety practitioners, the results from this research illustrate practical benefits for the design of interventions for road curves that will potentially help in decreasing related injuries and fatalities. For academics, this research opens up a new research methodology to assess crash severity, related to road crashes on curves.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Libertine erotic novellas included a number of seductive descriptions of unfolding spaces often seen through the eyes of a narrator. Instructional volumes such as Point de lendermain by Vivant Denon (1777) aimed at the sexual education of young women and the titillation of men also followed suit. Similarly architectural theory such as Le Camus de Mézières’, The Genius of Architecture (1780) also promoted the sensuous and seductive aspects of surfaces and spatial arrangements. In the erotic settings of the cabinet, descriptions of curtains generate as much arousal as the outline of a naked body, and for some players it is the space that is desired above their lover.