Abstract:
With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with many kinds of contextual information. Those contexts can be explicit, such as the time and location where a blog article is written or the author(s) of a biomedical publication, or implicit, such as the positive or negative sentiment an author had when writing a product review; there may also be complex contexts, such as the social network of the authors. Many applications require analysis of topic patterns over different contexts. For instance, analyzing search logs in the context of the user can reveal how to improve the quality of a search engine by optimizing search results for particular users; analyzing customer reviews in the context of positive and negative sentiments can help users summarize public opinion about a product; analyzing blogs or scientific publications in the context of a social network can facilitate the discovery of more meaningful topical communities. Since context information significantly affects the choices of topics and language made by authors, it is very important to incorporate it when analyzing and mining text data. Modeling the context in text and discovering contextual patterns of language units and topics from text, a general task we refer to as Contextual Text Mining, thus has widespread applications in text mining. In this thesis, we provide a novel and systematic study of contextual text mining, a new paradigm of text mining that treats context information as a "first-class citizen." We formally define the problem of contextual text mining and its basic tasks, and propose a general framework for contextual text mining based on generative modeling of text. This conceptual framework provides general guidance on text mining problems with context information and can be instantiated into many real tasks, including the general problem of contextual topic analysis. We formally present a functional framework for contextual topic analysis, with a general contextual topic model and its various versions, which can effectively solve text mining problems in many real-world applications. We further introduce general components of contextual topic analysis: adding priors to contextual topic models to incorporate prior knowledge, regularizing contextual topic models with the dependency structure of the context, and postprocessing contextual patterns to extract refined patterns. These refinements of the general contextual topic model naturally lead to a variety of probabilistic models that incorporate different types of context and various assumptions and constraints. These special versions of the contextual topic model prove effective in a variety of real applications involving topics and explicit, implicit, and complex contexts. We then introduce a postprocessing procedure for contextual patterns that generates meaningful labels for multinomial context models. This method provides a general way to interpret text mining results for real users. By applying contextual text mining in the "context" of other text information management tasks, including ad hoc text retrieval and web search, we further demonstrate the effectiveness of contextual text mining techniques quantitatively on large-scale datasets.
The framework of contextual text mining not only unifies many explorations of text analysis with context information, but also opens up many new possibilities for future research directions in text mining.
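As a rough illustration of the generative view taken in the thesis, a simple contextual topic mixture (our notation, not necessarily the thesis's exact model) generates each word w of a document d with context c from k topics whose weights and word distributions may both depend on the context:

p(w \mid d, c) = \sum_{j=1}^{k} p(j \mid d, c) \, p(w \mid j, c)

Here p(j | d, c) is a context-dependent topic weight and p(w | j, c) is a context-specific topic word distribution; fitting such a model (e.g., with EM) and comparing the fitted distributions across contexts yields contextual topic patterns of the kind described above.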
Abstract:
This research arises from data obtained at the Institución Educativa La Buitrera, José María García de Toledo campus, with students in a learning-acceleration programme, a methodology known as a flexible model, aimed at over-age students in conditions of vulnerability. Different strategies were developed to establish what knowledge some adolescents have about STDs and which activities could be used to address these topics at school. Based on the responses obtained, the possibility of teaching these topics using new technologies was considered; accordingly, a material (an educational blog) was designed with the contributions and activities of the students who participated in the search for informative material on STDs, grounded in instructional design theories and the EAC (constructivist learning environment). To carry out the proposal, a set of aims was formulated to guide the design of the blog, taking into account the needs identified in the students. The methodological proposal was developed in several phases, which included a bibliographic review, the application of a questionnaire, the design of templates, and the design of the blog with a didactic and pedagogical approach that involves the student in processes of communication and social interaction. The blog also covers specific content on STDs, together with activities that encourage students to participate and contribute possible solutions to problems related to the topic through the virtual environment, facilitating interactivity and communication, since in the classroom young people are shy about voicing their opinions on sexuality. Finally, it can be concluded that the methodology (problem solving) chosen for the design of the blog favours the learning of the STD topic, as evidenced by the contributions of each student to the construction of a material that can give them more clarity on the subject.
Abstract:
Visual recognition is a fundamental research topic in computer vision. This dissertation explores datasets, features, learning, and models used for visual recognition. In order to train visual models and evaluate different recognition algorithms, this dissertation develops an approach to collecting object image datasets from web pages by analyzing both the text around an image and the image's appearance. The method exploits established online knowledge resources (Wikipedia pages for text; the Flickr and Caltech data sets for images), which provide rich text and object appearance information. This dissertation describes results on two datasets. The first is Berg's collection of 10 animal categories; on this dataset, we significantly outperform previous approaches. On an additional set of 5 categories, experimental results show the effectiveness of the method. Images are represented as features for visual recognition. This dissertation introduces a text-based image feature and demonstrates that it consistently improves performance on hard object classification problems. The feature is built using an auxiliary dataset of tag-annotated images downloaded from the Internet. Image tags are noisy, so the method obtains the text feature of an unannotated image from the tags of its k nearest neighbors in this auxiliary collection. A visual classifier presented with an object viewed under novel circumstances (say, a new viewing direction) must rely on its visual examples; the text feature, however, may not change, because the auxiliary dataset likely contains a similar picture. While the tags associated with images are noisy, they are more stable than appearance when viewing conditions change. The performance of this feature is tested on the PASCAL VOC 2006 and 2007 datasets. The feature performs well; it consistently improves the performance of visual object classifiers, and is particularly effective when the training dataset is small. As more and more training data are collected, computational cost becomes a bottleneck, especially when training sophisticated classifiers such as kernelized SVMs. This dissertation proposes a fast training algorithm called the Stochastic Intersection Kernel Machine (SIKMA). The proposed training method will be useful for many vision problems, as it can produce a kernel classifier that is more accurate than a linear classifier and can be trained on tens of thousands of examples in two minutes. It processes training examples one by one in sequence, so memory is no longer the bottleneck when processing large-scale datasets. This dissertation applies this approach to train classifiers for Flickr groups with many training examples per group. The resulting Flickr group prediction scores can be used to measure similarity between two images. Experimental results on the Corel dataset and a PASCAL VOC dataset show that the learned Flickr features outperform conventional visual features on image matching, retrieval, and classification. Visual models are usually trained to best separate positive and negative training examples. However, when recognizing a large number of object categories, there may not be enough training examples for most objects, due to the intrinsic long-tailed distribution of objects in the real world. This dissertation proposes an approach that uses comparative object similarity. The key insight is that, given a set of object categories which are similar and a set of categories which are dissimilar, a good object model should respond more strongly to examples from similar categories than to examples from dissimilar categories. This dissertation develops a regularized kernel machine algorithm that uses this category-dependent similarity regularization. Experiments on hundreds of categories show that the method significantly improves performance for categories with few or even no positive examples.
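A minimal sketch of the k-nearest-neighbor text feature described above; the function name, the tag-histogram form, and the use of plain Euclidean distance on visual descriptors are our illustrative assumptions, not the dissertation's implementation:

import numpy as np

def knn_text_feature(query_visual, aux_visual, aux_tags, vocab, k=25):
    # query_visual: (d,) visual descriptor of the unannotated image.
    # aux_visual: (n, d) descriptors of the auxiliary tagged images.
    # aux_tags: list of n tag lists; vocab: dict mapping tag -> index.
    dists = np.linalg.norm(aux_visual - query_visual, axis=1)
    hist = np.zeros(len(vocab))
    for i in np.argsort(dists)[:k]:        # k visually nearest neighbors
        for tag in aux_tags[i]:
            if tag in vocab:
                hist[vocab[tag]] += 1.0    # accumulate neighbor tags
    total = hist.sum()
    return hist / total if total > 0 else hist

Aggregating over k neighbors is what makes the feature tolerant of noisy individual tags: a single mislabeled neighbor contributes only 1/k of the mass.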
Abstract:
A collaboration between dot.rural at the University of Aberdeen and the iSchool at Northumbria University, POWkist is a pilot study exploring potential usages of currently available linked datasets within the cultural heritage domain. Many privately held family history collections (shoebox archives) remain vulnerable unless a sustainable, affordable and accessible model of citizen-archivist digital preservation can be offered. Citizen-historians have used the web as a platform to preserve cultural heritage; however, with no accessible or sustainable model, these digital footprints have been ad hoc and rarely connected to broader historical research. Similarly, current approaches to connecting material on the web by exploiting linked datasets do not take into account the data characteristics of the cultural heritage domain. Funded by Semantic Media, the POWkist project is investigating how best to capture, curate, connect and present the contents of citizen-historians' shoebox archives in an accessible and sustainable online collection. Using the Curios platform - an open-source digital archive - we have digitised a collection relating to a prisoner of war during WWII (1939-1945). Following a series of user group workshops, POWkist is now connecting these 'made digital' items with the broader web using a semantic technology model and identifying appropriate linked datasets of relevant content, such as DBpedia (a linked dataset derived from Wikipedia) and Ordnance Survey Open Data. We are analysing the characteristics of cultural heritage linked datasets so that these materials are better visualised, contextualised and presented in an attractive and comprehensive user interface. Our paper will consider the issues we have identified and the solutions we are developing, and include a demonstration of our work-in-progress.
Abstract:
Blogging is one of the most common forms of social media today. Blogs have become a powerful medium, and bloggers are established stakeholders for marketers. Commercialization of the blogosphere has enabled an increasing number of bloggers to professionalize and blog as a full-time occupation. The purpose of this study is to understand the professionalization process of a blogger from amateur blogger to professional actor. The following sub-questions were used to further elaborate the topic: What have been the meaningful events and developments fostering professionalization? What are the prerequisites for popularity in blogging? Are there key success factors to acknowledge in order to be able to make a business out of a blog? The theoretical framework of this study was formed based on the two chosen focus areas of professionalization: social drivers and business drivers. The theoretical framework draws on literature from the fields of marketing and social sciences, as well as previous research on social media, blogging and professionalization. The study is a qualitative case study, and the research data was collected in a semi-structured interview. The case chosen for this study is a lifestyle blog whose writer has been able to develop her blog into a full-time professional occupation. Based on the results, the professionalization process of a blogger is not a well-defined process, but is instead comprised of coincidental events as well as deliberate advancements. Success in blogging is based on the blogger's own motivation and passion for writing and expressing oneself in the form of a blog, rather than on the systematic construction of a successful blogging career. Networking with other bloggers as well as with affiliates was seen as an important success factor. Popularity in the blogosphere and a high number of followers enable professionalization, as marketers actively seek to collaborate with popular bloggers with strong personal brands. Bloggers with strong personal brands are especially attractive due to their opinion leadership in their reference group. A blogger can act professionally either as an entrepreneur or by blogging for a commercial webpage. According to the results of this study, it is beneficial for a blogger's professional development and career progress to work across different operating models.
Abstract:
Global land cover maps play an important role in the understanding of the Earth's ecosystem dynamics. Several global land cover maps have been produced recently, namely Global Land Cover Share (GLC-Share) and GlobeLand30. These datasets are very useful sources of land cover information, and potential users and producers are often interested in comparing them. However, these global land cover maps are produced with different techniques and different classification schemes, making their standardized interoperability a challenge. The Environmental Information and Observation Network (EIONET) Action Group on Land Monitoring in Europe (EAGLE) concept was developed to translate the differences between classification schemes into a standardized format that allows class definitions to be compared. This is done by elaborating an EAGLE matrix for each classification scheme, in which a bar code is assigned to each class definition that composes a given land cover class. Ahlqvist (2005) developed an overlap metric to cope with the semantic uncertainty of geographical concepts, thereby providing a measure of how closely geographical concepts are related to each other. In this paper, global land cover datasets are compared by translating each land cover legend into the EAGLE bar coding for the Land Cover Components of the EAGLE matrix. The bar coding values assigned to each class definition are transformed into a fuzzy membership function that is used to compute the overlap metric proposed by Ahlqvist (2005), and overlap matrices between land cover legends are elaborated. The overlap matrices allow a semantic comparison between the classification schemes of each global land cover map. The proposed methodology is tested on a case study in which the overlap metric proposed by Ahlqvist (2005) is computed to compare two global land cover maps for Continental Portugal. The study yielded the spatial distribution of overlap between the two global land cover maps, GlobeLand30 and GLC-Share. These results show that the GlobeLand30 product overlaps to a degree of 77% with the GLC-Share product in Continental Portugal.
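A minimal sketch of the kind of fuzzy overlap computation involved; the min/sum form is one common instantiation of Ahlqvist's (2005) metric, and the encoding of EAGLE bar codes as membership vectors is our illustrative assumption:

import numpy as np

def overlap(a, b):
    # a, b: membership vectors over the same list of EAGLE Land Cover
    # Components; returns the share of a's membership mass also covered by b.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.minimum(a, b).sum() / a.sum()

# Example with two hypothetical class encodings over four components.
class_a = [1.0, 0.5, 0.5, 0.0]
class_b = [1.0, 1.0, 0.0, 0.0]
print(overlap(class_a, class_b))  # 0.75

Note that the metric is asymmetric: overlap(class_a, class_b) measures how much of A's definition is covered by B, which is why full overlap matrices between legends are elaborated rather than single values.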
Abstract:
Inter-subject parcellation of functional Magnetic Resonance Imaging (fMRI) data based on a standard General Linear Model (GLM) and spectral clustering was recently proposed as a means to alleviate the issues associated with spatial normalization in fMRI. However, for all its appeal, a GLM-based parcellation approach introduces its own biases, in the form of a priori knowledge about the shape of the Hemodynamic Response Function (HRF) and task-related signal changes, or about the subject's behaviour during the task. In this paper, we introduce a data-driven version of the spectral clustering parcellation, based on Independent Component Analysis (ICA) and Partial Least Squares (PLS) instead of the GLM. First, a number of independent components are automatically selected. Seed voxels are then obtained from the associated ICA maps, and we compute the PLS latent variables between the fMRI signal of the seed voxels (which covers regional variations of the HRF) and the principal components of the signal across all voxels. Finally, we parcellate all subjects' data with a spectral clustering of the PLS latent variables. We present results of the application of the proposed method on both single-subject and multi-subject fMRI datasets. Preliminary experimental results, evaluated with the intra-parcel variance of GLM t-values and PLS-derived t-values, indicate that this data-driven approach improves parcellation accuracy over GLM-based techniques.
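A minimal sketch of this pipeline using scikit-learn; the component counts, the seed-selection rule, and the parcel number are illustrative placeholders rather than the authors' settings:

import numpy as np
from sklearn.decomposition import FastICA, PCA
from sklearn.cross_decomposition import PLSCanonical
from sklearn.cluster import SpectralClustering

def parcellate(fmri, n_components=20, seeds_per_map=50, n_parcels=100):
    # fmri: array of shape (n_voxels, n_timepoints)
    # 1. Spatial ICA: voxels as samples, time points as features.
    maps = FastICA(n_components=n_components,
                   random_state=0).fit_transform(fmri)    # (n_voxels, n_comp)
    # 2. Seed voxels: strongest absolute weights in each ICA map.
    idx = np.abs(maps).argsort(axis=0)[-seeds_per_map:]
    seeds = fmri[np.unique(idx.ravel())]                  # (n_seeds, T)
    # 3. PLS between seed time courses and principal components of all voxels.
    pcs = PCA(n_components=n_components).fit_transform(fmri.T)  # (T, n_comp)
    pls = PLSCanonical(n_components=10).fit(seeds.T, pcs)
    scores = pls.transform(seeds.T)                       # (T, 10) latent time courses
    latent = fmri @ scores                                # (n_voxels, 10) loadings
    # 4. Spectral clustering of the latent variables gives the parcellation.
    return SpectralClustering(n_clusters=n_parcels,
                              affinity='nearest_neighbors').fit_predict(latent)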
Abstract:
Forecasting abrupt variations in wind power generation (so-called ramps) helps achieve large-scale wind power integration. One of the main issues to be confronted when addressing wind power ramp forecasting is how to identify relevant information in large datasets to optimally feed forecasting models. To this end, an innovative methodology for systematically relating multivariate datasets to ramp events is presented. The methodology comprises two stages: the identification of relevant features in the data and the assessment of the dependence between these features and ramp occurrence. As a test case, the proposed methodology was employed to explore the relationships between atmospheric dynamics at the global/synoptic scales and ramp events experienced at two wind farms located in Spain. The results suggested different degrees of connection between these atmospheric scales and ramp occurrence. For one of the wind farms, ramp events could be partly explained by regional circulations and zonal pressure gradients. To perform a comprehensive analysis of the underlying causes of ramps, the proposed methodology could be applied to datasets related to other stages of the wind-to-power conversion chain.
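A minimal sketch of ramp-event identification from a wind farm's power series; the definition used here (absolute change over a fixed window exceeding a threshold) and all parameter values are illustrative assumptions, not the criteria used in the study:

import numpy as np

def find_ramps(power, window=4, threshold=0.3):
    # power: per-unit farm output (0..1) sampled hourly; returns the start
    # hours t at which |power[t + window] - power[t]| exceeds the threshold.
    diff = power[window:] - power[:-window]
    return np.flatnonzero(np.abs(diff) > threshold)

The flagged events can then be cross-referenced with candidate atmospheric features (e.g., regional circulations or zonal pressure gradients) in the dependence-assessment stage.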
Abstract:
ABSTRACT: This article discusses contemporary female poetry, specifically in the era of blogs. It explores the idea that contemporary female poetic production moves beyond the phallocentric tradition and religious mysticism that previously helped suffocate women's voices and confine their practices of reading and writing, enabling the expansion of a hitherto silent poetry. The text examines elements of current female creation, placing poetry written by women in a virtual world defined by diversity of environment, theme and language. The paper presents data from a literary experience on the internet by twelve writers from all over Brazil, who differ in their professional and educational daily practices. In discussing the multiplicity of forms of contemporary poetics in women's writing, it presents the case study "Maria Clara: uniVersos femininos", a collection born in a new genre and context, in the realm of gender and poetry. KEYWORDS: Gender and poetry. Blog era. Maria Clara: uniVersos femininos.
Abstract:
The following technical report describes the approach and algorithm used to detect marine mammals in aerial imagery taken from manned/unmanned platforms. The aim is to automate the process of counting the population of dugongs and other mammals. We have developed an algorithm that automatically presents a user with a number of possible candidate detections of these mammals. We tested the algorithm on two distinct datasets taken from different altitudes. Analysis and discussion are presented regarding the complexity of the input datasets and the detection performance.
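A minimal sketch of one way such candidate detection can be framed, assuming targets appear as compact dark blobs against the brighter water surface; the thresholding scheme and size limits are our illustrative assumptions, not the report's method or values:

import cv2
import numpy as np

def detect_candidates(image_bgr, min_area=200, max_area=5000):
    # Segment dark regions against the brighter water via Otsu thresholding.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        area = cv2.contourArea(c)
        if min_area <= area <= max_area:       # reject noise and glare patches
            boxes.append(cv2.boundingRect(c))  # (x, y, w, h) shown to the user
    return boxes

Because imagery is captured at different altitudes, the area limits would in practice need rescaling by the ground sampling distance of each dataset.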