984 results for Clustering a large document collection


Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper evaluates the efficiency of a number of popular corpus-based distributional models in performing discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections from text, often published literature, that could lead to the development of new techniques or technologies. Literature-based discovery has attracted growing research interest ever since Swanson's serendipitous discovery of the therapeutic effects of fish oil on Raynaud's disease in 1986. The successful application of distributional models in automating the identification of indirect associations underpinning literature-based discovery has been extensively demonstrated in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks including predicting future disruptive innovations. In this paper we perform a computational complexity analysis on four successful corpus-based distributional models to evaluate their fit for such tasks. Our results indicate that corpus-based distributional models that store their representations in fixed dimensions provide superior efficiency on literature-based discovery tasks.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

We use the BBGKY hierarchy equations to calculate, perturbatively, the lowest order nonlinear correction to the two-point correlation and the pair velocity for Gaussian initial conditions in a critical density matter-dominated cosmological model. We compare our results with the results obtained using the hydrodynamic equations that neglect pressure and find that the two match, indicating that there are no effects of multistreaming at this order of perturbation. We analytically study the effect of small scales on the large scales by calculating the nonlinear correction for a Dirac delta function initial two-point correlation. We find that the induced two-point correlation has a $x^{-6}$ behavior at large separations. We have considered a class of initial conditions where the initial power spectrum at small $k$ has the form $k^{n}$ with $0 < n \le 3$ and have numerically calculated the nonlinear correction to the two-point correlation, its average over a sphere and the pair velocity over a large dynamical range. We find that at small separations the effect of the nonlinear term is to enhance the clustering, whereas at intermediate scales it can act to either increase or decrease the clustering. At large scales we find a simple formula that gives a very good fit for the nonlinear correction in terms of the initial function. This formula explicitly exhibits the influence of small scales on large scales and because of this coupling the perturbative treatment breaks down at large scales much before one would expect it to if the nonlinearity were local in real space. We physically interpret this formula in terms of a simple diffusion process. We have also investigated the case $n = 0$, and we find that it differs from the other cases in certain respects.
We investigate a recently proposed scaling property of gravitational clustering, and we find that the lowest order nonlinear terms cause deviations from the scaling relations that are strictly valid in the linear regime. The approximate validity of these relations in the nonlinear regime in N-body simulations cannot be understood at this order of evolution.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this article, we present a novel application of a quantum clustering (QC) technique to objectively cluster the conformations sampled by molecular dynamics simulations performed on different ligand-bound structures of the protein. We further portray each conformational population in terms of dynamically stable network parameters which beautifully capture the ligand-induced variations in the ensemble in atomistic detail. The conformational populations thus identified by the QC method and verified by network parameters are evaluated for different ligand-bound states of the protein pyrrolysyl-tRNA synthetase (DhPylRS) from D. hafniense. The ligand/environment-induced re-distribution of protein conformational ensembles forms the basis for understanding several important biological phenomena such as allostery and enzyme catalysis. The atomistic level characterization of each population in the conformational ensemble in terms of the re-orchestrated networks of amino acids is a challenging problem, especially when the changes are minimal at the backbone level. Here we demonstrate that the QC method is sensitive to such subtle changes and is able to cluster MD snapshots which are similar at the side-chain interaction level. Although we have applied these methods to simulation trajectories of a modest time scale (20 ns each), we emphasize that our methodology provides a general approach towards an objective clustering of large-scale MD simulation data and may be applied to probe multistate equilibria at higher time scales, and to problems related to protein folding for any protein or protein-protein/RNA/DNA complex of interest with a known structure.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Many plain text information hiding techniques demand deep semantic processing, and so suffer in reliability. In contrast, syntactic processing is a more mature and reliable technology. Assuming a perfect parser, this paper evaluates a set of automated and reversible syntactic transforms that can hide information in plain text without changing the meaning or style of a document. A large representative collection of newspaper text is fed through a prototype system. In contrast to previous work, the output is subjected to human testing to verify that the text has not been significantly compromised by the information hiding procedure, yielding a success rate of 96% and bandwidth of 0.3 bits per sentence. © 2007 SPIE-IS&T.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Limited research has been conducted concerning the actual practice of health education in Victorian schools. This study investigates the health education curriculum at a large primary school in the south-eastern suburbs of Melbourne. The investigation involves a critical analysis of current practices in health education in the upper school through the development of a ‘small’ action research group. Data were gathered through document collection, questionnaires, interviews, discussions, diary and reflective journal entries. The action research group, consisting of the teacher-researcher and upper school teachers, developed, implemented and reflected upon units of work piloted with upper school students. Alternative approaches to health education were explored. The aim was to accommodate critically informed discourse amongst colleagues to promote self-reflective enquiry and facilitate improvements to existing pedagogic practices. During the course of the investigation, factors limiting and facilitating action research and curriculum change in health education became evident. These included personal, practical, curriculum and organisational constraints operating externally and internally on the school and classroom environments. Despite these constraints, this study demonstrated that action research can contribute to the improvement of pedagogic practices in health education. Small ‘authentic’ action research projects may provide alternative internal professional development structures for teachers and consequently improve learning opportunities for students.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Paper presented at the Cross-Language Evaluation Forum (CLEF 2008), Aarhus, Denmark, September 17-19, 2008.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The objective of this work was to design, construct and commission a new ablative pyrolysis reactor and a high efficiency product collection system. The reactor was to have a nominal throughput of 10 kg/hr of dry biomass and be inherently scalable up to an industrial scale application of 10 tonnes/hr. The whole process consists of a bladed ablative pyrolysis reactor, two high efficiency cyclones for char removal and a disk and doughnut quench column combined with a wet walled electrostatic precipitator, which is directly mounted on top, for liquids collection. In order to aid design and scale-up calculations, detailed mathematical modelling was undertaken of the reaction system enabling sizes, efficiencies and operating conditions to be determined. Specifically, a modular approach was taken due to the iterative nature of some of the design methodologies, with the output from one module being the input to the next. Separate modules were developed for the determination of the biomass ablation rate, specification of the reactor capacity, cyclone design, quench column design and electrostatic precipitator design. These models enabled a rigorous design protocol to be developed capable of specifying the required reactor and product collection system size for specified biomass throughputs, operating conditions and collection efficiencies. The reactor proved capable of generating an ablation rate of 0.63 mm/s for pine wood at a temperature of 525 °C with a relative velocity between the heated surface and reacting biomass particle of 12.1 m/s. The reactor achieved a maximum throughput of 2.3 kg/hr, which was the maximum the biomass feeder could supply. The reactor is capable of being operated at a far higher throughput but this would require a new feeder and drive motor to be purchased. Modelling showed that the reactor is capable of achieving a reactor throughput of approximately 30 kg/hr.
This is an area that should be considered for the future as the reactor is currently operating well below its theoretical maximum. Calculations show that the current product collection system could operate efficiently up to a maximum feed rate of 10 kg/hr, provided the inert gas supply was adjusted accordingly to keep the vapour residence time in the electrostatic precipitator above one second. Operation above 10 kg/hr would require some modifications to the product collection system. Eight experimental runs were documented and considered successful; more were attempted but had to be abandoned due to equipment failure. This does not detract from the fact that the reactor and product collection system design was extremely efficient. The maximum total liquid yield was 64.9% on a dry wood fed basis. It is considered that the liquid yield would have been higher had there been sufficient development time to overcome certain operational difficulties and if longer operating runs had been attempted to offset product losses occurring due to the difficulties in collecting all available product from a large scale collection unit. The liquids collection system was highly efficient and modelling determined a liquid collection efficiency of above 99% on a mass basis. This was validated experimentally: a dry ice/acetone condenser and a cotton wool filter downstream of the collection unit enabled mass measurements of the amount of condensable product exiting the product collection unit. This showed that the collection efficiency was in excess of 99% on a mass basis.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated every day in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems are studied to explore the integration of data mining and data visualization: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users' efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Traffic subarea division is vital for traffic system management and traffic network analysis in intelligent transportation systems (ITSs). Since existing methods may not be suitable for big traffic data processing, this paper presents a MapReduce-based Parallel Three-Phase K-Means (Par3PKM) algorithm for solving the traffic subarea division problem on the widely adopted Hadoop distributed computing platform. Specifically, we first modify the distance metric and initialization strategy of K-Means and then employ a MapReduce paradigm to redesign the optimized K-Means algorithm for parallel clustering of large-scale taxi trajectories. Moreover, we propose a boundary identifying method to connect the borders of clustering results for each cluster. Finally, we divide the traffic subareas of Beijing based on real-world trajectory data sets generated by 12,000 taxis in a period of one month using the proposed approach. Experimental evaluation results indicate that when compared with K-Means, Par2PK-Means, and ParCLARA, Par3PKM achieves higher efficiency, more accuracy, and better scalability and can effectively divide traffic subareas with big taxi trajectory data.
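
As a rough illustration of how a K-Means iteration splits into a map step and a reduce step, the sketch below runs both phases in a single Python process. It is not the Par3PKM algorithm: it assumes plain Euclidean distance and caller-supplied initial centroids, whereas the abstract describes a modified distance metric and initialization strategy that are not specified here. The function names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        # Assumption: Euclidean distance, not the paper's modified metric.
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: replace each centroid with the mean of its assigned points."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # centroids with no points stay unchanged
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Repeat map and reduce for a fixed number of iterations."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

On a real cluster the map step would run on separate partitions of the trajectory data and the reduce step would receive the pairs grouped by centroid index; here both are ordinary function calls so the data flow is easy to follow.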

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The emergence of ePortfolios is relatively recent in the university sector as a way to engage students in their learning and assessment, and to produce records of their accomplishments. An ePortfolio is an online tool that students can utilise to record, catalogue, retrieve and present reflections and artefacts that support and demonstrate the development of graduate students’ capabilities and professional standards across university courses. The ePortfolio is therefore considered as both process and product. Although ePortfolios show promise as a useful tool and their uptake has grown, they are not yet a mainstream higher education technology. To date, the emphasis has been on investigating their potential to support the multiple purposes of learning, assessment and employability, but less is known about whether and how students engage with ePortfolios in the university setting. This thesis investigates student engagement with an ePortfolio in one university. As the educational designer for the ePortfolio project at the University, I was uniquely positioned as a researching professional to undertake an inquiry into whether students were engaging with the ePortfolio. The participants in this study were a cohort (defined by enrolment in a unit of study) of second and third year education students (n=105) enrolled in a four year Bachelor of Education degree. The students were introduced to the ePortfolio in an introductory lecture and a hands-on workshop in a computer laboratory. They were subsequently required to complete a compulsory assessment task – a critical reflection - using the ePortfolio. Following that, engagement with the ePortfolio was voluntary. A single case study approach arising from an interpretivist paradigm directed the methodological approach and research design for this study. 
The study investigated the participants’ own accounts of their experiences with the ePortfolio, including how and when they engaged with the ePortfolio and the factors that impacted on their engagement. Data collection methods consisted of an attitude survey, student interviews, document collection, a researcher reflective journal and researcher observations. The findings of the study show that, while the students were encouraged to use the ePortfolio as a learning and employability tool, most students ultimately chose to disengage after completing the assessment task. Only six of the forty-five students (13%) who completed the research survey had used the ePortfolio in a sustained manner. The data obtained from the students during this research has provided insight into reasons why they disengaged from the ePortfolio. The findings add to the understandings and descriptions of student engagement with technology, and more broadly, advance the understanding of ePortfolios. These findings also contribute to the interdisciplinary field of technology implementation. There are three key outcomes from this study: a model of student engagement with technology, a set of criteria for the design of an ePortfolio, and a set of recommendations for effective practice for those implementing ePortfolios. The first, the Model of Student Engagement with Technology (MSET) (Version 2), explored student engagement with technology by highlighting key engagement decision points for students. The model was initially conceptualised by building on the work of previous research (Version 1); however, following data analysis, a new model emerged, MSET (Version 2).
The engagement decision points were identified as:
• Prior Knowledge and Experience, leading to imagined usefulness and imagined ease of use;
• Initial Supported Engagement, leading to supported experience of usefulness and supported ease of use;
• Initial Independent Engagement, leading to actual experience of independent usefulness and actual ease of use; and
• Ongoing Independent Engagement, leading to ongoing experience of usefulness and ongoing ease of use.
The Model of Student Engagement with Technology (MSET) goes beyond numerical figures of usage to demonstrate student engagement with an ePortfolio. The explanatory power of the model is based on the identification of the types of decisions that students make and when they make them during the engagement process. This model presents a greater depth of understanding student engagement than was previously available and has implications for the direction and timing of future implementation, and academic and student development activities. The second key outcome from this study is a set of criteria for the re-conceptualisation of the University ePortfolio. The knowledge gained from this research has resulted in a new set of design criteria that focus on the student actions of writing reflections and adding artefacts. The process of using the ePortfolio is reconceptualised in terms of privileging student learning over administrative compliance. The focus of the ePortfolio is that the writing of critical reflections is the key function, not the selection of capabilities. The third key outcome from this research consists of five recommendations for university practice that have arisen from this study.
They are that sustainable implementation is more often achieved through small steps building on one another; that a clear definition of the purpose of an ePortfolio is crucial for students and staff; that ePortfolio pedagogy should be the driving force, not the technology; that the merit of the ePortfolio is fostered in students and staff; and finally, that supporting delayed task performance is crucial. Students do not adopt an ePortfolio just because it is provided. While students must accept responsibility for their own engagement with the ePortfolio, the institution has to accept responsibility for providing the environment, and technical and pedagogical support to foster engagement. Ultimately, an ePortfolio should be considered as a joint venture between student and institution where strong returns on investment can be realised by both. It is acknowledged that the current implementation strategies for the ePortfolio are just the beginning of a much longer process. The real rewards for students, academics and the university lie in the future.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named the CJK2E Wikipedia XML corpus. The corpus could be used by the information retrieval research community and for knowledge sharing in Wikipedia in many ways; for example, it could be used for experiments in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to further expand the current coverage of the English Wikipedia.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper reports on the 2nd ShARe/CLEFeHealth evaluation lab which continues our evaluation resource building activities for the medical domain. In this lab we focus on patients' information needs as opposed to the more common campaign focus of the specialised information needs of physicians and other healthcare workers. The usage scenario of the lab is to make it easier for patients and their next-of-kin to understand eHealth information, in particular clinical reports. The 1st ShARe/CLEFeHealth evaluation lab was held in 2013. This lab consisted of three tasks. Task 1 focused on named entity recognition and normalization of disorders; Task 2 on normalization of acronyms/abbreviations; and Task 3 on information retrieval to address questions patients may have when reading clinical reports. This year's lab introduces a new challenge in Task 1 on visual-interactive search and exploration of eHealth data. Its aim is to help patients (or their next-of-kin) with readability issues related to their hospital discharge documents and with related information search on the Internet. Task 2 then continues the information extraction work of the 2013 lab, specifically focusing on disorder attribute identification and normalization from clinical text. Finally, this year's Task 3 further extends the 2013 information retrieval task, by cleaning the 2013 document collection and introducing a new query generation method and multilingual queries. De-identified clinical reports used by the three tasks were from US intensive care and originated from the MIMIC II database. Other text documents for Tasks 1 and 3 were from the Internet and originated from the Khresmoi project. Task 2 annotations originated from the ShARe annotations. For Tasks 1 and 3, new annotations, queries, and relevance assessments were created. 50, 79, and 91 people registered their interest in Tasks 1, 2, and 3, respectively. 24 unique teams participated with 1, 10, and 14 teams in Tasks 1, 2 and 3, respectively.
The teams were from Africa, Asia, Canada, Europe, and North America. The Task 1 submission, reviewed by 5 expert peers, related to the task evaluation category of Effective use of interaction and targeted the needs of both expert and novice users. The best system had an Accuracy of 0.868 in Task 2a, an F1-score of 0.576 in Task 2b, and Precision at 10 (P@10) of 0.756 in Task 3. The results demonstrate the substantial community interest and capabilities of these systems in making clinical reports easier to understand for patients. The organisers have made data and tools available for future research and development.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Seasonal variations in the occurrence and abundance of penaeid prawn larvae in the Mandovi and Zuari estuaries of Goa were studied. Larvae and post-larvae of commercially important species viz. Metapenaeus dobsoni (Miers), M. affinis (H. Milne Edwards), M. monoceros (Fabricius), Penaeus merguiensis de Man and Parapenaeopsis stylifera (H. Milne Edwards) were recorded in that order of abundance. Protozoea and mysis stages were dominant in surface zooplankton collections while the post-larvae were more abundant in the bottom samples. Based on larval density, M. dobsoni appeared to be a continuous breeder. The active spawning periods in other species were during the late post-monsoon and pre-monsoon seasons, varying with the species. Peak recruitment of post-larvae in the estuaries was observed mostly during southwest monsoon months (June to September). Penaeid prawn larval ingression was greater in the Zuari estuary than in the Mandovi estuary. Their numerical abundance gradually decreased towards the upstream areas. The feasibility of large scale collection of penaeid prawn larvae for aquaculture is indicated.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words. Classically, documents and queries are represented as vectors of word counts. In its simplest form, relevance is defined to be the dot product between a document and a query vector, a measure of the number of common terms. A central difficulty in text retrieval is that the presence or absence of a word is not sufficient to determine relevance to a query. Linear dimensionality reduction has been proposed as a technique for extracting underlying structure from the document collection. In some domains (such as vision) dimensionality reduction reduces computational complexity. In text retrieval it is more often used to improve retrieval performance. We propose an alternative and novel technique that produces sparse representations constructed from sets of highly-related words. Documents and queries are represented by their distance to these sets, and relevance is measured by the number of common clusters. This technique significantly improves retrieval performance, is efficient to compute and shares properties with the optimal linear projection operator and the independent components of documents.
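
The classical baseline described in this abstract, word-count vectors scored by a dot product, can be sketched in a few lines of Python. This covers only that baseline, not the sparse cluster-based technique the authors propose; the helper names `count_vector` and `relevance` and the two sample documents are illustrative assumptions.

```python
from collections import Counter


def count_vector(text):
    """Represent a text as a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def relevance(doc, query):
    """Dot product of two sparse word-count vectors."""
    # Counter returns 0 for terms the document does not contain.
    return sum(count * doc[term] for term, count in query.items())


# Hypothetical two-document collection, for illustration only.
docs = {
    "d1": "clustering a large document collection",
    "d2": "penaeid prawn larvae in the estuary",
}
query = count_vector("document clustering")
scores = {name: relevance(count_vector(text), query) for name, text in docs.items()}
```

Here `d1` scores higher because it shares two terms with the query, while `d2` shares none. A document that used "grouping" instead of "clustering" would lose that term's contribution entirely, which is the limitation of pure term matching that the abstract points to.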

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The method described here cannot fully replace the analysis of large columns by small test columns (microcolumns). The procedure, however, is suitable for speeding up the determination of adsorption parameters of dye onto the adsorbent and for speeding up the initial screening of a large adsorbent collection, which can be tedious if several adsorbents and adsorption conditions must be tested. The performance of methylene blue (MB), a basic dye, Cibacron reactive black (RB) and Cibacron reactive yellow (RY) was predicted in this way, and the influence of initial dye concentration and other adsorption conditions on the adsorption behaviour was demonstrated.