9 results for Clustering a large document collection
in University of Queensland eSpace - Australia
Abstract:
Document ranking is an important process in information retrieval (IR). It presents retrieved documents in order of their estimated degree of relevance to the query. Traditional document ranking methods are mostly based on similarity computations between documents and the query. In this paper we argue that similarity-based document ranking is insufficient in some cases, for two reasons. The first is the increased variety of information: there are far too many different types of documents now available for users to search. The second is the variety of users: in many cases a user may want to retrieve documents that are not only similar but also general or broad regarding a certain topic. This is particularly the case in some domains such as bio-medical IR. In this paper we propose a novel approach to re-rank the retrieved documents by combining similarity with generality. Document generality can be quantified by an ontology-based analysis of the semantic cohesion of the text. The retrieved documents are then re-ranked by a combined score of their similarity and the closeness of each document's generality to the query's. Our experiments have shown encouraging performance on a large bio-medical document collection, OHSUMED, containing 348,566 medical journal references and 101 test queries.
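The re-ranking step described in this abstract can be illustrated with a minimal Python sketch. The abstract does not state how similarity and generality closeness are combined, so the weighted sum, the `alpha` weight, the [0, 1] scaling of both scores, and all names below are assumptions made for illustration only, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    similarity: float  # similarity to the query, assumed scaled to [0, 1]
    generality: float  # generality score, assumed scaled to [0, 1]


def rerank(docs, query_generality, alpha=0.5):
    """Re-rank docs by a weighted sum of similarity and generality closeness.

    The linear combination and the weight alpha are hypothetical choices.
    """
    def combined_score(doc):
        # Closeness is 1 when the document is exactly as general as the query.
        closeness = 1.0 - abs(doc.generality - query_generality)
        return alpha * doc.similarity + (1.0 - alpha) * closeness

    return sorted(docs, key=combined_score, reverse=True)


docs = [Doc("a", 0.9, 0.1), Doc("b", 0.8, 0.7)]
print([d.doc_id for d in rerank(docs, query_generality=0.7)])
```

With `alpha=1.0` the function falls back to pure similarity ranking, which makes it easy to compare the two orderings on the same result list.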
Abstract:
Recent efforts in the characterization of air-water flow properties have included some clustering process analysis. A cluster of bubbles is defined as a group of two or more bubbles, with a distinct separation from other bubbles before and after the cluster. The present paper compares the results of clustering process analyses in two hydraulic structures: a large-size dropshaft and a hydraulic jump in a rectangular horizontal channel. The comparison highlighted some significant differences in clustering production and structures. Both dropshaft and hydraulic jump flows are complex turbulent shear flows, and some clustering index may provide some measure of the bubble-turbulence interactions and associated energy dissipation.
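The cluster definition given in this abstract (two or more bubbles, separated distinctly from the bubbles before and after) can be sketched in Python as follows. The abstract does not say how "distinct separation" is measured, so the gap `threshold` and the representation of the signal as a list of water gaps between successive bubbles are assumptions for illustration.

```python
def find_clusters(gaps, threshold):
    """Group successive bubbles into clusters.

    gaps[i] is the water gap between bubble i and bubble i + 1, so there are
    len(gaps) + 1 bubbles. A gap >= threshold (a hypothetical criterion)
    counts as a distinct separation. Returns (first, last) bubble index pairs
    for every group containing at least two bubbles.
    """
    clusters = []
    start = 0
    last_bubble = len(gaps)
    for i, gap in enumerate(gaps):
        if gap >= threshold:
            # Bubbles start..i form a group; keep it only if it has 2+ bubbles.
            if i - start >= 1:
                clusters.append((start, i))
            start = i + 1
    # Close the final group, which runs to the last bubble.
    if last_bubble - start >= 1:
        clusters.append((start, last_bubble))
    return clusters


print(find_clusters([0.1, 5.0, 0.2, 0.1, 6.0], threshold=1.0))
```

Summary quantities such as the number of clusters or the mean number of bubbles per cluster can be derived directly from the returned index pairs.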
Abstract:
Rectangular dropshafts, commonly used in sewers and storm water systems, are characterised by significant flow aeration. New detailed air-water flow measurements were conducted in a near-full-scale dropshaft at large discharges. In the shaft pool and outflow channel, the results demonstrated the complexity of different competitive air entrainment mechanisms. Bubble size measurements showed a broad range of entrained bubble sizes. Analysis of streamwise distributions of bubbles further suggested some clustering process in the bubbly flow although, in the outflow channel, bubble chords were on average smaller than in the shaft pool. A robust hydrophone was tested to measure bubble acoustic spectra and to assess its field application potential. The acoustic results accurately characterised the order of magnitude of entrained bubble sizes, but the transformation from acoustic frequencies to bubble radii did not correctly predict the probability distribution functions of bubble sizes.
Abstract:
Using data from the H I Parkes All Sky Survey (HIPASS), we have searched for neutral hydrogen in galaxies in a region ~25 × 25 deg² centred on NGC 1399, the nominal centre of the Fornax cluster. Within a velocity search range of 300-3700 km s⁻¹ and to a 3σ lower flux limit of ~40 mJy, 110 galaxies with H I emission were detected, one of which is previously uncatalogued. None of the detections has early-type morphology. Previously unknown velocities for 14 galaxies have been determined, with a further four velocity measurements being significantly dissimilar to published values. Identification of an optical counterpart is relatively unambiguous for more than ~90 per cent of our H I galaxies. The galaxies appear to be embedded in a sheet at the cluster velocity which extends for more than 30° across the search area. At the nominal cluster distance of ~20 Mpc, this corresponds to an elongated structure more than 10 Mpc in extent. A velocity gradient across the structure is detected, with radial velocities increasing by ~500 km s⁻¹ from south-east to north-west. The clustering of galaxies evident in optical surveys is only weakly suggested in the spatial distribution of our H I detections. Of 62 H I detections within a 10° projected radius of the cluster centre, only two are within the core region (projected radius
Abstract:
Aim: To test the efficacy of a comprehensive health assessment using the CHAP tool in adults with an intellectual disability (ID). Method: A cluster randomised control design was used. The intervention group received the CHAP, while the control group received usual care. This tool directed carers to gather a health history, which was reviewed by the person's general practitioner (GP), who completed a medical examination and a healthcare plan. The tool acted as an advocacy tool and a ticket-of-entry to the GP's surgery, and educated the GP and the caregiver about the deficits in the healthcare of adults with ID. The healthcare of the participants was followed for one year after the intervention by collecting data from GP and service providers' notes. Interviews were also performed with all those involved. Results: We obtained a representative sample of adults with ID (RR%). We found the intervention group received a significant increase in many health promotion/disease prevention activities e.g. hearing screening was times and a Pap smear was times more likely to have occurred in the intervention groups. We also found a trend towards earlier detection of disease. Conclusions: The CHAP process improves the provision of health screening/promotion activities and should be implemented.
Abstract:
Objective: To document outcome and to investigate patterns of physical and psychosocial recovery in the first year following severe traumatic brain injury (TBI) in an Australian patient sample. Design: A longitudinal prospective study of a cohort of patients, with data collection at 3, 6, 9, and 12 months post injury. Setting: A head injury rehabilitation unit in a large metropolitan public hospital. Patients: A sample of 55 patients selected from 120 consecutive admissions with severe TBI. Patients who were more than 3 months post injury on admission, who remained confused, or who had severe communication deficits or a previous neurologic disorder were excluded. Interventions: All subjects participated in a multidisciplinary inpatient rehabilitation program, followed by varied participation in outpatient rehabilitation and community-based services. Main Outcome Measures: The Sickness Impact Profile (SIP) provided physical, psychosocial, and total dysfunction scores at each follow-up. Outcome at 1 year was measured by the Disability Rating Scale. Results: Multivariate analysis of variance indicated that the linear trend of recovery over time was less for psychosocial dysfunction than for physical dysfunction (F(1,51) = 5.87, P < .02). One year post injury, 22% of subjects had returned to their previous level of employability, and 42% were able to live independently. Conclusions: Recovery from TBI in this Australian sample followed a pattern similar to that observed in other countries, with psychosocial dysfunction being more persistent. Self-report measures such as the SIP in TBI research are limited by problems of diminished self-awareness.
Abstract:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Abstract:
The effect of number of samples and selection of data for analysis on the calculation of surface motor unit potential (SMUP) size in the statistical method of motor unit number estimates (MUNE) was determined in 10 normal subjects and 10 with amyotrophic lateral sclerosis (ALS). We recorded 500 sequential compound muscle action potentials (CMAPs) at three different stable stimulus intensities (10–50% of maximal CMAP). Estimated mean SMUP sizes were calculated using Poisson statistical assumptions from the variance of 500 sequential CMAP obtained at each stimulus intensity. The results with the 500 data points were compared with smaller subsets from the same data set. The results using a range of 50–80% of the 500 data points were compared with the full 500. The effect of restricting analysis to data between 5–20% of the CMAP and to standard deviation limits was also assessed. No differences in mean SMUP size were found with stimulus intensity or use of different ranges of data. Consistency was improved with a greater sample number. Data within 5% of CMAP size gave both increased consistency and reduced mean SMUP size in many subjects, but excluded valid responses present at that stimulus intensity. These changes were more prominent in ALS patients in whom the presence of isolated SMUP responses was a striking difference from normal subjects. Noise, spurious data, and large SMUP limited the Poisson assumptions. When these factors are considered, consistent statistical MUNE can be calculated from a continuous sequence of data points. A 2 to 2.5 SD or 10% window are reasonable methods of limiting data for analysis. Muscle Nerve 27: 320–331, 2003
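The Poisson calculation mentioned in this abstract can be sketched in Python. If each CMAP is assumed to be the sum of a Poisson-distributed number of equally sized SMUPs, its variance divided by its mean equals the SMUP size, which is the estimate computed below. The optional `baseline` subtraction, the helper names, and the form of the SD window are assumptions for illustration; the abstract does not give the exact procedure used in the study.

```python
from statistics import mean, pstdev, pvariance


def estimate_smup_size(cmap_amplitudes, baseline=0.0):
    """Estimate mean SMUP size as variance / mean of the CMAP amplitudes.

    Assumes Poisson statistics for the number of SMUPs in each response.
    The baseline argument is a hypothetical offset removed before the ratio.
    """
    shifted = [a - baseline for a in cmap_amplitudes]
    m = mean(shifted)
    if m <= 0:
        raise ValueError("mean amplitude above baseline must be positive")
    return pvariance(shifted) / m


def restrict_to_sd_window(values, k=2.0):
    """Keep only the values within k standard deviations of the mean."""
    m, sd = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) <= k * sd]


amplitudes = [20, 40, 40, 40, 50, 50, 70, 90]  # illustrative values only
print(estimate_smup_size(amplitudes))
print(estimate_smup_size(restrict_to_sd_window(amplitudes, k=1.5)))
```

Running the estimate on the full series and on a windowed subset, as in the last two lines, mirrors the kind of comparison the abstract describes between different ways of limiting the data.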
Abstract:
This paper discusses a document discovery tool based on Conceptual Clustering by Formal Concept Analysis. The program allows users to navigate e-mail using a visual lattice metaphor rather than a tree. It implements a virtual file structure over e-mail where files and entire directories can appear in multiple positions. The content and shape of the lattice formed by the conceptual ontology can assist in e-mail discovery. The system described provides more flexibility in retrieving stored e-mails than what is normally available in e-mail clients. The paper discusses how conceptual ontologies can leverage traditional document retrieval systems and aid knowledge discovery in document collections.
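To make the lattice idea concrete, the following Python sketch enumerates the formal concepts of a tiny, made-up e-mail context. It uses the standard observation that every concept intent is an intersection of object attribute sets; the mail names, the attribute labels, and the function name are hypothetical, and this is not the tool's actual implementation.

```python
def formal_concepts(context):
    """Return the set of (extent, intent) pairs of a formal context.

    context maps each object (here, an e-mail id) to its set of attributes.
    """
    all_attributes = set().union(*context.values())
    # Start from the full attribute set, then close under intersection
    # with each object's attribute set.
    intents = {frozenset(all_attributes)}
    for attrs in context.values():
        intents |= {frozenset(attrs) & intent for intent in intents}
    concepts = set()
    for intent in intents:
        # The extent is every object that carries all attributes of the intent.
        extent = frozenset(obj for obj, attrs in context.items() if intent <= attrs)
        concepts.add((extent, intent))
    return concepts


mails = {
    "mail1": {"work", "urgent"},
    "mail2": {"work"},
    "mail3": {"family"},
}
for extent, intent in sorted(formal_concepts(mails), key=lambda c: len(c[1])):
    print(sorted(extent), sorted(intent))
```

In this toy context `mail1` belongs to the extent of both the "work" concept and the "work, urgent" concept, which is the sense in which an item can appear in multiple positions of the lattice. The enumeration is exponential in the worst case, so it is only meant for small illustrative contexts.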