970 results for Datasets


Abstract:

An RkNN query returns all objects whose k nearest neighbors
include the query object. In this paper, we consider RkNN
query processing in the case where the distances between
attribute values are not necessarily metric. Dissimilarities
between objects could then be a monotonic aggregate of dissimilarities
between their values, such aggregation functions
being specified at query time. We outline real world cases
that motivate RkNN processing in such scenarios. We consider
the AL-Tree index and its applicability in RkNN query
processing. We develop an approach that exploits the group
level reasoning enabled by the AL-Tree in RkNN processing.
We evaluate our approach against a naive approach
that performs sequential scans on contiguous data and an
improved block-based approach that we provide. We use
real-world datasets and synthetic data with varying characteristics
for our experiments. This extensive empirical
evaluation shows that our approach is better than existing
methods in terms of computational and disk access costs,
leading to significantly better response times.
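
As a point of reference, the naive baseline described above can be sketched in a few lines. This is a minimal illustration, not the AL-Tree method: the function names, the per-attribute dissimilarities (squared difference, category mismatch) and the use of `sum` as the monotonic aggregate are all assumptions chosen for the example.

```python
from typing import Callable, List, Sequence


def aggregate_dissimilarity(a: Sequence, b: Sequence,
                            attr_dissims: Sequence[Callable],
                            aggregate: Callable) -> float:
    """Combine per-attribute dissimilarities with a monotonic aggregate."""
    return aggregate([d(x, y) for d, x, y in zip(attr_dissims, a, b)])


def naive_rknn(query: Sequence, objects: List[Sequence], k: int,
               dissim: Callable) -> List[int]:
    """Sequential-scan RkNN: indices of objects that have the query among
    their k nearest neighbours. No metric property is assumed."""
    result = []
    for i, obj in enumerate(objects):
        d_query = dissim(obj, query)
        # Count the other objects that are strictly closer to obj than the query.
        closer = sum(1 for j, other in enumerate(objects)
                     if j != i and dissim(obj, other) < d_query)
        if closer < k:
            result.append(i)
    return result


# Hypothetical attribute dissimilarities; squared difference is not a metric.
attr_dissims = [lambda x, y: (x - y) ** 2, lambda x, y: 0 if x == y else 1]


def dissim(a, b):
    # The aggregate (here: sum) would be supplied at query time.
    return aggregate_dissimilarity(a, b, attr_dissims, sum)


objects = [(0, "a"), (1, "a"), (5, "b"), (6, "b")]
print(naive_rknn((2, "a"), objects, k=1, dissim=dissim))  # [1]
```

The scan costs a quadratic number of dissimilarity evaluations per query, which is the cost an index with group-level reasoning would aim to avoid.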

Abstract:

Most traditional data mining algorithms struggle to cope efficiently with the sheer scale of data. In this paper, we propose a general framework to accelerate existing clustering algorithms so that they can cluster large-scale datasets containing large numbers of attributes, items, and clusters. Our framework makes use of locality sensitive hashing (LSH) to significantly reduce the cluster search space. We also theoretically prove that our framework has a guaranteed error bound in terms of the clustering quality. This framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to present how the framework can be applied. We validated our framework with five synthetic datasets and a real-world Yahoo! Answers dataset. The experimental results demonstrate that our framework is able to speed up the existing clustering algorithm by a factor of between 2 and 6, while maintaining comparable cluster purity.
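
The idea of using LSH to shrink the cluster search space can be illustrated with a small sketch. It assumes MinHash signatures with banding as the LSH scheme and Hamming distance as the K-Modes dissimilarity; the abstract does not say which hash family the authors use, and every function name here is hypothetical.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    """Deterministic hash of a token under a given seed."""
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)


def minhash_signature(record, num_hashes=8):
    """MinHash signature over the record's attribute=value tokens."""
    tokens = [f"{i}={v}" for i, v in enumerate(record)]
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def band_keys(signature, bands=4):
    """Split a signature into bands; two records sharing any band collide."""
    rows = len(signature) // bands
    return {(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)}


def candidate_clusters(item, modes):
    """Indices of cluster modes that share at least one LSH bucket with item."""
    item_keys = band_keys(minhash_signature(item))
    return [c for c, mode in enumerate(modes)
            if item_keys & band_keys(minhash_signature(mode))]


def hamming(a, b):
    """Number of mismatching categorical attributes."""
    return sum(x != y for x, y in zip(a, b))


def assign(item, modes):
    """Assign item to the nearest mode, searching only the LSH candidates
    and falling back to all modes when no bucket is shared."""
    candidates = candidate_clusters(item, modes) or range(len(modes))
    return min(candidates, key=lambda c: hamming(item, modes[c]))


modes = [("red", "small", "round"), ("blue", "large", "square")]
print(assign(("blue", "large", "square"), modes))  # 1
```

In a full implementation the band keys of the modes would be stored in hash tables and refreshed when modes change, rather than recomputed for every item as in this sketch.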

Abstract:

Learning or writing regular expressions to identify instances of a specific
concept within text documents with a high precision and recall is challenging.
It is relatively easy to improve the precision of an initial regular expression
by identifying false positives covered and tweaking the expression to avoid the
false positives. However, modifying the expression to improve recall is difficult
since false negatives can only be identified by manually analyzing all documents,
in the absence of any tools to identify the missing instances. We focus on partially
automating the discovery of missing instances by soliciting minimal user
feedback. We present a technique to identify good generalizations of a regular
expression that have improved recall while retaining high precision. We empirically
demonstrate the effectiveness of the proposed technique as compared to
existing methods and show results for a variety of tasks such as identification of
dates, phone numbers, product names, and course numbers on real-world datasets.
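
The precision/recall trade-off described above can be made concrete with a small measurement helper. This sketch only evaluates a given expression against hand-labelled matches; it is not the generalization technique of the paper, and the documents, gold labels and patterns are invented for illustration.

```python
import re


def precision_recall(pattern, documents, gold):
    """Precision and recall of a regex against gold (doc_index, text) matches."""
    regex = re.compile(pattern)
    predicted = {(i, m.group(0))
                 for i, doc in enumerate(documents)
                 for m in regex.finditer(doc)}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


documents = ["Call 555-1234 or 555 9876 today", "Invoice 123-4567 is due"]
gold = {(0, "555-1234"), (0, "555 9876")}  # phone numbers only

initial = r"\b\d{3}-\d{4}\b"         # misses the space-separated number
generalized = r"\b\d{3}[- ]\d{4}\b"  # also accepts a space as separator

print(precision_recall(initial, documents, gold))      # (0.5, 0.5)
print(precision_recall(generalized, documents, gold))
```

In this toy case the generalized expression recovers the missed instance, so recall rises to 1.0, while the invoice number remains a false positive for both expressions.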

Abstract:

Textual problem-solution repositories are available today in
various forms, most commonly as problem-solution pairs from community
question answering systems. Modern search engines that operate on
the web can suggest possible completions in real-time for users as they
type in queries. We study the problem of generating intelligent query
suggestions for users of customized search systems that enable querying
over problem-solution repositories. Due to the small scale and specialized
nature of such systems, we often do not have the luxury of depending on
query logs for finding query suggestions. We propose a retrieval model
for generating query suggestions for search on a set of problem solution
pairs. We harness the problem solution partition inherent in such
repositories to improve upon traditional query suggestion mechanisms
designed for systems that search over general textual corpora. We evaluate
our technique over real problem-solution datasets and illustrate that
our technique provides large and statistically significant

Abstract:

Online forums are becoming a popular way of finding useful
information on the web. Search over forums for existing discussion
threads so far is limited to keyword-based search due
to the minimal effort required on part of the users. However,
it is often not possible to capture all the relevant context in a
complex query using a small number of keywords. Example-based
search that retrieves similar discussion threads given
one exemplary thread is an alternate approach that can help
the user provide richer context and vastly improve forum
search results. In this paper, we address the problem of
finding similar threads to a given thread. Towards this, we
propose a novel methodology to estimate similarity between
discussion threads. Our method exploits the thread structure
to decompose threads into a set of weighted overlapping
components. It then estimates pairwise thread similarities
by quantifying how well the information in the threads is
mutually contained within each other using lexical similarities
between their underlying components. We compare our
proposed methods on real datasets against state-of-the-art
thread retrieval mechanisms wherein we illustrate that our
techniques outperform others by large margins on popular
retrieval evaluation measures such as NDCG, MAP, Precision@k
and MRR. In particular, consistent improvements of
up to 10% are observed on all evaluation measures.
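
For readers unfamiliar with the evaluation measures named above, the following sketch computes Precision@k, reciprocal rank and NDCG@k for a single ranked list. It assumes binary relevance judgments and the common log2 discount for NDCG; the thread identifiers are made up. MRR and MAP are obtained by averaging per-query values over all queries.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains and a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["t3", "t1", "t7", "t2"]  # hypothetical retrieved thread ids
relevant = {"t1", "t2"}            # hypothetical relevance judgments

print(precision_at_k(ranked, relevant, 2))  # 0.5
print(reciprocal_rank(ranked, relevant))    # 0.5
print(ndcg_at_k(ranked, relevant, 4))
```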

Abstract:

Mining seafloor massive sulfides for metals is an emergent industry faced with environmental management challenges. These revolve largely around limits to our current understanding of biological variability in marine systems, a challenge common to all marine environmental management. VentBase was established as a forum where academic, commercial, governmental, and non-governmental stakeholders can develop a consensus regarding the management of exploitative activities in the deep-sea. Participants advocate a precautionary approach with the incorporation of lessons learned from coastal studies. This workshop report from VentBase encourages the standardization of sampling methodologies for deep-sea environmental impact assessment. VentBase stresses the need for the collation of spatial data and importance of datasets amenable to robust statistical analyses. VentBase supports the identification of set-asides to prevent the local extirpation of vent-endemic communities and for the post-extraction recolonization of mine sites. © 2013.

Abstract:

In this paper, a novel and effective lip-based biometric identification approach with the Discrete Hidden Markov Model Kernel (DHMMK) is developed. Lips are described by shape features (both geometrical and sequential) on two different grid layouts: rectangular and polar. These features are then specifically modeled by a DHMMK, and learnt by a support vector machine classifier. Our experiments are carried out in a ten-fold cross validation fashion on three different datasets: the GPDS-ULPGC Face Dataset, the PIE Face Dataset and the RaFD Face Dataset. Results show that our approach achieves an average classification accuracy of 99.8%, 97.13%, and 98.10%, using only two training images per class, on these three datasets, respectively. Our comparative studies further show that the DHMMK achieved a 53% improvement against the baseline HMM approach. The comparative ROC curves also confirm the efficacy of the proposed lip contour based biometrics learned by DHMMK. We also show that the performance of linear and RBF SVM is comparable under the framework of DHMMK.

Abstract:

This paper is concerned with the application of an automated hybrid approach in addressing the university timetabling problem. The approach described is based on the nature-inspired artificial bee colony (ABC) algorithm. An ABC algorithm is a biologically-inspired optimization approach, which has been widely implemented in solving a range of optimization problems in recent years such as job shop scheduling and machine timetabling problems. Although the approach has proven to be robust across a range of problems, it is acknowledged within the literature that there currently exist a number of inefficiencies regarding the exploration and exploitation abilities. These inefficiencies can often lead to a slow convergence speed within the search process. Hence, this paper introduces a variant of the algorithm which utilizes a global best model inspired by particle swarm optimization to enhance the global exploration ability while hybridizing with the great deluge (GD) algorithm in order to improve the local exploitation ability. Using this approach, an effective balance between exploration and exploitation is attained. In addition, a traditional local search approach is incorporated within the GD algorithm with the aim of further enhancing the performance of the overall hybrid method. To evaluate the performance of the proposed approach, two diverse university timetabling datasets are investigated, i.e., Carter's examination timetabling and Socha course timetabling datasets. It should be noted that the two problems differ in complexity and have different solution landscapes. Experimental results demonstrate that the proposed method is capable of producing high-quality solutions across both these benchmark problems, showing a good degree of generality in the approach. Moreover, the proposed method produces the best results on some instances as compared with other approaches presented in the literature.
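
The great deluge component can be illustrated in isolation. The sketch below is a generic great deluge loop for minimisation, not the hybrid ABC method of the paper: the linear decay of the water level, the toy cost function and all names are assumptions made for the example.

```python
import random


def great_deluge(initial, cost, neighbour, iterations, decay_rate, seed=0):
    """Generic great deluge search (minimisation).

    A candidate is accepted if it is no worse than the current solution
    or if its cost does not exceed the current water level, which is
    lowered by decay_rate after every iteration."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    level = current_cost  # the water level starts at the initial cost
    for _ in range(iterations):
        candidate = neighbour(current, rng)
        candidate_cost = cost(candidate)
        if candidate_cost <= current_cost or candidate_cost <= level:
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        level -= decay_rate
    return best, best_cost


def toy_cost(x):
    """Hypothetical stand-in for a timetable penalty function."""
    return (x - 3) ** 2


def toy_neighbour(x, rng):
    """Hypothetical move operator: step one unit left or right."""
    return x + rng.choice((-1, 1))


best, best_cost = great_deluge(20, toy_cost, toy_neighbour,
                               iterations=500, decay_rate=1.0)
print(best, best_cost)
```

Early on, the high water level lets the search accept worsening moves (exploration); as the level falls, only improving or equal moves survive (exploitation). In a timetabling setting the solution would be a full timetable and the neighbour function a move such as swapping two events.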

Abstract:

HOX genes are master regulators of organ morphogenesis and cell differentiation during embryonic development, and continue to be expressed throughout post-natal life. To test the hypothesis that HOX genes are dysregulated in head and neck squamous cell carcinoma (HNSCC) we defined their expression profile, and investigated the function, transcriptional regulation and clinical relevance of a subset of highly expressed HOXD genes. Two HOXD genes, D10 and D11, showed strikingly high levels in HNSCC cell lines, patient tumor samples and publicly available datasets. Knockdown of HOXD10 in HNSCC cells caused decreased proliferation and invasion, whereas knockdown of HOXD11 reduced only invasion. POU2F1 consensus sequences were identified in the 5' DNA of HOXD10 and D11. Knockdown of POU2F1 significantly reduced expression of HOXD10 and D11 and inhibited HNSCC proliferation. Luciferase reporter constructs of the HOXD10 and D11 promoters confirmed that POU2F1 consensus binding sites are required for optimal promoter activity. Utilizing patient tumor samples a significant association was found between immunohistochemical staining of HOXD10 and both the overall and the disease-specific survival, adding further support that HOXD10 is dysregulated in head and neck cancer. Additional studies are now warranted to fully evaluate HOXD10 as a prognostic tool in head and neck cancers.

Abstract:

The recent explosion of genetic and clinical data generated from tumor genome analysis presents an unparalleled opportunity to enhance our understanding of cancer, but this opportunity is compromised by the reluctance of many in the scientific community to share datasets and the lack of interoperability between different data platforms. The Global Alliance for Genomics and Health is addressing these barriers and challenges through a cooperative framework that encourages "team science" and responsible data sharing, complemented by the development of a series of application program interfaces that link different data platforms, thus breaking down traditional silos and liberating the data to enable new discoveries and ultimately benefit patients.

Abstract:

Generating timetables for an institution is a challenging and time-consuming task due to different demands on the overall structure of the timetable. In this paper, a new hybrid method which is a combination of a great deluge and artificial bee colony algorithm (INMGD-ABC) is proposed to address the university timetabling problem. The artificial bee colony algorithm (ABC) is a population-based method that has been introduced in recent years and has proven successful in solving various optimization problems effectively. However, as with many search-based approaches, there exist weaknesses in the exploration and exploitation abilities which tend to induce slow convergence of the overall search process. Therefore, hybridization is proposed to compensate for the identified weaknesses of the ABC. Also, inspired by imperialist competitive algorithms, an assimilation policy is implemented in order to improve the global exploration ability of the ABC algorithm. In addition, the Nelder–Mead simplex search method is incorporated within the great deluge algorithm (NMGD) with the aim of enhancing the exploitation ability of the hybrid method in fine-tuning the problem search region. The proposed method is tested on two differing benchmark datasets, i.e., examination and course timetabling datasets. A statistical t-test has been conducted and shows that the proposed approach performs significantly better than the basic ABC algorithm. Finally, the experimental results are compared against state-of-the-art methods in the literature; the results obtained are competitive and in certain cases match some of the best results currently reported in the literature.

Abstract:

PURPOSE: The purpose of this study is to establish the prevalence of potentially inappropriate prescribing (PIP) in middle-aged adults (45-64 years) in two populations with differing socio-economic profiles, and to investigate factors associated with PIP, using the PROMPT (PRescribing Optimally in Middle-aged People's Treatments) criteria. METHODS: A retrospective cross-sectional study was conducted using 2012 data from the Enhanced Prescribing Database (EPD), covering the full population in Northern Ireland and the Health Services Executive Primary Care Reimbursement Service (HSE-PCRS) database, covering the most socio-economically deprived third of the population in this age group in the Republic of Ireland. The prevalence for each PROMPT criterion and overall prevalence of PIP were calculated. Logistic regression was used to investigate the association between PIP and gender, age group and polypharmacy. RESULTS: This study included 441,925 patients from the EPD and 309,748 patients from the HSE-PCRS database. Polypharmacy was common in both datasets (46.7% in the HSE-PCRS and 20.3% in the EPD). The prevalence of PIP was 42.9% (95% CI 42.7, 43.1) in the HSE-PCRS and 21.1% (95% CI 21.0, 21.2) in the EPD. Age group, female gender and polypharmacy were significantly associated with PIP in both populations (p < 0.05) and polypharmacy had the strongest association. CONCLUSIONS: PIP is common amongst middle-aged people with the risk of PIP increasing with polypharmacy. Differences in the prevalence of polypharmacy and PIP between the two populations may relate to heterogeneity in healthcare services and different socio-economic profiles, with higher rates of multimorbidity and associated polypharmacy in more deprived groups.
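
The reported confidence intervals can be checked with a short calculation. The abstract does not state which interval method was used; the sketch below assumes the normal-approximation (Wald) interval for a proportion, which reproduces the published bounds after rounding to one decimal place. The function name is illustrative.

```python
import math


def wald_ci_percent(prevalence_percent, n, z=1.96):
    """Normal-approximation 95% CI for a prevalence given in percent."""
    p = prevalence_percent / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n)
    lower = 100.0 * (p - z * standard_error)
    upper = 100.0 * (p + z * standard_error)
    return lower, upper


# Prevalence and sample size as reported in the abstract.
hse_low, hse_high = wald_ci_percent(42.9, 309_748)
epd_low, epd_high = wald_ci_percent(21.1, 441_925)

print(round(hse_low, 1), round(hse_high, 1))  # 42.7 43.1
print(round(epd_low, 1), round(epd_high, 1))  # 21.0 21.2
```

With samples this large the intervals are narrow (about ±0.2 percentage points or less), so the gap between the two populations is far larger than the sampling uncertainty.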

Abstract:

Time-domain modelling of single-reed woodwind instruments usually involves a lumped model of the excitation mechanism. The parameters of this lumped model have to be estimated for use in numerical simulations. Several attempts have been made to estimate these parameters, including observations of the mechanics of isolated reeds, measurements under artificial or real playing conditions and estimations based on numerical simulations. In this study an optimisation routine is presented that can estimate reed-model parameters, given the pressure and flow signals in the mouthpiece. The method is validated by testing it on numerically synthesised data. In order to incorporate the actions of the player in the parameter estimation process, the optimisation routine has to be applied to signals obtained under real playing conditions. The estimated parameters can then be used to resynthesise the pressure and flow signals in the mouthpiece. In the case of measured data, as opposed to numerically synthesised data, special care needs to be taken while modelling the bore of the instrument. In fact, a careful study of various experimental datasets revealed that for resynthesis to work, the bore termination impedance should be known very precisely from theory. An example is given, where the above requirement is satisfied, and the resynthesised signals closely match the original signals generated by the player.

Abstract:

Aim: Our primary aim is to understand how assemblages of rare (restricted range) and common (widespread) species are correlated with each other among different taxa. We tested the proposition that marine species richness patterns of rare and common species differ, both within a taxon in their contribution to the richness pattern of the full assemblage and among taxa in the strength of their correlations with each other. Location: The UK intertidal zone. Methods: We used high-resolution marine datasets for UK intertidal macroalgae, molluscs and crustaceans, each with more than 400 species. We estimated the relative contribution of rare and common species, treating rarity and commonness as a continuous spectrum, to spatial patterns in richness using spatial cross-correlations. Correlation strength and significance were estimated both within and between taxa. Results: Common species drove richness patterns within taxa, but rare species contributed more when species were placed on an equal footing via scaling by binomial variance. Between taxa, relatively small sub-assemblages (fewer than 60 species) of common species produced the maximum correlation with each other, regardless of taxon pairing. Cross-correlations between rare species were generally weak, with maximum correlation occurring between small sub-assemblages in only one case. Cross-correlations between common and rare species of different taxa were consistently weak or absent. Main conclusions: Common species in the three marine assemblages were congruent in their richness patterns, but rare species were generally not. The contrast between the stronger correlations among common species and the weak or absent correlations among rare species indicates a decoupling of the processes driving common and rare species richness patterns. The internal structure of richness patterns of these marine taxa is similar to that observed for terrestrial taxa.

Abstract:

In this paper we propose a novel recurrent neural network architecture for video-based person re-identification. Given the video sequence of a person, features are extracted from each frame using a convolutional neural network that incorporates a recurrent final layer, which allows information to flow between time-steps. The features from all time-steps are then combined using temporal pooling to give an overall appearance feature for the complete sequence. The convolutional network, recurrent layer, and temporal pooling layer are jointly trained to act as a feature extractor for video-based re-identification using a Siamese network architecture. Our approach makes use of colour and optical flow information in order to capture appearance and motion information which is useful for video re-identification. Experiments are conducted on the iLIDS-VID and PRID-2011 datasets to show that this approach outperforms existing methods of video-based re-identification.

https://github.com/niallmcl/Recurrent-Convolutional-Video-ReID
Project Source Code
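
The temporal pooling step in the re-identification abstract above can be sketched independently of the linked project code. The abstract does not specify the pooling operator; this example assumes mean pooling over time-steps and a Euclidean distance between pooled sequence features, as a Siamese comparison might use. The feature values are invented and the function names are hypothetical.

```python
import math


def temporal_mean_pool(frame_features):
    """Average per-frame feature vectors into one sequence-level feature."""
    if not frame_features:
        raise ValueError("at least one frame feature is required")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / num_frames
            for d in range(dim)]


def euclidean_distance(a, b):
    """Distance between two sequence-level features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical per-frame features for two short video sequences.
sequence_a = [[1.0, 2.0], [3.0, 4.0]]
sequence_b = [[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]]

pooled_a = temporal_mean_pool(sequence_a)
pooled_b = temporal_mean_pool(sequence_b)
print(pooled_a)                                # [2.0, 3.0]
print(euclidean_distance(pooled_a, pooled_b))  # 0.0
```

Pooling yields a fixed-length feature regardless of sequence length, which is what allows sequences with different numbers of frames to be compared directly.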