4 resultados para Online services using open-source NLP tools
Resumo:
Android is becoming ubiquitous and currently has the largest share of the mobile OS market with billions of application downloads from the official app market. It has also become the platform most targeted by mobile malware that are becoming more sophisticated to evade state-of-the-art detection approaches. Many Android malware families employ obfuscation techniques in order to avoid detection and this may defeat static analysis based approaches. Dynamic analysis on the other hand may be used to overcome this limitation. Hence in this paper we propose DynaLog, a dynamic analysis based framework for characterizing Android applications. The framework provides the capability to analyse the behaviour of applications based on an extensive number of dynamic features. It provides an automated platform for mass analysis and characterization of apps that is useful for quickly identifying and isolating malicious applications. The DynaLog framework leverages existing open source tools to extract and log high level behaviours, API calls, and critical events that can be used to explore the characteristics of an application, thus providing an extensible dynamic analysis platform for detecting Android malware. DynaLog is evaluated using real malware samples and clean applications demonstrating its capabilities for effective analysis and detection of malicious applications.
Resumo:
Here, we describe gene expression compositional assignment (GECA), a powerful, yet simple method based on compositional statistics that can validate the transfer of prior knowledge, such as gene lists, into independent data sets, platforms and technologies. Transcriptional profiling has been used to derive gene lists that stratify patients into prognostic molecular subgroups and assess biomarker performance in the pre-clinical setting. Archived public data sets are an invaluable resource for subsequent in silico validation, though their use can lead to data integration issues. We show that GECA can be used without the need for normalising expression levels between data sets and can outperform rank-based correlation methods. To validate GECA, we demonstrate its success in the cross-platform transfer of gene lists in different domains including: bladder cancer staging, tumour site of origin and mislabelled cell lines. We also show its effectiveness in transferring an epithelial ovarian cancer prognostic gene signature across technologies, from a microarray to a next-generation sequencing setting. In a final case study, we predict the tumour site of origin and histopathology of epithelial ovarian cancer cell lines. In particular, we identify and validate the commonly-used cell line OVCAR-5 as non-ovarian, being gastrointestinal in origin. GECA is available as an open-source R package.
Resumo:
Display At Your Own Risk (DAYOR) is a research-led exhibition experiment featuring digital surrogates of public domain works of art produced by cultural heritage institutions of international repute. The project includes a gallery exhibition, an open source version of that exhibition intended for public use, and two online publications: the Exhibition Catalogue, and a companion Metadata Book.
Resumo:
Background and aims: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: ‘semi-structured’ and ‘unstructured’. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.
Results: The best result of 99.4% accuracy – which included only one semi-structured report predicted as unstructured – was produced by the layout classifier with the k nearest algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports the performance ranged from 0.97 to 0.94 and from 0.92 to 0.83 in precision and recall, while for unstructured reports performance ranged from 0.91 to 0.64 and from 0.68 to 0.41 in precision and recall. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured.
Conclusions: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.