820 results for Data classification
Abstract:
Smart homes for the aging population have recently started attracting the attention of the research community. The "health state" of smart homes comprises many different levels; starting with the physical health of citizens, it also includes longer-term health norms and outcomes, as well as the arena of positive behavior changes. One of the problems of interest is to monitor the activities of daily living (ADL) of the elderly, aiming at their protection and well-being. For this purpose, we installed passive infrared (PIR) sensors to detect motion in a specific area inside a smart apartment and used them to collect a set of ADL. In a novel approach, we describe a technology that allows the ground truth collected in one smart home to train activity recognition systems for other smart homes. We asked the users to label all instances of all ADL only once and subsequently applied data mining techniques to cluster in-home sensor firings. Each cluster would therefore represent the instances of the same activity. Once the clusters were associated with their corresponding activities, our system was able to recognize future activities. To improve the activity recognition accuracy, our system preprocessed raw sensor data by identifying overlapping activities. To evaluate the recognition performance from a 200-day dataset, we implemented three different active learning classification algorithms and compared their performance: naive Bayesian (NB), support vector machine (SVM) and random forest (RF). Based on our results, the RF classifier recognized activities with an average specificity of 96.53%, a sensitivity of 68.49%, a precision of 74.41% and an F-measure of 71.33%, outperforming both the NB and SVM classifiers. Further clustering markedly improved the results of the RF classifier. An activity recognition system based on PIR sensors in conjunction with a clustering classification approach was able to detect ADL from datasets collected from different homes.
Thus, our PIR-based smart home technology could improve care and provide valuable information to better understand the functioning of our societies, as well as to inform both individual and collective action in a smart city scenario.
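The F-measure quoted for the RF classifier can be cross-checked from the reported precision and sensitivity. The sketch below assumes the F-measure is the usual harmonic mean of precision and recall (F1), which the abstract does not state explicitly; the function name is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Figures reported for the RF classifier in the abstract above.
rf_precision = 0.7441
rf_sensitivity = 0.6849  # sensitivity is another name for recall

rf_f = f_measure(rf_precision, rf_sensitivity)
print(f"F-measure from reported precision and sensitivity: {rf_f:.2%}")
```

Under that assumption the recomputed value agrees with the reported 71.33% to two decimal places.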
Abstract:
Background The RCSB Protein Data Bank (PDB) provides public access to experimentally determined 3D-structures of biological macromolecules (proteins, peptides and nucleic acids). While various tools are available to explore the PDB, options to access the global structural diversity of the entire PDB and to perceive relationships between PDB structures remain very limited. Methods A 136-dimensional atom pair 3D-fingerprint for proteins (3DP) counting categorized atom pairs at increasing through-space distances was designed to represent the molecular shape of PDB-entries. Nearest neighbor search examples were reported, exemplifying the ability of 3DP-similarity to identify closely related biomolecules from small peptides to enzymes and large multiprotein complexes such as virus particles. Principal component analysis was used to visualize the PDB in 3DP-space. Results The 3DP property space groups proteins and protein assemblies according to their 3D-shape similarity, yet shows exquisite ability to distinguish between closely related structures. An interactive website called PDB-Explorer is presented featuring a color-coded interactive map of PDB in 3DP-space. Each pixel of the map contains one or more PDB-entries which are directly visualized as ribbon diagrams when the pixel is selected. The PDB-Explorer website allows performing 3DP-nearest neighbor searches of any PDB-entry or of any structure uploaded as protein-type PDB file. All functionalities on the website are implemented in JavaScript in a platform-independent manner and draw data from a server that is updated daily with the latest PDB additions, ensuring complete and up-to-date coverage. The essentially instantaneous 3DP-similarity search with the PDB-Explorer provides results comparable to those of much slower 3D-alignment algorithms, and automatically clusters proteins from the same superfamilies in tight groups.
Conclusion A chemical space classification of PDB based on molecular shape was obtained using a new atom-pair 3D-fingerprint for proteins and implemented in a web-based database exploration tool comprising an interactive color-coded map of the PDB chemical space and a nearest neighbor search tool. The PDB-Explorer website is freely available at www.cheminfo.org/pdbexplorer and represents an unprecedented opportunity to interactively visualize and explore the structural diversity of the PDB.
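To make the idea of a distance-binned atom-pair fingerprint and a nearest neighbor search concrete, here is a toy sketch. It is not the 3DP fingerprint itself: the bin edges, the absence of atom categories, the city-block distance and all names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical upper edges of the distance bins, in angstroms. The real 3DP
# has 136 dimensions built from categorized atom pairs; this toy has four.
BIN_EDGES = [2.0, 4.0, 8.0, 16.0]

def pair_fingerprint(coords):
    """Count atom pairs falling into each through-space distance bin."""
    counts = [0] * len(BIN_EDGES)
    for a, b in combinations(coords, 2):
        d = dist(a, b)
        for i, edge in enumerate(BIN_EDGES):
            if d <= edge:
                counts[i] += 1
                break  # pairs farther apart than the last edge are ignored
    return counts

def city_block(fp_a, fp_b):
    """City-block (L1) distance; an assumed choice of similarity measure."""
    return sum(abs(x - y) for x, y in zip(fp_a, fp_b))

def nearest_neighbors(query_fp, database, k=1):
    """Return the names of the k database entries closest to the query."""
    ranked = sorted(database, key=lambda name: city_block(query_fp, database[name]))
    return ranked[:k]

# Toy "structures": an extended chain, a translated copy, and a compact shape.
linear = [(0, 0, 0), (1.5, 0, 0), (3.0, 0, 0), (4.5, 0, 0)]
linear_shifted = [(10, 0, 0), (11.5, 0, 0), (13.0, 0, 0), (14.5, 0, 0)]
compact = [(0, 0, 0), (1.5, 0, 0), (0, 1.5, 0), (0, 0, 1.5)]

database = {
    "linear_shifted": pair_fingerprint(linear_shifted),
    "compact": pair_fingerprint(compact),
}
hits = nearest_neighbors(pair_fingerprint(linear), database)
print(hits)
```

Because the fingerprint depends only on interatomic distances, the translated copy has an identical fingerprint and is returned as the nearest neighbor, which illustrates why such a search needs no structural alignment.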
Abstract:
STUDY QUESTION How comprehensive is the recently published European Society of Human Reproduction and Embryology (ESHRE)/European Society for Gynaecological Endoscopy (ESGE) classification system of female genital anomalies? SUMMARY ANSWER The ESHRE/ESGE classification provides a comprehensive description and categorization of almost all of the currently known anomalies that could not be classified properly with the American Fertility Society (AFS) system. WHAT IS KNOWN ALREADY Until now, the more accepted classification system, namely that of the AFS, is associated with serious limitations in effective categorization of female genital anomalies. Many cases published in the literature could not be properly classified using the AFS system, yet a clear and accurate classification is a prerequisite for treatment. STUDY DESIGN, SIZE AND DURATION The CONUTA (CONgenital UTerine Anomalies) ESHRE/ESGE group conducted a systematic review of the literature to examine if those types of anomalies that could not be properly classified with the AFS system could be effectively classified with the use of the new ESHRE/ESGE system. An electronic literature search through Medline, Embase and Cochrane library was carried out from January 1988 to January 2014. Three participants independently screened, selected articles of potential interest and finally extracted data from all the included studies. Any disagreement was discussed and resolved after consultation with a fourth reviewer and the results were assessed independently and approved by all members of the CONUTA group. PARTICIPANTS/MATERIALS, SETTING, METHODS Among the 143 articles assessed in detail, 120 were finally selected reporting 140 cases that could not properly fit into a specific class of the AFS system. Those 140 cases were clustered in 39 different types of anomalies. 
MAIN RESULTS AND THE ROLE OF CHANCE The congenital anomaly involved a single organ in 12 (30.8%) out of the 39 types of anomalies, while multiple organs and/or segments of Müllerian ducts (complex anomaly) were involved in 27 (69.2%) types. Uterus was the organ most frequently involved (30/39: 76.9%), followed by cervix (26/39: 66.7%) and vagina (23/39: 59%). In all 39 types, the ESHRE/ESGE classification system provided a comprehensive description of each single or complex anomaly. A precise categorization was reached in 38 out of 39 types studied. Only one case of a bizarre uterine anomaly, with no clear embryological defect, could not be categorized and thus was placed in Class 6 (un-classified) of the ESHRE/ESGE system. LIMITATIONS, REASONS FOR CAUTION The review of the literature was thorough but we cannot rule out the possibility that other defects exist which will also require testing in the new ESHRE/ESGE system. These anomalies, however, must be rare. WIDER IMPLICATIONS OF THE FINDINGS The comprehensiveness of the ESHRE/ESGE classification adds objective scientific validity to its use. This may, therefore, promote its further dissemination and acceptance, which will have a positive outcome in clinical care and research. STUDY FUNDING/COMPETING INTERESTS None.
Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network
Abstract:
Automated tissue characterization is one of the most crucial components of a computer aided diagnosis (CAD) system for interstitial lung diseases (ILDs). Although much research has been conducted in this field, the problem remains challenging. Deep learning techniques have recently achieved impressive results in a variety of computer vision problems, raising expectations that they might be applied in other domains, such as medical image analysis. In this paper, we propose and evaluate a convolutional neural network (CNN), designed for the classification of ILD patterns. The proposed network consists of 5 convolutional layers with 2×2 kernels and LeakyReLU activations, followed by average pooling with size equal to the size of the final feature maps and three dense layers. The last dense layer has 7 outputs, equivalent to the classes considered: healthy, ground glass opacity (GGO), micronodules, consolidation, reticulation, honeycombing and a combination of GGO/reticulation. To train and evaluate the CNN, we used a dataset of 14696 image patches, derived from 120 CT scans from different scanners and hospitals. To the best of our knowledge, this is the first deep CNN designed for the specific problem. A comparative analysis proved the effectiveness of the proposed CNN against previous methods on a challenging dataset. The classification performance (~85.5%) demonstrated the potential of CNNs in analyzing lung patterns. Future work includes extending the CNN to three-dimensional data provided by CT volume scans and integrating the proposed method into a CAD system that aims to provide differential diagnosis for ILDs as a supportive tool for radiologists.
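The phrase "average pooling with size equal to the size of the final feature maps" amounts to reducing each feature map to a single number. The sketch below traces the spatial size through five 2×2 convolutions and then pools one map. The patch size of 32, the stride of 1 and the lack of padding are assumptions, since the abstract gives none of them.

```python
def conv_output_size(size, kernel=2, stride=1, padding=0):
    """Spatial size after one convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1

PATCH_SIZE = 32  # hypothetical input patch width/height, not from the abstract

size = PATCH_SIZE
for _ in range(5):  # five convolutional layers with 2x2 kernels
    size = conv_output_size(size)
print(f"final feature map: {size}x{size}")

def global_average_pool(feature_map):
    """Average over the whole map, i.e. pooling window == feature map size."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

pooled = global_average_pool([[1.0, 3.0], [5.0, 7.0]])
print(pooled)
```

With these assumed settings each layer trims one pixel per dimension, and the pooling step leaves one value per feature map to feed the dense layers.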
Abstract:
METHODS Spirometry datasets from South-Asian children were collated from four centres in India and five within the UK. Records with transcription errors, missing values for height or spirometry, and implausible values were excluded (n = 110). RESULTS Following exclusions, cross-sectional data were available from 8,124 children (56.3% male; 5-17 years). When compared with GLI-predicted values from White Europeans, forced expired volume in 1 s (FEV1) and forced vital capacity (FVC) in South-Asian children were on average 15% lower, ranging from 4-19% between centres. By contrast, proportional reductions in FEV1 and FVC within all but two datasets meant that the FEV1/FVC ratio remained independent of ethnicity. The 'GLI-Other' equation fitted data from North India reasonably well while 'GLI-Black' equations provided a better approximation for South-Asian data than the 'GLI-White' equation. However, marked discrepancies in the mean lung function z-scores between centres, especially when examined according to socio-economic conditions, precluded derivation of a single South-Asian GLI-adjustment. CONCLUSION Until improved and more robust prediction equations can be derived, we recommend the use of 'GLI-Black' equations for interpreting most South-Asian data, although 'GLI-Other' may be more appropriate for North Indian data. Prospective data collection using standardised protocols to explore potential sources of variation due to socio-economic circumstances, secular changes in growth/predictors of lung function and ethnicities within the South-Asian classification is urgently required.
Abstract:
Material Safety Data Sheets (MSDSs) are an integral component of occupational hazard communication systems. These documents are used to disseminate hazard information to workers on chemical substances. The primary purpose of this study was to investigate the comprehensibility of MSDSs by workers at an international level. A total of 117 employees of a multi-national petrochemical company participated; thirty-nine (39) each in the United States, Canada and the United Kingdom. Overall participation rate of those approached to participate was 82%. These countries were selected as they each utilize one of the three major existing hazard communication systems for fixed workplaces. The systems are comprised of the Occupational Safety and Health Administration's Hazard Communication Standard in the United States, the Workplace Hazardous Materials Information System (WHMIS) in Canada, and the compilation of several European Union directives addressing classification, labeling of substances and preparations, and MSDSs in Europe. A pretest posttest randomized study design was used, with the posttest being comparable to an open book test. The results of this research indicated that only about two-thirds of the information on the MSDSs was comprehended by the workers with a significant difference identified among study participants based on country comparisons. This data was fairly consistent with the results of previous MSDS comprehensibility studies conducted in the United States. There was no significant difference in the comprehension level among study participants when taking into account the international hazard communication standard that the MSDS complied with. Marginally, age, education level and experience level did not have a significant impact on the comprehension level. Participants did find MSDSs to be satisfactory in providing the information needed to protect them regardless of their views on the readability and formatting of MSDSs.
The health-related information was the least comprehended as less than half of it was comprehended on the basis of the responses. The findings from this research suggest that there is much work needed yet to make MSDSs more comprehensible on a global basis, particularly regarding health-related information.
Abstract:
It is well accepted that tumorigenesis is a multi-step procedure involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as TCGA and ICGC, which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next generation sequencing assays interrogating somatic mutation, insertion, deletion, translocation and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive of therapy response or survival by integrating multiple assays including gene expression, methylation and copy number data through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software that provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with dichotomized gene expression measurements based on their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the road for more accurate and interpretable cancer classification by integrating signals from multiple sources.
Abstract:
Maximizing data quality may be especially difficult in trauma-related clinical research. Strategies are needed to improve data quality and assess the impact of data quality on clinical predictive models. This study had two objectives. The first was to compare missing data between two multi-center trauma transfusion studies: a retrospective study (RS) using medical chart data with minimal data quality review and the PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study with standardized quality assurance. The second objective was to assess the impact of missing data on clinical prediction algorithms by evaluating blood transfusion prediction models using PROMMTT data. RS (2005-06) and PROMMTT (2009-10) investigated trauma patients receiving ≥ 1 unit of red blood cells (RBC) from ten Level I trauma centers. Missing data were compared for 33 variables collected in both studies using mixed effects logistic regression (including random intercepts for study site). Massive transfusion (MT) patients received ≥ 10 RBC units within 24h of admission. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation based on the multivariate normal distribution. A sensitivity analysis for missing data was conducted to estimate the upper and lower bounds of correct classification using assumptions about missing data under best and worst case scenarios. Most variables (17/33=52%) had <1% missing data in RS and PROMMTT. Of the remaining variables, 50% demonstrated less missingness in PROMMTT, 25% had less missingness in RS, and 25% were similar between studies. Missing percentages for MT prediction variables in PROMMTT ranged from 2.2% (heart rate) to 45% (respiratory rate). For variables missing >1%, study site was associated with missingness (all p≤0.021). Survival time predicted missingness for 50% of RS and 60% of PROMMTT variables. 
Complete case proportions for the MT models ranged from 41% to 88%. Complete case analysis and multiple imputation demonstrated similar correct classification results. Sensitivity analysis upper-lower bound ranges for the three MT models were 59-63%, 36-46%, and 46-58%. Prospective collection of ten-fold more variables with data quality assurance reduced overall missing data. Study site and patient survival were associated with missingness, suggesting that data were not missing completely at random, and complete case analysis may lead to biased results. Evaluating clinical prediction model accuracy may be misleading in the presence of missing data, especially with many predictor variables. The proposed sensitivity analysis estimating correct classification under upper (best case scenario)/lower (worst case scenario) bounds may be more informative than multiple imputation, which provided results similar to complete case analysis.
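One simple way to picture a best/worst case sensitivity analysis for correct classification is shown below. It assumes the best case treats every patient with missing predictors as correctly classified and the worst case treats them all as misclassified; the abstract does not spell out its exact assumptions, and the counts used here are hypothetical.

```python
def correct_classification_bounds(n_correct, n_complete, n_missing):
    """Lower and upper bounds on the proportion correctly classified.

    n_correct:  correctly classified patients among the complete cases
    n_complete: patients with all predictor variables observed
    n_missing:  patients who could not be scored because of missing data
    """
    if not 0 <= n_correct <= n_complete:
        raise ValueError("n_correct must lie between 0 and n_complete")
    total = n_complete + n_missing
    lower = n_correct / total                # worst case: all missing are wrong
    upper = (n_correct + n_missing) / total  # best case: all missing are right
    return lower, upper

# Hypothetical counts, not taken from the study.
lower, upper = correct_classification_bounds(n_correct=60, n_complete=80, n_missing=20)
print(f"correct classification between {lower:.0%} and {upper:.0%}")
```

The width of the interval equals the proportion of patients with missing predictors, so it shows directly how much uncertainty the missing data contribute.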
Abstract:
Cervical cancer is the leading cause of death and disease from malignant neoplasms among women in developing countries. Even though the Pap smear has significantly decreased the number of deaths from cervical cancer in the past years, it has its limitations. Researchers have developed an automated screening machine which can potentially detect abnormal cases that are overlooked by conventional screening. The goal of quantitative cytology is to classify the patient's tissue sample based on quantitative measurements of the individual cells. It is also much cheaper and potentially can take less time. One of the major challenges of collecting cells with a cytobrush is the possibility of not sampling any existing dysplastic cells on the cervix. Being able to correctly classify patients who have disease without the presence of dysplastic cells could improve the accuracy of quantitative cytology algorithms. Subtle morphologic changes in normal-appearing tissues adjacent to or distant from malignant tumors have been shown to exist, but a comparison of various statistical methods, including many recent advances in the statistical learning field, has not previously been done. The objective of this thesis is to use different classification methods applied to quantitative cytology data for the detection of malignancy associated changes (MACs). In this thesis, Elastic Net is the best algorithm. When we applied the Elastic Net algorithm to the test set, we combined the training set and validation set as the "training" set and used 5-fold cross validation to choose the parameter for Elastic Net. It has a sensitivity of 47% at 80% specificity, an AUC of 0.52, and a partial AUC of 0.10 (95% CI 0.09-0.11).
Abstract:
Brownfield rehabilitation is an essential step for sustainable land-use planning and management in the European Union. In brownfield regeneration processes, the legacy contamination plays a significant role, firstly because persistent contaminants in soil or groundwater extend the existing hazards and risks well into the future, and secondly because problems from historical contamination are often more difficult to manage than contamination caused by new activities. Due to the complexity associated with the management of brownfield site rehabilitation, Decision Support Systems (DSSs) have been developed to support problem holders and stakeholders in the decision-making process encompassing all phases of the rehabilitation. This paper presents a comparative study between two DSSs, namely SADA (Spatial Analysis and Decision Assistance) and DESYRE (Decision Support System for the Requalification of Contaminated Sites), with the main objective of showing the benefits of using DSSs to introduce and process data and then to disseminate results to different stakeholders involved in the decision-making process. For this purpose, a former car manufacturing plant located in the Brasov area, Central Romania, contaminated chiefly by heavy metals and total petroleum hydrocarbons, has been selected as a case study to apply the two examined DSSs. Major results presented here concern the analysis of the functionalities of the two DSSs in order to identify similarities, differences and complementarities and, thus, to provide an indication of the most suitable integration options.
Abstract:
Coral reefs represent major accumulations of calcium carbonate (CaCO3). The particularly labyrinthine network of reefs in Torres Strait, north of the Great Barrier Reef (GBR), has been examined in order to estimate their gross CaCO3 productivity. The approach involved a two-step procedure, first characterising and classifying the morphology of reefs based on a classification scheme widely employed on the GBR and then estimating gross CaCO3 productivity rates across the region using a regional census-based approach. This was undertaken by independently verifying published rates of coral reef community gross production for use in Torres Strait, based on site-specific ecological and morphological data. A total of 606 reef platforms were mapped and classified using classification trees. Despite the complexity of the maze of reefs in Torres Strait, there are broad morphological similarities with reefs in the GBR. The spatial distribution and dimensions of reef types across both regions are underpinned by similar geological processes, sea-level history in the Holocene and exposure to the same wind/wave energetic regime, resulting in comparable geomorphic zonation. However, the presence of strong tidal currents flowing through Torres Strait and the relatively shallow and narrow dimensions of the shelf exert a control on local morphology and spatial distribution of the reef platforms. A total of 8.7 million tonnes of CaCO3 per year, at an average rate of 3.7 kg CaCO3 m⁻² yr⁻¹ (G), was estimated for the studied area. Extrapolated production rates based on detailed and regional census-based approaches for geomorphic zones across Torres Strait were comparable to those reported elsewhere, particularly values for the GBR based on alkalinity-reduction methods. However, differences in mapping methodologies and the impact of reduced calcification due to global trends in coral reef ecological decline and changing oceanic physical conditions warrant further research.
The novel method proposed in this study to characterise the geomorphology of reef types based on classification trees provides an objective and repeatable data-driven approach that combined with regional census-based approaches has the potential to be adapted and transferred to different coral reef regions, depicting a more accurate picture of interactions between reef ecology and geomorphology.
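The two headline production figures can be related by simple unit arithmetic. The sketch below assumes the average rate is the total annual production divided by the mapped reef area, which the abstract does not state; the implied area is therefore only a rough consistency check on rounded inputs, not a figure from the study.

```python
# Values quoted in the abstract above.
TOTAL_TONNES_PER_YEAR = 8.7e6    # 8.7 million tonnes CaCO3 per year
MEAN_RATE_KG_PER_M2_YEAR = 3.7   # kg CaCO3 per square metre per year

KG_PER_TONNE = 1000
M2_PER_KM2 = 1e6

total_kg_per_year = TOTAL_TONNES_PER_YEAR * KG_PER_TONNE

# Assumption: average rate = total production / area, so area = total / rate.
implied_area_m2 = total_kg_per_year / MEAN_RATE_KG_PER_M2_YEAR
implied_area_km2 = implied_area_m2 / M2_PER_KM2

print(f"implied productive area: about {implied_area_km2:.0f} km^2")
```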
Abstract:
In this study multibeam angular backscatter data acquired on the eastern slope of the Porcupine Seabight are analysed. Processing of the angular backscatter data using the 'NRGCOR' software was made for 29 locations comprising different geological provinces: carbonate mounds, buried mounds, seafloor channels, and inter-channel areas. A detailed methodology is developed to produce a map of angle-invariant (normalized) backscatter data by correcting the local angular backscatter values. The present paper involves detailed processing steps and related technical aspects of the normalization approach. The presented angle-invariant backscatter map possesses 12 dB dynamic range in terms of grey scale. A clear distinction is seen between the mound dominated northern area (Belgica province) and the Gollum channel seafloor at the southern end of the site. Qualitative analyses of the calculated mean backscatter values, i.e., grey scale levels, utilizing angle-invariant backscatter data generally indicate that backscatter values are highest (lighter grey scale) in the mound areas followed by buried mounds. The backscatter values are lowest in the inter-channel areas (lowest grey scale level). Moderate backscatter values (medium grey level) are observed from the Gollum and Kings channel data, with significant variability within the channel seafloor provinces. The segmentation of the channel seafloor provinces is made based on the computed grey scale levels for further analyses based on the angular backscatter strength. Three major parameters are utilized to classify four different seafloor provinces of the Porcupine Seabight by employing a semi-empirical method to analyse multibeam angular backscatter data. The predicted backscatter response which has been computed at 20° is the highest for the mound areas. The coefficient of variation (CV) of the mean backscatter response is also the highest for the mound areas.
Interestingly, the slope value of the buried mound areas is found to be the highest. However, the channel seafloor of moderate backscatter response presents the lowest slope and CV values. A critical examination of the inter-channel areas indicates less variability within the estimated three parameters.
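The coefficient of variation used above as one of the three classification parameters is the standard deviation divided by the mean. A minimal sketch follows, assuming the population standard deviation and using made-up grey-level values; the abstract does not say which estimator was used, and none of the numbers below come from the survey.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / abs(m)

# Hypothetical grey-level samples for two provinces.
mound_like = [2, 4, 4, 4, 5, 5, 7, 9]   # variable response
inter_channel_like = [5, 5, 5, 5]       # uniform response

cv_mound = coefficient_of_variation(mound_like)
cv_inter = coefficient_of_variation(inter_channel_like)
print(cv_mound, cv_inter)
```

A province with a uniform response gives a CV of zero, while a more heterogeneous one gives a larger value, which is what lets the CV separate provinces with similar mean backscatter.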
Abstract:
The Wadden Sea is located in the southeastern part of the North Sea forming an extended intertidal area along the Dutch, German and Danish coast. It is a highly dynamic and largely natural ecosystem influenced by climatic changes and anthropogenic use of the North Sea. Changes in the environment of the Wadden Sea, whether of natural or anthropogenic origin, cannot be monitored by the standard measurement methods alone, because large-area surveys of the intertidal flats are often difficult due to tides, tidal channels and unstable ground. For this reason, remote sensing offers effective monitoring tools. In this study a multi-sensor concept for classification of intertidal areas in the Wadden Sea has been developed. The basis for this method is a combined analysis of RapidEye (RE) and TerraSAR-X (TSX) satellite data coupled with ancillary vector data about the distribution of vegetation, mussel beds and sediments. The classification of the vegetation and mussel beds is based on a decision tree and a set of hierarchically structured algorithms which use object and texture features. The sediments are classified by an algorithm which uses thresholds and a majority filter. Further improvements focus on radiometric enhancement and atmospheric correction. First results show that we are able to identify vegetation and mussel beds with the use of multi-sensor remote sensing. The classification of the sediments in the tidal flats is a challenge compared to vegetation and mussel beds. The results demonstrate that the sediments cannot be classified with high accuracy by their spectral properties alone due to their similarity, which is predominantly caused by their water content.
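The sediment step, thresholds followed by a majority filter, can be sketched on a small grid. The threshold value, the 3×3 window and the class indices below are hypothetical; the abstract does not give the parameters actually used.

```python
from collections import Counter

def threshold_classify(grid, thresholds):
    """Label each value with the index of the first threshold it falls below."""
    def label(value):
        for i, t in enumerate(thresholds):
            if value < t:
                return i
        return len(thresholds)
    return [[label(v) for v in row] for row in grid]

def majority_filter(labels):
    """Replace each cell by the most common label in its 3x3 neighbourhood.

    Windows are clipped at the grid edges; ties go to the label met first.
    """
    rows, cols = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for r in range(rows):
        for c in range(cols):
            window = [labels[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = Counter(window).most_common(1)[0][0]
    return out

# Hypothetical reflectance-like values with one isolated outlier pixel.
pixels = [[0.1, 0.1, 0.1],
          [0.1, 0.9, 0.1],
          [0.1, 0.1, 0.1]]

raw_labels = threshold_classify(pixels, thresholds=[0.5])
smoothed = majority_filter(raw_labels)
print(raw_labels, smoothed)
```

The thresholding assigns the outlier pixel to a different class, and the majority filter then absorbs it into the surrounding class, which is the usual reason for pairing the two steps.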
Abstract:
Sea floor morphology plays an important role in many scientific disciplines such as ecology, hydrology and sedimentology since geomorphic features can act as physical controls for e.g. species distribution, oceanographic flow-path estimations or sedimentation processes. In this study, we provide a terrain analysis of the Weddell Sea based on the 500 m × 500 m resolution bathymetry data provided by the mapping project IBCSO. Seventeen seabed classes are recognized at the sea floor based on a fine and broad scale Benthic Positioning Index calculation highlighting the diversity of the glacially carved shelf. Besides the morphology, slope, aspect, terrain rugosity and hillshade were calculated. Applying zonal statistics to the geomorphic features identified unambiguously the shelf edge of the Weddell Sea with a width of 45-70 km and a mean depth of about 1200 m ranging from 270 m to 4300 m. A complex morphology of troughs, flat ridges, pinnacles, steep slopes, seamounts, outcrops, and narrow ridges, structures with approx. 5-7 km width, build an approx. 40-70 km long swath along the shelf edge. The study shows where scarps and depressions control the connection between shelf and abyssal and where, for example, high and low declination occur within the scarps. For evaluation purposes, 428 grain size samples were added to the seabed class map. The mean values of mud, sand and gravel of those samples falling into a single seabed class were calculated and assigned to a sediment texture class according to a common sediment classification scheme.
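A position index of the kind mentioned above is commonly computed as the elevation of a cell minus the mean elevation of its surroundings, evaluated at a fine and a broad neighbourhood scale. The sketch below assumes that definition with a square neighbourhood; the window shape, the radii and the depth values are hypothetical and are not taken from the study.

```python
def position_index(grid, r, c, radius=1):
    """Cell elevation minus the mean elevation of its square neighbourhood.

    Positive values indicate a local high (crest, pinnacle), negative values
    a local low (trough, depression). The window is clipped at grid edges.
    """
    rows, cols = len(grid), len(grid[0])
    neighbours = [grid[rr][cc]
                  for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                  for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                  if (rr, cc) != (r, c)]
    return grid[r][c] - sum(neighbours) / len(neighbours)

# Hypothetical elevations in metres (negative = below sea level).
pinnacle = [[-500, -500, -500],
            [-500, -400, -500],
            [-500, -500, -500]]
trough = [[-500, -500, -500],
          [-500, -600, -500],
          [-500, -500, -500]]

bpi_pinnacle = position_index(pinnacle, 1, 1)
bpi_trough = position_index(trough, 1, 1)
print(bpi_pinnacle, bpi_trough)
```

Running the same function with a small and a large `radius` gives the fine and broad scale values that a classification scheme can then combine with slope to assign seabed classes.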