902 results for Large Data Sets
Abstract:
We present a novel method to perform an accurate registration of 3-D nonrigid bodies by using phase-shift properties of the dual-tree complex wavelet transform (DT-CWT). Since the phases of DT-CWT coefficients change approximately linearly with the amount of feature displacement in the spatial domain, motion can be estimated using the phase information from these coefficients. The motion estimation is performed iteratively: first by using coarser level complex coefficients to determine large motion components and then by employing finer level coefficients to refine the motion field. We use a parametric affine model to describe the motion, where the affine parameters are found locally by substituting into an optical flow model and by solving the resulting overdetermined set of equations. From the estimated affine parameters, the motion field between the sensed and the reference data sets can be generated, and the sensed data set can then be shifted and interpolated spatially to align with the reference data set. © 2011 IEEE.
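The linear relation between phase and displacement that the abstract relies on can be illustrated with a much simpler stand-in. The sketch below is not the DT-CWT method: it uses a single discrete Fourier coefficient of a 1-D signal, where the relation is exact for circular shifts. The function names and the test signal are assumptions made for illustration only.

```python
import cmath
import math


def dft_coefficient(signal, k):
    """Return the k-th discrete Fourier coefficient of a real sequence."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * k * t / n)
               for t, x in enumerate(signal))


def estimate_shift(reference, sensed, k=1):
    """Estimate a circular shift from the phase difference at frequency k.

    Shifting by d samples multiplies coefficient k by exp(-2*pi*i*k*d/n),
    so the phase difference is linear in d. Valid while |d| < n / (2 * k).
    """
    n = len(reference)
    cross = dft_coefficient(sensed, k) * dft_coefficient(reference, k).conjugate()
    return -cmath.phase(cross) * n / (2 * math.pi * k)


# Hypothetical test signal: a smooth bump, circularly shifted by 3 samples.
reference = [math.exp(-((t - 20) / 4.0) ** 2) for t in range(64)]
sensed = [reference[(t - 3) % 64] for t in range(64)]
estimated = estimate_shift(reference, sensed)
```

A low frequency (small `k`) tolerates large shifts but is coarse, while a higher frequency resolves small shifts but wraps sooner, which is loosely analogous to the coarse-to-fine refinement described in the abstract.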
Abstract:
We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which can be downloaded from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts that can be used to reproduce the analyses carried out in this paper, at https://sites.google.com/site/randomisedbhc/.
Abstract:
The species composition of the genus Feirana is revised, and a new genus, Yerana gen. nov. (Ranidae: Dicroglossinae), is established on the basis of morphological data and molecular phylogeny; after the revision, Feirana contains only two species, F. quadrana and F. taihangnicus. Morphological differences and taxonomic relationships among the species and populations of Feirana were investigated using morphological and morphometric data; their taxonomy and phylogenetic relationships were examined using molecular data; and the causes and history of the geographical distribution pattern of Feirana populations were explored by combining zoogeographic methods with the phylogenetic results. The main results and inferences are as follows. 1. Revision of the species composition of Feirana and establishment of a new genus. The new genus Yerana is established; the species formerly known as F. yei is transferred to it and renamed Y. yei, the type and only species. The new genus is based on the following evidence: (1) in adult males the anal region is swollen, with two large white spherical swellings covered with black spines below the anal opening, a single internal subgular vocal sac is present, and the first finger bears nuptial spines; (2) morphometric analysis shows that the differences between Y. yei and the two Feirana species are far greater than those between F. quadrana and F. taihangnicus; (3) the range of Y. yei is distant and isolated from the ranges of F. quadrana and F. taihangnicus; (4) molecular phylogenetic data (Jiang et al., 2005) show that Y. yei does not form a monophyletic group with F. quadrana and F. taihangnicus but occupies a basal position in the second clade, distinct from the other genera of the tribe Paini. 
As a result, Feirana now contains only two species, F. quadrana and F. taihangnicus. 2. Morphological study of Feirana populations. Twenty-eight morphological characters were measured on 565 specimens from 15 geographical populations of Feirana. Canonical discriminant analysis of the morphometric data indicated that: (1) F. quadrana and F. taihangnicus differ markedly in morphology, which supports the validity of F. taihangnicus as a distinct species; (2) the Mt. Funiu (Henan) and Mt. Zhongtiao (Shanxi) populations, formerly assigned to F. quadrana, should be regarded as populations of F. taihangnicus; (3) the 12 populations of F. quadrana differ clearly from one another, and the differences between the Anxian (Sichuan), Zhouzhi (Shaanxi) and Lichuan (Hubei) populations and the population from the type locality, Wushan (Chongqing), may have reached the subspecific level or higher. A qualitative morphological comparison of the same 15 populations recognised three morphological types, corresponding to F. quadrana, an intermediate type and F. taihangnicus; the varying characters are mainly the inner tarsal fold, the swelling and wart distribution of the male anal region, and the fringe on the outer side of the fifth toe. This agrees with the morphometric results. 3. Molecular phylogenetic study of Feirana populations. Fragments of the mitochondrial 12S rRNA and 16S rRNA genes and the ND2 gene were sequenced for 19 populations of the two Feirana species, giving 1953 bp after alignment. (1) Genetic diversity and distance: the populations show very high genetic diversity, with 19 haplotypes among the 19 population samples (genetic diversity index Hd = 1.0); the ND2 gene carries far more phylogenetic information than the 12S rRNA and 16S rRNA fragments; genetic distances between populations within each of the two groups are much smaller than the distances between the groups; and the relationships implied by the distances for the different genes are consistent with the corresponding phylogenetic trees. 
(2) Phylogenetic relationships: trees constructed with different methods from different sequence data sets have essentially the same structure and indicate that the revised genus Feirana is monophyletic. The 19 populations first cluster into one large clade, which divides into two major clades corresponding to F. quadrana (Group Ⅰ) and F. taihangnicus (Group Ⅱ). Accordingly, the Mt. Zhongtiao populations (Qinshui, Lishan and Jiyuan), the Mt. Funiu populations (Luanchuan and Neixiang), and the populations of Zhashui, Ningshan and Dabagou of Chang'an in eastern Mt. Qinling should be regarded as F. taihangnicus. (3) Subspecific differentiation: on the basis of genetic distances, phylogenetic trees, morphological differences and geographical distribution, F. quadrana can be divided into two subspecies. F. quadrana quadrana comprises the populations of Zhouzhi and Guanghuojie of Chang'an in central Mt. Qinling together with those of eastern Mt. Daba, the Wushan area and northern Mt. Wuling (Lichuan). F. quadrana anxianensis comprises the populations of eastern Mt. Minshan, Mt. Longmen, western Mt. Daba and western Mt. Qinling (Anxian, Qingchuan, Wenxian, Nanjiang and Fengxian). F. taihangnicus can likewise be divided into two subspecies: F. taihangnicus taihangnicus, comprising the populations of Mt. Zhongtiao and eastern Mt. Qinling, and F. taihangnicus funiuensis, comprising the populations of Mt. Funiu. Each of the four subspecies forms its own clade in the phylogenetic trees. 4. Zoogeography of Feirana populations. Divergence times of the 19 populations: a molecular clock was built from the 1953 bp sequences of the 19 Feirana populations and the outgroups N. pleski, P. yunnanesis, P. robertingeri and F. limnocharis. Two calibration points were used: the run-through of the Wushan segment of the Changjiang River, taken as the time at which the Lichuan population began to diverge from the Wushan and Shennongjia populations, and the formation of the Sanmenxia segment of the Yellow River, taken as the time at which the Mt. Zhongtiao populations began to diverge from the Dabagou population of Chang'an. The results indicate that: ① the tribe Paini began to evolve independently at about 70 Ma, consistent with the estimate of about 60±15 Ma by Roelants et al. (2004), which supports the validity of this molecular clock; ② divergence within Feirana began early, with the most recent common ancestor of F. quadrana and F. taihangnicus dating to about 46 to 50 Ma; differentiation of populations within each species is much later than the divergence between the two species, with the two subspecies of F. quadrana beginning to diverge at about 10 Ma and those of F. taihangnicus at about 6 Ma. Formation of the distribution pattern of Feirana: the phylogenetic relationships of Feirana populations correspond closely to their geographical distribution, and the separation between clades in the trees is broadly proportional to the geographical distance between them; the estimated date of speciation between the two species of Feirana is as early as that between Paa yunnanesis and Nanara pleski; the centre of differentiation of the ancestral Feirana population was probably in central Mt. Qinling, and the distribution pattern was probably formed by a combination of vicariance and dispersal events. The ancestral population of F. quadrana dispersed mainly south-westwards from central Mt. Qinling and later differentiated into two subspecies, while the ancestral population of F. taihangnicus dispersed north-eastwards and also differentiated into two subspecies. Geological history of the distribution area: the molecular clock and the inferred mode of population differentiation are consistent with several major geological events in the region, including the rapid differential uplift of the Mt. Minshan-Mt. Longmen-western Mt. Qinling area and the Quaternary glaciations.
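The reported Hd = 1.0 for 19 haplotypes in 19 samples can be checked with a short calculation. The sketch below assumes that Hd denotes Nei's haplotype diversity, Hd = n/(n-1) × (1 − Σ pᵢ²); the abstract does not state which index was used, and the function name is hypothetical.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's haplotype diversity for a list of haplotype labels (assumed index)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two samples are required")
    counts = Counter(haplotypes)
    # Sum of squared haplotype frequencies.
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# 19 samples, each carrying a distinct haplotype, as reported in the abstract.
all_distinct = haplotype_diversity(list(range(19)))
```

Under this formula the index reaches its maximum of 1 exactly when every sample carries a different haplotype, and falls to 0 when all samples share one.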
Abstract:
Decision tree classification algorithms have significant potential for land cover mapping problems and have not been tested in detail by the remote sensing community relative to more conventional pattern recognition techniques such as maximum likelihood classification. In this paper, we present several types of decision tree classification algorithms and evaluate them on three different remote sensing data sets. The decision tree classification algorithms tested include a univariate decision tree, a multivariate decision tree, and a hybrid decision tree capable of including several different types of classification algorithms within a single decision tree structure. Classification accuracies produced by each of these decision tree algorithms are compared with both maximum likelihood and linear discriminant function classifiers. Results from this analysis show that the decision tree algorithms consistently outperform the maximum likelihood and linear discriminant function classifiers with regard to classification accuracy. In particular, the hybrid tree consistently produced the highest classification accuracies for the data sets tested. More generally, the results from this work show that decision trees have several advantages for remote sensing applications by virtue of their relatively simple, explicit, and intuitive classification structure. Further, decision tree algorithms are strictly nonparametric and, therefore, make no assumptions regarding the distribution of input data, and are flexible and robust with respect to nonlinear and noisy relations among input features and class labels.
Abstract:
This paper studies how to invert seismic data and predict reservoirs more effectively in offshore areas of China, where the sedimentary environment is complicated, rock-physics relationships are complex and wells are few. Building on rock physics and amplitude-preserving seismic processing, and taking into account the depositional systems, the distribution patterns of hydrocarbon reservoirs and the characteristics of the seismic inversion methods currently in use, a series of methods was studied. A joint inversion technology for complex geological conditions is presented, together with a workflow and method system for reservoir prediction. The method consists of four key parts. 1) We introduce a new concept, generalized wave impedance, establish the corresponding inversion process, and provide technical means for the joint inversion of lithology and petrophysical properties under complex geological conditions. 2) For high-resolution nonlinear joint inversion of seismic wave impedance, the method uses a multistage nonlinear seismic convolution model rather than the conventional primary structure Robinson seismic convolution model, and implements the inversion with a Caianiello neural network. Based on the definition of multistage positive and negative wavelets, it combines deterministic and statistical physical mechanisms, and direct and indirect inversion. It integrates geological knowledge, rock-physics theory, well data and seismic data, and improves the resolution and noise resistance of wave impedance inversion. 3) For high-resolution nonlinear joint inversion of reservoir physical properties, the method uses a nonlinear rock-physics model that introduces the convolution model into the relationship between wave impedance and porosity/clay content. Through multistage decomposition, it handles the large- and small-scale components of the impedance-porosity/clay relationships separately to obtain more accurate rock-physics relationships. 
By means of bidirectional edge detection with wavelets, it uses the Caianiello neural network to perform statistical inversion, combining model-based and deconvolution-based methods. The resulting joint inversion scheme can integrate seismic data, well data, rock-physics theory and geological knowledge to estimate high-resolution petrophysical parameters. 4) For risk assessment of lateral reservoir prediction, the method integrates seismic lithology identification, petrophysical prediction, multi-scale decomposition of petrophysical parameters, P- and H-spectra, and the match among seismic, well-logging and geological data, which allows it to describe the complexity of the medium better. Applications of the joint inversion of seismic data for lithologic and petrophysical parameters in several selected target areas yield high-resolution lithologic and petrophysical sections (impedance, porosity, clay) showing that the joint inversion can significantly improve the spatial description of reservoirs in data sets involving complex deposits. This demonstrates the validity and practicality of the method.
Abstract:
Oliver, A., Freixenet, J., Marti, R., Pont, J., Perez, E., Denton, E. R. E., Zwiggelaar, R. (2008). A novel breast tissue density classification framework. IEEE Transactions on Information Technology in BioMedicine, 12 (1), 55-65
Abstract:
McMillan, P. F., Wilson, M., Wilding, M. C. (2003). Polyamorphism in aluminate liquids. Journal of Physics: Condensed Matter, 15 (36), 6105-6121 RAE2008
Abstract:
Mark Pagel, Andrew Meade (2004). A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology, 53(4), 571-581. RAE2008
Abstract:
We investigate adaptive buffer management techniques for approximate evaluation of sliding window joins over multiple data streams. In many applications, data stream processing systems have limited memory or have to deal with very high speed data streams. In both cases, computing the exact results of joins between these streams may not be feasible, mainly because the buffers used to compute the joins hold far fewer tuples than the sliding windows contain. A stream buffer management policy is therefore needed. We show that the buffer replacement policy is an important determinant of the quality of the produced results. To that end, we propose GreedyDual-Join (GDJ), an adaptive and locality-aware buffering technique for managing these buffers. GDJ exploits the temporal correlations (at both long and short time scales), which we found to be prevalent in many real data streams. We note that our algorithm is readily applicable to multiple data streams and multiple joins and requires almost no additional system resources. We report results of an experimental study using both synthetic and real-world data sets. Our results demonstrate the superiority and flexibility of our approach when contrasted to other recently proposed techniques.
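To make the role of the replacement policy concrete, here is a minimal sketch of a generic GreedyDual-style buffer for join keys. The abstract does not describe how GDJ assigns priorities, so this is not the authors' algorithm: the class name, the use of the match count as the benefit, and the unit base cost are all assumptions chosen for illustration.

```python
class GreedyDualBuffer:
    """Bounded buffer of join keys with GreedyDual-style eviction (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inflation = 0.0   # rises to the priority of each evicted key
        self.matches = {}      # key -> number of join matches produced so far
        self.priority = {}     # key -> current priority

    def probe(self, key):
        """Probe from the other stream; a hit raises the key's priority."""
        if key not in self.priority:
            return False
        self.matches[key] += 1
        self.priority[key] = self.inflation + 1 + self.matches[key]
        return True

    def insert(self, key):
        """Buffer a new key, evicting the lowest-priority key if full."""
        if key in self.priority:
            return None
        evicted = None
        if len(self.priority) >= self.capacity:
            evicted = min(self.priority, key=self.priority.get)
            self.inflation = self.priority.pop(evicted)
            del self.matches[evicted]
        self.matches[key] = 0
        self.priority[key] = self.inflation + 1
        return evicted
```

Because new keys start at the current inflation value plus a base cost, keys that stop producing matches are gradually overtaken and evicted, which is one simple way for a policy to follow the temporal correlations mentioned in the abstract.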
Abstract:
Object detection is challenging when the object class exhibits large within-class variations. In this work, we show that foreground-background classification (detection) and within-class classification of the foreground class (pose estimation) can be jointly learned in a multiplicative form of two kernel functions. One kernel measures similarity for foreground-background classification. The other kernel accounts for latent factors that control within-class variation and implicitly enables feature sharing among foreground training samples. Detector training can be accomplished via standard SVM learning. The resulting detectors are tuned to specific variations in the foreground class. They also serve to evaluate hypotheses of the foreground state. When the foreground parameters are provided in training, the detectors can also produce parameter estimates. When the foreground object masks are provided in training, the detectors can also produce object segmentation. The advantages of our method over past methods are demonstrated on data sets of human hands and vehicles.
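The multiplicative form of two kernels can be sketched as follows. The abstract does not specify the individual kernels, so the use of Gaussian (RBF) kernels for both factors, the parameter values, and all names below are assumptions for illustration.

```python
import math


def rbf(u, v, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def multiplicative_kernel(sample_a, sample_b, gamma_x=0.5, gamma_theta=2.0):
    """Product of an appearance kernel and a foreground-state kernel.

    Each sample is a pair (x, theta): image features x and a within-class
    state theta such as a pose parameter.
    """
    (x_a, theta_a), (x_b, theta_b) = sample_a, sample_b
    return rbf(x_a, x_b, gamma_x) * rbf(theta_a, theta_b, gamma_theta)


# Hypothetical samples: 2-D appearance features and a 1-D pose angle.
s1 = ((0.2, 0.7), (0.10,))
s2 = ((0.3, 0.6), (0.90,))
value = multiplicative_kernel(s1, s2)
```

The product of two positive semidefinite kernels is itself positive semidefinite, which is why such a kernel can be passed to a standard SVM solver. Samples with similar appearance but very different states receive a low similarity, so each support vector mainly influences detectors for nearby states.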
Abstract:
Object detection and recognition are important problems in computer vision. The challenges of these problems come from the presence of noise, background clutter, large within-class variations of the object class and limited training data. In addition, the computational complexity of the recognition process is also a concern in practice. In this thesis, we propose one approach to handle the problem of detecting an object class that exhibits large within-class variations, and a second approach to speed up the classification processes. In the first approach, we show that foreground-background classification (detection) and within-class classification of the foreground class (pose estimation) can be jointly solved using a multiplicative form of two kernel functions. One kernel measures similarity for foreground-background classification. The other kernel accounts for latent factors that control within-class variation and implicitly enables feature sharing among foreground training samples. For applications where explicit parameterization of the within-class states is unavailable, a nonparametric formulation of the kernel can be constructed with a proper foreground distance/similarity measure. Detector training is accomplished via standard Support Vector Machine learning. The resulting detectors are tuned to specific variations in the foreground class. They also serve to evaluate hypotheses of the foreground state. When the image masks for foreground objects are provided in training, the detectors can also produce object segmentation. Methods for generating a representative sample set of detectors are proposed that can enable efficient detection and tracking. In addition, because individual detectors verify hypotheses of foreground state, they can also be incorporated in a tracking-by-detection framework to recover foreground state in image sequences. 
To run the detectors efficiently at the online stage, an input-sensitive speedup strategy is proposed to select the most relevant detectors quickly. The proposed approach is tested on data sets of human hands, vehicles and human faces. On all data sets, the proposed approach achieves improved detection accuracy over the best competing approaches. In the second part of the thesis, we formulate a filter-and-refine scheme to speed up recognition processes. The binary outputs of the weak classifiers in a boosted detector are used to identify a small number of candidate foreground state hypotheses quickly via Hamming distance or weighted Hamming distance. The approach is evaluated in three applications: face recognition on the Face Recognition Grand Challenge version 2 data set, hand shape detection and parameter estimation on a hand data set, and vehicle detection and estimation of the view angle on a multi-pose vehicle data set. On all data sets, our approach is at least five times faster than simply evaluating all foreground state hypotheses, with virtually no loss in classification accuracy.
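The filter-and-refine idea can be sketched in a few lines: a cheap Hamming-distance filter shortlists hypotheses, and only the shortlist is scored exactly. The bit codes, the scoring function and the shortlist size below are hypothetical; the thesis's actual weak classifiers and weighting are not given in the abstract.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_and_refine(query_bits, candidate_bits, exact_score, keep=2):
    """Shortlist the `keep` nearest codes, then pick the best by exact score.

    candidate_bits maps a hypothesis name to its binary code; exact_score is
    an expensive scoring function that is applied to the shortlist only.
    """
    shortlist = sorted(
        candidate_bits,
        key=lambda name: hamming(query_bits, candidate_bits[name]),
    )[:keep]
    return max(shortlist, key=exact_score), shortlist


# Hypothetical codes for three foreground-state hypotheses.
codes = {"pose_a": (1, 0, 1, 1), "pose_b": (1, 0, 1, 0), "pose_c": (0, 1, 0, 0)}
scores = {"pose_a": 0.9, "pose_b": 0.7, "pose_c": 0.1}
best, shortlist = filter_and_refine((1, 0, 1, 1), codes, scores.get)
```

The saving comes from calling the expensive scorer `keep` times instead of once per hypothesis; the price is that a hypothesis dropped by the filter can never be selected, so the shortlist size trades speed against accuracy.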
Abstract:
The flower industry has a reputation for heavy use of toxic chemicals, environmental pollution, enormous consumption of water, and poor working conditions and low wages in various parts of the world. It is unfortunate that this industry is resistant to change and is repeating the same mistakes in Ethiopia. Because of this, there is growing concern among the general public and the international community about the sustainability of the Ethiopian flower industry. Consequently, working conditions in the flower industry, the impact of wage income on the livelihoods of employees, the coping strategies of low-wage flower farm workers, the impact of flower farms on the livelihoods of local people, and environmental pollution and conflict were analysed. Both qualitative and quantitative research methods were employed. Four quantitative data sets (surveys of labour practices, employees' income and expenditure, displaced households, and flower growers' views) were collected between 2010 and 2012. Robust regression was used to identify the determinants of wage levels, and multinomial logit to identify the determinants of the coping strategies of flower farm workers and displaced households. The findings show that working conditions on flower farms are characterized by low wages, job insecurity, frequent violation of employees' rights, and poor safety measures. To ensure the survival of their families, households dispossessed of land adopt a wide range of strategies, including reducing food consumption, sharing oxen, renting land, sharecropping, and shifting staple food crops. Most experienced scarcity of water resources, lack of grazing areas, death of herds and reduced numbers of livestock due to water source pollution. Despite the Ethiopian government's investment in attracting investors and creating a conducive environment for them, not much has been accomplished when it comes to enforcing labour laws and environmental policies. 
Flower farm expansion in Ethiopia, as it is now, can be viewed as part of the global land and water grab and is neither all-inclusive nor sustainable. Several recommendations are made to improve working conditions and to maximize the benefits of the flower industry to society and to the country at large.
Abstract:
As more diagnostic testing options become available to physicians, it becomes more difficult to combine various types of medical information in order to optimize the overall diagnosis. To improve diagnostic performance, here we introduce an approach to optimize a decision-fusion technique to combine heterogeneous information, such as from different modalities, feature categories, or institutions. For classifier comparison we used two performance metrics: the area under the receiver operating characteristic (ROC) curve (AUC) and the normalized partial area under the curve (pAUC). This study used four classifiers: linear discriminant analysis (LDA), artificial neural network (ANN), and two variants of our decision-fusion technique, AUC-optimized (DF-A) and pAUC-optimized (DF-P) decision fusion. We applied each of these classifiers with 100-fold cross-validation to two heterogeneous breast cancer data sets: one of mass lesion features and a much more challenging one of microcalcification lesion features. For the calcification data set, DF-A outperformed the other classifiers in terms of AUC (p < 0.02) and achieved AUC=0.85 +/- 0.01. The DF-P surpassed the other classifiers in terms of pAUC (p < 0.01) and reached pAUC=0.38 +/- 0.02. For the mass data set, DF-A outperformed both the ANN and the LDA (p < 0.04) and achieved AUC=0.94 +/- 0.01. Although for this data set there were no statistically significant differences among the classifiers' pAUC values (pAUC=0.57 +/- 0.07 to 0.67 +/- 0.05, p > 0.10), the DF-P did significantly improve specificity versus the LDA at both 98% and 100% sensitivity (p < 0.04). In conclusion, decision fusion directly optimized clinically significant performance measures, such as AUC and pAUC, and sometimes outperformed two well-known machine-learning techniques when applied to two different breast cancer data sets.
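For readers unfamiliar with the first metric, the AUC can be computed directly from classifier scores as the probability that a randomly chosen positive case is ranked above a randomly chosen negative case (ties counted as one half). The sketch below shows only this standard rank-based AUC; the normalized pAUC depends on a sensitivity range that the abstract does not specify, so it is not reproduced here. The function name and the scores are illustrative.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier outputs for malignant and benign cases.
malignant = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.2]
area = auc(malignant, benign)
```

Here eight of the nine malignant-benign pairs are ordered correctly, so the area is 8/9. The pairwise form is quadratic in the number of cases, which is fine for small data sets but is usually replaced by a rank-based computation for large ones.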
Abstract:
BACKGROUND: There is considerable interest in the development of methods to efficiently identify all coding variants present in large sample sets of humans. There are three approaches possible: whole-genome sequencing, whole-exome sequencing using exon capture methods, and RNA-Seq. While whole-genome sequencing is the most complete, it remains sufficiently expensive that cost-effective alternatives are important. RESULTS: Here we provide a systematic exploration of how well RNA-Seq can identify human coding variants by comparing variants identified through high coverage whole-genome sequencing to those identified by high coverage RNA-Seq in the same individual. This comparison allowed us to directly evaluate the sensitivity and specificity of RNA-Seq in identifying coding variants, and to evaluate how key parameters such as the degree of coverage and the expression levels of genes interact to influence performance. We find that only 40% of exonic variants identified by whole-genome sequencing were captured using RNA-Seq, but this number rose to 81% when concentrating on genes known to be well-expressed in the source tissue. We also find that a high false positive rate can be problematic when working with RNA-Seq data, especially at higher levels of coverage. CONCLUSIONS: We conclude that as long as a tissue relevant to the trait under study is available and suitable quality control screens are implemented, RNA-Seq is a fast and inexpensive alternative approach for finding coding variants in genes with sufficiently high expression levels.
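The comparison described above amounts to set arithmetic on variant calls from the same individual. The sketch below treats the whole-genome calls as the reference set, which is an assumption made here for illustration; the variant tuples and the function name are hypothetical and do not come from the study.

```python
def compare_calls(wgs_variants, rnaseq_variants):
    """Compare RNA-Seq variant calls against whole-genome calls.

    Returns (sensitivity, unconfirmed_fraction): the share of WGS variants
    recovered by RNA-Seq, and the share of RNA-Seq calls absent from WGS.
    """
    wgs = set(wgs_variants)
    rna = set(rnaseq_variants)
    shared = wgs & rna
    sensitivity = len(shared) / len(wgs)
    unconfirmed_fraction = len(rna - wgs) / len(rna)
    return sensitivity, unconfirmed_fraction


# Hypothetical calls as (chromosome, position, alternate allele).
wgs_calls = [("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G"),
             ("chr2", 900, "C"), ("chr3", 15, "T")]
rna_calls = [("chr1", 100, "A"), ("chr2", 40, "G"), ("chr4", 7, "A")]
sensitivity, unconfirmed = compare_calls(wgs_calls, rna_calls)
```

Restricting both inputs to variants in well-expressed genes before calling the function is the kind of stratification that, in the study, raised the captured share from 40% to 81%.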
Abstract:
The Feeding Experiments End-user Database (FEED) is a research tool developed by the Mammalian Feeding Working Group at the National Evolutionary Synthesis Center that permits synthetic, evolutionary analyses of the physiology of mammalian feeding. The tasks of the Working Group are to compile physiologic data sets into a uniform digital format stored at a central source, develop a standardized terminology for describing and organizing the data, and carry out a set of novel analyses using FEED. FEED contains raw physiologic data linked to extensive metadata. It serves as an archive for a large number of existing data sets and a repository for future data sets. The metadata are stored as text and images that describe experimental protocols, research subjects, and anatomical information. The metadata incorporate controlled vocabularies to allow consistent use of the terms used to describe and organize the physiologic data. The planned analyses address long-standing questions concerning the phylogenetic distribution of phenotypes involving muscle anatomy and feeding physiology among mammals, the presence and nature of motor pattern conservation in the mammalian feeding muscles, and the extent to which suckling constrains the evolution of feeding behavior in adult mammals. We expect FEED to be a growing digital archive that will facilitate new research into understanding the evolution of feeding anatomy.