931 results for large data sets
Abstract:
One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
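As a rough illustration of how such a semiartificial test set might be assembled, the sketch below concatenates single-gene genomic sequences with randomly generated intergenic spacers until the contig reaches roughly BAC size. This is not the authors' actual pipeline; the GC content, spacer lengths and function names are illustrative assumptions.

```python
import random

def random_intergenic(length, gc=0.41):
    """Random intergenic spacer with a given GC content
    (~41% is a rough genome-wide assumption)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(random.choices("ACGT", weights=weights, k=length))

def build_semiartificial_contig(gene_seqs, target_length=200_000,
                                spacer_range=(5_000, 20_000)):
    """Concatenate single-gene genomic sequences separated by random spacers
    until the contig reaches roughly the size of a 200 kb BAC; returns the
    contig and the true gene coordinates for scoring predictions."""
    pieces, gene_coords, pos = [], [], 0
    for seq in gene_seqs:
        spacer = random_intergenic(random.randint(*spacer_range))
        pieces.append(spacer)
        pos += len(spacer)
        gene_coords.append((pos, pos + len(seq)))  # gene location in the contig
        pieces.append(seq)
        pos += len(seq)
        if pos >= target_length:
            break
    return "".join(pieces), gene_coords
```

The recorded coordinates would then serve as the truth set against which the gene predictions are scored.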
Abstract:
Accurate characterization of the spatial distribution of hydrological properties in heterogeneous aquifers at a range of scales is a key prerequisite for reliable modeling of subsurface contaminant transport, and is essential for designing effective and cost-efficient groundwater management and remediation strategies. To this end, high-resolution geophysical methods have shown significant potential to bridge a critical gap in subsurface resolution and coverage between traditional hydrological measurement techniques such as borehole log/core analyses and tracer or pumping tests. An important and still largely unresolved issue, however, is how to best quantitatively integrate geophysical data into a characterization study in order to estimate the spatial distribution of one or more pertinent hydrological parameters, thus improving hydrological predictions. Recognizing the importance of this issue, the aim of the research presented in this thesis was to first develop a strategy for the assimilation of several types of hydrogeophysical data having varying degrees of resolution, subsurface coverage, and sensitivity to the hydrologic parameter of interest. In this regard, a novel simulated annealing (SA)-based conditional simulation approach was developed and then tested in its ability to generate realizations of porosity given crosshole ground-penetrating radar (GPR) and neutron porosity log data. This was done successfully for both synthetic and field data sets. A subsequent issue that needed to be addressed involved assessing the potential benefits and implications of the resulting porosity realizations in terms of groundwater flow and contaminant transport. This was investigated synthetically assuming first that the relationship between porosity and hydraulic conductivity was well-defined. Then, the relationship was itself investigated in the context of a calibration procedure using hypothetical tracer test data. Essentially, the relationship best predicting the observed tracer test measurements was determined given the geophysically derived porosity structure. Both of these investigations showed that the SA-based approach, in general, allows much more reliable hydrological predictions than other more elementary techniques considered. Further, the developed calibration procedure was seen to be very effective, even at the scale of tomographic resolution, for predictions of transport. This also held true at locations within the aquifer where only geophysical data were available. This is significant because the acquisition of hydrological tracer test measurements is clearly more complicated and expensive than the acquisition of geophysical measurements. Although the above methodologies were tested using porosity logs and GPR data, the findings are expected to remain valid for a large number of pertinent combinations of geophysical and borehole log data of comparable resolution and sensitivity to the hydrological target parameter. Moreover, the obtained results allow us to have confidence for future developments in integration methodologies for geophysical and hydrological data to improve the 3-D estimation of hydrological properties.
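The following sketch shows the general shape of a simulated-annealing conditional simulation, in which porosity values at unconditioned grid nodes are swapped and the swap is kept when it reduces an objective function (for example, a mismatch with spatial statistics inferred from crosshole GPR). This is a generic illustration rather than the algorithm developed in the thesis; the function names, cooling schedule and acceptance rule are assumptions.

```python
import numpy as np

def sa_conditional_simulation(initial_field, objective, conditioned_mask,
                              n_iter=50_000, t0=1.0, cooling=0.999, seed=None):
    """Generic simulated-annealing conditional simulation: repeatedly swap
    porosity values at two unconditioned grid nodes and accept the swap if
    it lowers the objective, or with a temperature-dependent probability
    otherwise. Nodes flagged in `conditioned_mask` (e.g. log data) stay fixed."""
    rng = np.random.default_rng(seed)
    field = np.array(initial_field, copy=True)
    flat = field.ravel()
    free = np.flatnonzero(~conditioned_mask.ravel())   # nodes not fixed by well logs
    current = objective(field)
    temp = t0
    for _ in range(n_iter):
        i, j = rng.choice(free, size=2, replace=False)
        flat[i], flat[j] = flat[j], flat[i]            # trial perturbation
        trial = objective(field)
        if trial <= current or rng.random() < np.exp((current - trial) / temp):
            current = trial                            # accept the swap
        else:
            flat[i], flat[j] = flat[j], flat[i]        # reject: undo the swap
        temp *= cooling
    return field
```

Here `objective` would quantify the mismatch between the current realization and the geophysical constraints, for instance the difference between its experimental variogram and one derived from the GPR tomograms.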
Abstract:
Background and aim of the study: Genomic gains and losses play a crucial role in the development and progression of DLBCL and are closely related to gene expression profiles (GEP), including the germinal center B-cell like (GCB) and activated B-cell like (ABC) cell of origin (COO) molecular signatures. To identify new oncogenes or tumor suppressor genes (TSG) involved in DLBCL pathogenesis and to determine their prognostic values, an integrated analysis of high-resolution gene expression and copy number profiling was performed. Patients and methods: Two hundred and eight adult patients with de novo CD20+ DLBCL enrolled in the prospective multicentric randomized LNH-03 GELA trials (LNH03-1B, -2B, -3B, -39B, -5B, -6B, -7B) with available frozen tumour samples, centralized review and adequate DNA/RNA quality were selected. 116 patients were treated with rituximab (R)-CHOP/R-miniCHOP and 92 patients were treated with the high-dose (R)-ACVBP regimen dedicated to frontline treatment of patients younger than 60 years (y). Tumour samples were simultaneously analysed by high resolution comparative genomic hybridization (CGH, Agilent, 144K) and gene expression arrays (Affymetrix, U133+2). Minimal common regions (MCRs), defined as segments that affect the same chromosomal region in different cases, were delineated. Gene expression and MCR data sets were merged using the Gene Expression and Dosage Integrator algorithm (GEDI; Lenz et al., PNAS 2008) to identify new potential driver genes. Results: A total of 1363 recurrent MCRs (defined by a penetrance > 5%) within the DLBCL data set, ranging in size from 386 bp (affecting a single gene) to more than 24 Mb, were identified by CGH. Of these MCRs, 756 (55%) showed a significant association with gene expression: 396 (59%) gains, 354 (52%) single-copy deletions, and 6 (67%) homozygous deletions. By this integrated approach, in addition to previously reported genes (CDKN2A/2B, PTEN, DLEU2, TNFAIP3, B2M, CD58, TNFRSF14, FOXP1, REL...), several genes targeted by gene copy abnormalities with a dosage effect and potential physiopathological impact were identified, including genes with TSG activity involved in cell cycle (HACE1, CDKN2C), immune response (CD68, CD177, CD70, TNFSF9, IRAK2), DNA integrity (XRCC2, BRCA1, NCOR1, NF1, FHIT) or oncogenic functions (CD79b, PTPRT, MALT1, AUTS2, MCL1, PTTG1...), with distinct distributions according to the COO signature. The CDKN2A/2B tumor suppressor locus (9p21) was deleted homozygously in 27% of cases and hemizygously in 9% of cases. Biallelic loss was observed in 49% of ABC DLBCL and in 10% of GCB DLBCL. This deletion was strongly correlated with age and associated with a limited number of additional genetic abnormalities, including trisomies 3 and 18 and short gains/losses of regions on chromosomes 1, 2 and 19 (FDR < 0.01), allowing the identification of genes that may have synergistic effects with CDKN2A/2B inactivation. With a median follow-up of 42.9 months, only CDKN2A/2B biallelic deletion strongly correlates (FDR p-value < 0.01) with a poor outcome in the entire cohort (4y PFS = 44% [32-61] vs. 74% [66-82] for patients in germline configuration; 4y OS = 53% [39-72] vs. 83% [76-90]). In a Cox proportional hazards model of PFS, CDKN2A/2B deletion remains predictive (HR = 1.9 [1.1-3.2], p = 0.02) when combined with IPI (HR = 2.4 [1.4-4.1], p = 0.001) and GCB status (HR = 1.3 [0.8-2.3], p = 0.31). This difference remains predictive in the subgroup of patients treated with R-CHOP (4y PFS = 43% [29-63] vs. 
66% [55-78], p=0.02), in patients treated with R-ACVBP (4y PFS = 49% [28-84] vs. 83% [74-92], p=0.003), and in the GCB (4y PFS = 50% [27-93] vs. 81% [73-90], p=0.02) and ABC/unclassified (5y PFS = 42% [28-61] vs. 67% [55-82], p = 0.009) molecular subtypes (Figure 1). Conclusion: We report for the first time an integrated genetic analysis of a large cohort of DLBCL patients included in a prospective multicentric clinical trial program, allowing the identification of new potential driver genes with pathogenic impact. However, CDKN2A/2B deletion constitutes the strongest, and only, prognostic factor of chemoresistance to R-CHOP, regardless of the COO signature, and it is not overcome by more intensified immunochemotherapy. Patients displaying this frequent genomic abnormality warrant new and dedicated therapeutic approaches.
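A hedged sketch of the kind of multivariable Cox model described above (progression-free survival against CDKN2A/2B deletion, IPI and GCB status) is shown below using the lifelines package; the data are simulated and the column names are hypothetical, so the output is purely illustrative.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
deleted = rng.integers(0, 2, n)      # hypothetical biallelic CDKN2A/2B deletion flag
ipi_high = rng.integers(0, 2, n)     # hypothetical dichotomized IPI
gcb = rng.integers(0, 2, n)          # hypothetical GCB (1) vs ABC/unclassified (0)

# Simulated progression times with worse outcome for deleted / high-IPI patients.
hazard = 0.02 * np.exp(0.6 * deleted + 0.9 * ipi_high - 0.2 * gcb)
time = rng.exponential(1.0 / hazard)
df = pd.DataFrame({
    "pfs_months": np.minimum(time, 60.0),        # administrative censoring at 60 months
    "progressed": (time < 60.0).astype(int),
    "cdkn2ab_deleted": deleted,
    "ipi_high": ipi_high,
    "gcb": gcb,
})

cph = CoxPHFitter()
cph.fit(df, duration_col="pfs_months", event_col="progressed")
cph.print_summary()                              # hazard ratios for the three covariates
```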
Abstract:
Digital information sources create the possibility of a high degree of redundancy in the data available for fitting the predictive models used in Digital Soil Mapping (DSM). Among these models, the Decision Tree (DT) technique has been increasingly applied due to its capacity to deal with large data sets. The purpose of this study was to evaluate the impact of the data volume used to generate the DT models on the quality of soil maps. An area of 889.33 km² was chosen in the northern region of the State of Rio Grande do Sul. The soil-landscape relationship was obtained from field verification (reambulation) of the study area and the alignment of the units on the 1:50,000 scale topographic map. Six predictive covariates linked to the soil formation factors relief and organisms, together with data sets of 1, 3, 5, 10, 15, 20 and 25 % of the total data volume, were used to generate the predictive DT models in the data mining program Waikato Environment for Knowledge Analysis (WEKA). Sample densities below 5 % resulted in models with less power to capture the complexity of the spatial distribution of soils in the study area. The best trade-off between the data volume to be handled and the predictive capacity of the models was obtained with samples between 5 and 15 %. For the models based on these sample densities, the collected field data indicated a predictive mapping accuracy close to 70 %.
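The sketch below illustrates the general experiment of training decision-tree models on increasing fractions of the available cells and scoring them against a common held-out set. It uses scikit-learn rather than WEKA, which the study actually employed, and all names and parameters are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def accuracy_by_sample_fraction(X, y,
                                fractions=(0.01, 0.03, 0.05, 0.10, 0.15, 0.20, 0.25),
                                seed=42):
    """Train a decision tree on increasing fractions of the covariate rows
    (X, y as NumPy arrays) and score each model on a common held-out set,
    mimicking the 1-25 % sampling densities of the study."""
    rng = np.random.default_rng(seed)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    results = {}
    for frac in fractions:
        n = max(1, int(frac * len(X_pool)))
        idx = rng.choice(len(X_pool), size=n, replace=False)   # random subsample
        tree = DecisionTreeClassifier(random_state=seed).fit(X_pool[idx], y_pool[idx])
        results[frac] = accuracy_score(y_test, tree.predict(X_test))
    return results
```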
Abstract:
Many of the most interesting questions ecologists ask lead to analyses of spatial data. Yet, perhaps confused by the large number of statistical models and fitting methods available, many ecologists seem to believe this is best left to specialists. Here, we describe the issues that need consideration when analysing spatial data and illustrate these using simulation studies. Our comparative analysis involves using methods including generalized least squares, spatial filters, wavelet-revised models, conditional autoregressive models and generalized additive mixed models to estimate regression coefficients from synthetic but realistic data sets, including some that violate standard regression assumptions. We assess the performance of each method using two measures, together with statistical error rates for model selection. Methods that performed well included the generalized least squares family of models and a Bayesian implementation of the conditional autoregressive model. Ordinary least squares also performed adequately in the absence of model selection, but had poorly controlled Type I error rates and so did not show the improvements in performance under model selection seen with the above methods. Removing large-scale spatial trends in the response led to poor performance. These are empirical results; hence extrapolation of these findings to other situations should be performed cautiously. Nevertheless, our simulation-based approach provides much stronger evidence for comparative analysis than assessments based on single or small numbers of data sets, and should be considered a necessary foundation for statements of this type in future.
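A minimal simulation of the central point, that ignoring spatial autocorrelation inflates Type I error while generalized least squares with an appropriate covariance controls it, might look like the following. The covariance model and its parameters are assumptions, and the GLS fit is handed the true covariance, which would of course be unknown in practice.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
coords = rng.uniform(0, 10, size=(n, 2))                  # random sampling locations
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
sigma = np.exp(-dist / 2.0)                               # exponential spatial covariance

x = rng.normal(size=n)                                    # covariate unrelated to the response
X = sm.add_constant(x)
y = np.linalg.cholesky(sigma) @ rng.normal(size=n)        # spatially autocorrelated noise only

ols = sm.OLS(y, X).fit()
gls = sm.GLS(y, X, sigma=sigma).fit()                     # true covariance supplied here
print("OLS p-value for x:", ols.pvalues[1])               # often spuriously small
print("GLS p-value for x:", gls.pvalues[1])
```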
Abstract:
BACKGROUND: Genotypes obtained with commercial SNP arrays have been extensively used in many large case-control or population-based cohorts for SNP-based genome-wide association studies for a multitude of traits. Yet, these genotypes capture only a small fraction of the variance of the studied traits. Genomic structural variants (GSV) such as Copy Number Variation (CNV) may account for part of the missing heritability, but their comprehensive detection requires either next-generation arrays or sequencing. Sophisticated algorithms that infer CNVs by combining the intensities from SNP-probes for the two alleles can already be used to extract a partial view of such GSV from existing data sets. RESULTS: Here we present several advances to facilitate the latter approach. First, we introduce a novel CNV detection method based on a Gaussian Mixture Model. Second, we propose a new algorithm, PCA merge, for combining copy-number profiles from many individuals into consensus regions. We applied both our new methods as well as existing ones to data from 5612 individuals from the CoLaus study who were genotyped on Affymetrix 500K arrays. We developed a number of procedures in order to evaluate the performance of the different methods. This includes comparison with previously published CNVs as well as using a replication sample of 239 individuals, genotyped with Illumina 550K arrays. We also established a new evaluation procedure that employs the fact that related individuals are expected to share their CNVs more frequently than randomly selected individuals. The ability to detect both rare and common CNVs provides a valuable resource that will facilitate association studies exploring potential phenotypic associations with CNVs. CONCLUSION: Our new methodologies for CNV detection and their evaluation will help in extracting additional information from the large amount of SNP-genotyping data on various cohorts and use this to explore structural variants and their impact on complex traits.
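As a hedged illustration of the general idea behind Gaussian-mixture-based CNV calling (not the specific model introduced in the paper), the sketch below fits a one-dimensional mixture to per-probe intensities and labels each probe with a copy-number state; the intensity scale and number of states are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def call_copy_number(log_r_ratio, n_states=3, seed=0):
    """Fit a one-dimensional Gaussian mixture to per-probe log R ratios and
    label each probe with a copy-number state, ordering the mixture
    components by mean (low ~ deletion, middle ~ normal, high ~ gain)."""
    x = np.asarray(log_r_ratio, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_states, random_state=seed).fit(x)
    rank = {comp: r for r, comp in enumerate(np.argsort(gmm.means_.ravel()))}
    return np.array([rank[c] for c in gmm.predict(x)])

# Toy example: probes with a deleted segment and a duplicated segment.
rng = np.random.default_rng(0)
lrr = np.concatenate([rng.normal(0.0, 0.15, 400),
                      rng.normal(-0.6, 0.15, 50),
                      rng.normal(0.6, 0.15, 30),
                      rng.normal(0.0, 0.15, 400)])
states = call_copy_number(lrr)   # 0 = loss, 1 = neutral, 2 = gain (by component mean)
```

In practice, per-probe calls such as these would still need to be segmented along the chromosome and merged across individuals, which is the role of consensus-building steps like the PCA merge algorithm described above.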
Abstract:
To date, published studies of alluvial bar architecture in large rivers have been restricted mostly to case studies of individual bars and single locations. Relatively little is known about how the depositional processes and sedimentary architecture of kilometre-scale bars vary within a multi-kilometre reach or over several hundreds of kilometres downstream. This study presents Ground Penetrating Radar and core data from 11 kilometre-scale bars from the Rio Parana, Argentina. The investigated bars are located between 30 km upstream and 540 km downstream of the Rio Parana - Rio Paraguay confluence, where a significant volume of fine-grained suspended sediment is introduced into the network. Bar-scale cross-stratified sets, with lengths and widths up to 600 m and thicknesses up to 12 m, enable the distinction of large river deposits from stacked deposits of smaller rivers, but are only present in half the surface area of the bars. Up to 90% of bar-scale sets are found on top of finer-grained ripple-laminated bar-trough deposits. Bar-scale sets make up as much as 58% of the volume of the deposits in small, incipient mid-channel bars, but this proportion decreases significantly with increasing age and size of the bars. Contrary to what might be expected, a significant proportion of the sedimentary structures found in the Rio Parana is similar in scale to those found in much smaller rivers. In other words, large river deposits are not always characterized by big structures that allow a simple interpretation of river scale. However, the large scale of the depositional units in big rivers causes small-scale structures, such as ripple sets, to be grouped into thicker cosets, which indicate river scale even when no obvious large-scale sets are present. The results also show that the composition of bars differs between the studied reaches upstream and downstream of the confluence with the Rio Paraguay. Relative to other controls on downstream fining, the tributary input of fine-grained suspended material from the Rio Paraguay causes a marked change in the composition of the bar deposits. Compared to the upstream reaches, the sedimentary architecture of the downstream reaches in the top ca 5 m of mid-channel bars shows: (i) an increase in the abundance and thickness (up to metre-scale) of laterally extensive (hundreds of metres) fine-grained layers; (ii) an increase in the percentage of deposits composed of ripple sets (to >40% in the upper bar deposits); and (iii) an increase in bar-trough deposits and a corresponding decrease in bar-scale cross-strata (<10%). The thalweg deposits of the Rio Parana are composed of dune sets, even directly downstream from the Rio Paraguay where the upper channel deposits are dominantly fine-grained. Thus, the change in sedimentary facies due to a tributary point-source of fine-grained sediment is primarily expressed in the composition of the upper bar deposits.
Abstract:
We survey a number of papers that have focused on the construction of cross-country data sets on average years of schooling. We discuss the construction of the different series, compare their profiles and construct indicators of their information content. The discussion focuses on a sample of OECD countries but we also provide some results for a large non-OECD sample.
Abstract:
The recent rapid development of biotechnological approaches has enabled the production of large whole-genome-level biological data sets. In order to handle these data sets, reliable and efficient automated tools and methods for data processing and result interpretation are required. Bioinformatics, as the field of studying and processing biological data, tries to answer this need by combining methods and approaches across computer science, statistics, mathematics and engineering to study and process biological data. The need is also increasing for tools that can be used by the biological researchers themselves who may not have a strong statistical or computational background, which requires creating tools and pipelines with intuitive user interfaces, robust analysis workflows and strong emphasis on result reporting and visualization. Within this thesis, several data analysis tools and methods have been developed for analyzing high-throughput biological data sets. These approaches, covering several aspects of high-throughput data analysis, are specifically aimed at gene expression and genotyping data, although in principle they are suitable for analyzing other data types as well. Coherent handling of the data across the various data analysis steps is highly important in order to ensure robust and reliable results. Thus, robust data analysis workflows are also described, putting the developed tools and methods into a wider context. The choice of the correct analysis method may also depend on the properties of the specific data set, and therefore guidelines for choosing an optimal method are given. The data analysis tools, methods and workflows developed within this thesis have been applied to several research studies, of which two representative examples are included in the thesis. The first study focuses on spermatogenesis in murine testis and the second one examines cell lineage specification in mouse embryonic stem cells.
Abstract:
Our surrounding landscape is in a constantly dynamic state, but recently the rate of change and its effects on the environment have increased considerably. In terms of the impact on nature, this development has not been entirely positive, but has rather caused a decline in valuable species, habitats, and general biodiversity. Despite recognition of the problem and its high importance, plans and actions for stopping this detrimental development are largely lacking. This partly originates from a lack of genuine will, but is also due to difficulties in detecting many valuable landscape components, and to their consequent neglect. To support knowledge extraction, various digital environmental data sources may be of substantial help, but only if all the relevant background factors are known and the data are processed in a suitable way. This dissertation concentrates on detecting ecologically valuable landscape components by using geospatial data sources, and applies this knowledge to support spatial planning and management activities. In other words, the focus is on observing regionally valuable species, habitats, and biotopes with GIS and remote sensing data, using suitable methods for their analysis. Primary emphasis is given to the hemiboreal vegetation zone and the drastic decline in its semi-natural grasslands, which were created by a long trajectory of traditional grazing and management activities. However, the applied perspective is largely methodological, and allows for the application of the obtained results in various contexts. Models based on statistical dependencies and correlations of multiple variables, which are able to extract desired properties from a large mass of initial data, are emphasised in the dissertation. In addition, the included papers combine several data sets from different sources and dates, with the aim of detecting a wider range of environmental characteristics, as well as pointing out their temporal dynamics. The results of the dissertation emphasise the multidimensionality and dynamics of landscapes, which need to be understood in order to be able to recognise their ecologically valuable components. This requires not only knowledge about the emergence of these components and an understanding of the data used, but also a focus on the minute details that can indicate the existence of fragmented and partly overlapping landscape targets. It also highlights the fact that most existing classifications are too generalised to provide all the required details on their own, although they can be utilised at various steps along a longer processing chain. The dissertation also emphasises landscape history as a factor that both creates and preserves ecological values and that provides an essential standpoint for understanding present landscape characteristics. The obtained results are significant both for the preservation of semi-natural grasslands and for general methodological development, providing support for a science-based framework for evaluating ecological values and guiding spatial planning.
Abstract:
The aim of this thesis is to extend bootstrap theory to panel data models. Panel data are obtained by observing several statistical units over several time periods. Their double dimension, individual and temporal, makes it possible to control for unobservable heterogeneity across individuals and across time periods, and hence to carry out richer studies than with time series or cross-sectional data. The advantage of the bootstrap is that it allows more precise inference than classical asymptotic theory, or inference that is otherwise impossible in the presence of nuisance parameters. The method consists of drawing random samples that resemble the original sample as closely as possible. The statistic of interest is estimated on each of these random samples, and the set of estimated values is used for inference. The literature contains some applications of the bootstrap to panel data, but without rigorous theoretical justification or only under strong assumptions. This thesis proposes a bootstrap method better suited to panel data. Its three chapters analyse the method's validity and its application. The first chapter posits a simple model with a single parameter and addresses the theoretical properties of the estimator of the mean. We show that the double resampling we propose, which takes into account both the individual and the temporal dimension, is valid for these models. Resampling only in the individual dimension is not valid in the presence of temporal heterogeneity, and resampling only in the temporal dimension is not valid in the presence of individual heterogeneity. The second chapter extends the first to the linear panel regression model. Three types of regressors are considered: individual characteristics, temporal characteristics, and regressors that vary across both time and individuals. Using a two-way error components model, the ordinary least squares estimator and the residual bootstrap, we show that resampling in the individual dimension alone is valid for inference on the coefficients associated with regressors that vary only across individuals. Resampling in the temporal dimension alone is valid only for the sub-vector of parameters associated with regressors that vary only over time. Double resampling, for its part, is valid for inference on the entire parameter vector. The third chapter re-examines the difference-in-differences exercise of Bertrand, Duflo and Mullainathan (2004). This estimator is commonly used in the literature to evaluate the impact of public policies. The empirical exercise uses panel data from the Current Population Survey on women's wages in the 50 states of the United States of America from 1979 to 1999. Placebo state-level policy interventions are generated, and the tests are expected to conclude that these placebo policies have no effect on women's wages. Bertrand, Duflo and Mullainathan (2004) show that failing to account for heterogeneity and temporal dependence leads to serious size distortions of the tests when evaluating the impact of public policies with panel data.
One of the recommended solutions is to use the bootstrap. The double resampling method developed in this thesis corrects the test size problem and thus makes it possible to evaluate the impact of public policies correctly.
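A minimal sketch of the double resampling scheme for a balanced panel, applied here to the sample mean, is given below; the data layout and function names are assumptions, and the code is meant only to convey the crossed individual-by-time resampling rather than the thesis's formal procedure.

```python
import numpy as np

def double_bootstrap_means(panel, n_boot=999, seed=None):
    """Double resampling for a balanced N x T panel: draw individuals and
    time periods independently with replacement, then recompute the
    statistic (here the overall mean) on the crossed resample."""
    rng = np.random.default_rng(seed)
    n, t = panel.shape
    draws = np.empty(n_boot)
    for b in range(n_boot):
        ids = rng.integers(0, n, size=n)        # resample in the individual dimension
        periods = rng.integers(0, t, size=t)    # resample in the time dimension
        draws[b] = panel[np.ix_(ids, periods)].mean()
    return draws

# Percentile confidence interval for the panel mean on simulated data.
panel = np.random.default_rng(0).normal(size=(50, 20))
draws = double_bootstrap_means(panel, seed=1)
ci = np.percentile(draws, [2.5, 97.5])
print(ci)
```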
Abstract:
The growth of databases containing increasingly difficult images and an ever larger number of categories is forcing the development of image representation techniques that remain discriminative when working with multiple classes, and of algorithms that are efficient for learning and classification. This thesis explores the problem of classifying images according to the object they contain when a large number of categories is available. We first investigate how a hybrid system, combining a generative model and a discriminative model, can benefit image classification when the level of human annotation is minimal. For this task we introduce a new vocabulary based on a dense representation of color-SIFT descriptors, and then study how the different parameters affect the final classification. We then propose a method for incorporating spatial information into the hybrid system, showing that context information is of great help for image classification. Next, we introduce a new shape descriptor that represents an image by its local shape and its spatial layout, together with a kernel that incorporates this spatial information in a pyramidal fashion. Shape is represented by a compact vector, yielding a descriptor well suited to kernel-based learning algorithms. Our experiments show that this shape information achieves results similar to (and sometimes better than) appearance-based descriptors. We also investigate how different features can be combined for image classification, and show that the proposed shape descriptor together with an appearance descriptor substantially improves classification. Finally, we describe an algorithm that detects regions of interest automatically during both training and classification. This provides a way to suppress the image background and adds invariance to the position of objects within images. We show that using shape and appearance over this region of interest with random forest classifiers improves both classification and computational time. We compare our results with those in the literature, using the same databases as the original authors as well as the same training and classification protocols, and show that all the innovations introduced increase the final classification performance.
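The sketch below illustrates one ingredient mentioned above, a pyramidal spatial encoding of quantized local descriptors, in a generic form; it is not the descriptor or kernel proposed in the thesis, and the grid depth and normalization are assumptions.

```python
import numpy as np

def spatial_pyramid_histogram(points, words, vocab_size, levels=2):
    """Concatenate per-cell visual-word histograms over a pyramid of grids:
    level 0 is the whole image, level L splits it into 2^L x 2^L cells.
    `points` are (x, y) descriptor locations normalized to [0, 1];
    `words` are the corresponding quantized-descriptor (visual word) indices."""
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        col = np.clip((points[:, 0] * cells).astype(int), 0, cells - 1)
        row = np.clip((points[:, 1] * cells).astype(int), 0, cells - 1)
        for cx in range(cells):
            for cy in range(cells):
                in_cell = (col == cx) & (row == cy)
                hist = np.bincount(words[in_cell], minlength=vocab_size).astype(float)
                feats.append(hist / max(1, in_cell.sum()))   # per-cell normalization
    return np.concatenate(feats)

# Toy usage: 500 descriptors quantized against a 100-word vocabulary.
rng = np.random.default_rng(0)
vec = spatial_pyramid_histogram(rng.uniform(size=(500, 2)),
                                rng.integers(0, 100, 500), vocab_size=100)
```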
Abstract:
Novel imaging techniques are playing an increasingly important role in drug development, providing insight into the mechanism of action of new chemical entities. The data sets obtained by these methods can be large with complex inter-relationships, but the most appropriate statistical analysis for handling this data is often uncertain - precisely because of the exploratory nature of the way the data are collected. We present an example from a clinical trial using magnetic resonance imaging to assess changes in atherosclerotic plaques following treatment with a tool compound with established clinical benefit. We compared two specific approaches to handle the correlations due to physical location and repeated measurements: two-level and four-level multilevel models. The two methods identified similar structural variables, but higher level multilevel models had the advantage of explaining a greater proportion of variation, and the modeling assumptions appeared to be better satisfied.
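A simplified two-level version of such a model (repeated plaque measurements nested within patients) might be fitted as below with statsmodels; the data are simulated and the variable names are hypothetical, and the four-level models mentioned above would add further grouping levels not specified here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
patients = np.repeat(np.arange(40), 6)                  # 40 patients x 6 plaque measurements
treated = (patients % 2 == 0).astype(int)               # hypothetical treatment indicator
patient_effect = rng.normal(0.0, 0.5, 40)[patients]     # patient-level random intercept
wall_area = 10.0 - 0.8 * treated + patient_effect + rng.normal(0.0, 0.3, patients.size)

df = pd.DataFrame({"patient": patients, "treated": treated, "wall_area": wall_area})

# Two-level model: measurements (level 1) nested within patients (level 2).
model = smf.mixedlm("wall_area ~ treated", df, groups=df["patient"])
result = model.fit()
print(result.summary())    # fixed treatment effect plus between-patient variance
```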
Abstract:
Advances in hardware and software technology enable us to collect, store and distribute large quantities of data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example medical scientists can use patterns extracted from historic patient data in order to determine if a new patient is likely to respond positively to a particular treatment or not; marketing analysts can use extracted patterns from customer data for future advertisement campaigns; finance experts have an interest in patterns that forecast the development of certain stock market shares for investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes imposes a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.
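As a toy illustration of the scale-up idea described in this chapter, the sketch below partitions transaction data, mines frequent item pairs in each partition in parallel with Python's multiprocessing module, and merges the partial counts; the mining step and the data are deliberately simplistic stand-ins for a real pattern-mining algorithm.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def frequent_pairs(transactions, min_count=2):
    """Count co-occurring item pairs in one data partition (a toy stand-in
    for a local pattern-mining step such as frequent-itemset counting)."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(set(items)), 2))
    return Counter({pair: c for pair, c in counts.items() if c >= min_count})

def mine_in_parallel(partitions, workers=2):
    """Run the mining step on each partition in a separate process and
    merge the partial results into a global count."""
    with Pool(workers) as pool:
        partial = pool.map(frequent_pairs, partitions)
    merged = Counter()
    for counts in partial:
        merged.update(counts)
    return merged

if __name__ == "__main__":
    partitions = [[["milk", "bread"], ["milk", "eggs"], ["bread", "eggs", "milk"]],
                  [["milk", "bread"], ["bread", "eggs"], ["milk", "bread", "eggs"]]]
    print(mine_in_parallel(partitions))
```

The same partition-then-merge pattern carries over to distributed settings such as Grid or Cloud deployments, where the partitions live on different nodes rather than in different processes.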
Abstract:
One of the most pervasive assumptions about human brain evolution is that it involved relative enlargement of the frontal lobes. We show that this assumption is without foundation. Analysis of five independent data sets using correctly scaled measures and phylogenetic methods reveals that the size of human frontal lobes, and of specific frontal regions, is as expected relative to the size of other brain structures. Recent claims for relative enlargement of human frontal white matter volume, and for relative enlargement shared by all great apes, seem to be mistaken. Furthermore, using a recently developed method for detecting shifts in evolutionary rates, we find that the rate of change in relative frontal cortex volume along the phylogenetic branch leading to humans was unremarkable and that other branches showed significantly faster rates of change. Although absolute and proportional frontal region size increased rapidly in humans, this change was tightly correlated with corresponding size increases in other areas and whole brain size, and with decreases in frontal neuron densities. The search for the neural basis of human cognitive uniqueness should therefore focus less on the frontal lobes in isolation and more on distributed neural networks.