2 results for Big data, Spark, Hadoop

in DigitalCommons@University of Nebraska - Lincoln


Relevance:

100.00%

Publisher:

Abstract:

Hundreds of terabytes of CMS (Compact Muon Solenoid) data are being accumulated day by day in storage at the University of Nebraska-Lincoln, which is one of the eight US CMS Tier-2 sites. Managing this data includes retaining useful CMS data sets and clearing storage space for newly arriving data by deleting less useful data sets. This important task is currently done manually and requires a large amount of time. The overall objective of this study was to develop a methodology to help identify the data sets to be deleted when storage space is required. CMS data is stored using HDFS (Hadoop Distributed File System), and HDFS logs give information regarding file access operations. Hadoop MapReduce was used to feed the information in these logs to Support Vector Machines (SVMs), a machine learning algorithm applicable to classification and regression, which is used in this thesis to develop a classifier. The time taken to classify data sets with this method depends on the size of the input HDFS log file, since the Hadoop MapReduce algorithms used here have O(n) complexity. The SVM methodology produces a list of data sets for deletion along with their respective sizes. This methodology was also compared with a heuristic called Retention Cost, which was calculated from the size of a data set and the time since its last access to help decide how useful the data set is. The accuracy of each method was measured by calculating the percentage of data sets predicted for deletion that were accessed at a later time. Our methodology using SVMs proved to be more accurate than the Retention Cost heuristic. This methodology could be used to solve similar problems involving other large data sets.
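The Retention Cost heuristic and the accuracy measure described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual formula, so the product of size and idle time used below is an assumption, as are all names (`DataSet`, `retention_cost`, `select_for_deletion`, `later_access_rate`) and the sample values.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    size_tb: float                 # size of the data set in terabytes
    days_since_last_access: float  # idle time taken from the access logs


def retention_cost(ds: DataSet) -> float:
    # Assumption: the abstract only says the heuristic uses size and time
    # since last access; a simple product is used here for illustration.
    return ds.size_tb * ds.days_since_last_access


def select_for_deletion(datasets, space_needed_tb):
    # Pick the data sets with the highest retention cost first,
    # stopping once enough space would be freed.
    chosen, freed = [], 0.0
    for ds in sorted(datasets, key=retention_cost, reverse=True):
        if freed >= space_needed_tb:
            break
        chosen.append(ds)
        freed += ds.size_tb
    return chosen


def later_access_rate(predicted, accessed_later):
    # Percentage of data sets predicted for deletion that were accessed
    # afterwards; a lower value means a better prediction.
    if not predicted:
        return 0.0
    hits = sum(1 for ds in predicted if ds.name in accessed_later)
    return 100.0 * hits / len(predicted)


# Hypothetical sample data, not taken from the thesis.
datasets = [
    DataSet("A", 10.0, 100.0),
    DataSet("B", 5.0, 10.0),
    DataSet("C", 20.0, 30.0),
]
chosen = select_for_deletion(datasets, space_needed_tb=25.0)
rate = later_access_rate(chosen, accessed_later={"C"})
```

The same `later_access_rate` function could score the output of any classifier, which is how the two approaches would be put on a common footing in a comparison of this kind.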

Relevance:

30.00%

Publisher:

Abstract:

Townsend’s big-eared bat, Corynorhinus townsendii, is distributed broadly across western North America and in two isolated, endangered populations in the central and eastern United States. There are five subspecies of C. townsendii: C. t. pallescens, C. t. australis, C. t. townsendii, C. t. ingens, and C. t. virginianus, with varying degrees of concern over the conservation status of each. The aim of this study was to use mitochondrial and microsatellite DNA data to examine genetic diversity, population differentiation, and dispersal of three C. townsendii subspecies. C. t. virginianus is found in isolated populations in the eastern United States and was listed as endangered under the Endangered Species Act in 1979. Concern also exists about declining populations of two western subspecies, C. t. pallescens and C. t. townsendii. Using a comparative approach, estimates of the genetic diversity within populations of the endangered subspecies, C. t. virginianus, were found to be significantly lower than within populations of the two western subspecies. Further, both classes of molecular markers revealed significant differentiation among regional populations of C. t. virginianus, with most genetic diversity distributed among populations. Genetic diversity was not significantly different between C. t. townsendii and C. t. pallescens. Some populations of C. t. townsendii are not genetically differentiated from populations of C. t. pallescens in areas of sympatry. For the western subspecies, gene flow appears to occur primarily through male dispersal. Finally, geographic regions representing significantly differentiated and genetically unique populations of C. t. virginianus are recognized as distinct evolutionarily significant units.