972 results for Data Warehouse Hadoop Spark GMQL HDFS YARN MapReduce genomics bioinformatics functional dependencies


Relevance:

40.00%

Publisher:

Abstract:

A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content, and only the informational content is compressed. Thus, the compressed data remains transparent to existing software libraries, which often rely on functional content to work. Secondly, a context-free, bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two-layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
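As a rough illustration of how compressed text can be made transparent to a Hadoop job, the sketch below registers a codec and enables compression of intermediate output; the example.capc.CaPCCodec class name is hypothetical, standing in for the thesis's codec, while the configuration keys and Job API calls are standard Hadoop 2.x.

```java
// Minimal sketch: wiring a splittable compression codec into a Hadoop job.
// "example.capc.CaPCCodec" is a hypothetical class name standing in for the
// thesis's codec; everything else is the standard Hadoop 2.x Job/Configuration API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedTextJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the (hypothetical) codec so compressed input splits are read transparently.
        conf.set("io.compression.codecs",
                 "org.apache.hadoop.io.compress.DefaultCodec,example.capc.CaPCCodec");
        // Compress intermediate map output to cut shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "compressed-text-analysis");
        job.setJarByClass(CompressedTextJob.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Compress the final output as well; a specific codec class (e.g. the
        // hypothetical CaPC codec) could be set with setOutputCompressorClass.
        FileOutputFormat.setCompressOutput(job, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```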

Relevance:

40.00%

Publisher:

Abstract:

Cloud computing offers the massive scalability and elasticity required by many scientific and commercial applications. Combining the computational and data handling capabilities of clouds with parallel processing also has the potential to tackle Big Data problems efficiently. Science gateway frameworks and workflow systems enable application developers to implement complex applications and make these available for end-users via simple graphical user interfaces. The integration of such frameworks with Big Data processing tools on the cloud opens new opportunities for application developers. This paper investigates how workflow systems and science gateways can be extended with Big Data processing capabilities. A generic approach based on infrastructure-aware workflows is suggested and a proof of concept is implemented based on the WS-PGRADE/gUSE science gateway framework and its integration with the Hadoop parallel data processing solution based on the MapReduce paradigm in the cloud. The provided analysis demonstrates that the methods described to integrate Big Data processing with workflows and science gateways work well in different cloud infrastructures and application scenarios, and can be used to create massively parallel applications for scientific analysis of Big Data.

Relevance:

30.00%

Publisher:

Abstract:

Integrated master's dissertation in Industrial Engineering and Management

Relevance:

30.00%

Publisher:

Abstract:

As processing nodes offer ever more computing power, more and more data-intensive applications, such as bioinformatics applications, will be run on non-dedicated clusters. Non-dedicated clusters are characterized by their ability to combine the execution of local users' applications with scientific or commercial applications executed in parallel. Knowing what effect data-intensive applications have when mixed with other workload types (batch, interactive, SRT, etc.) in non-dedicated environments enables the development of more efficient scheduling policies. Some I/O-intensive applications are based on the MapReduce paradigm; the environments that run them, such as Hadoop, take care of data locality and load balancing automatically and work with distributed file systems. Hadoop's performance can be improved without increasing hardware costs by tuning several key configuration parameters to the cluster's specifications, the size of the input data, and the complexity of the processing. Tuning these parameters can be too complex for the user and/or administrator, but it aims to guarantee more adequate performance. This work proposes to evaluate the impact of I/O-intensive applications on job scheduling in non-dedicated clusters under the MPI and MapReduce paradigms.
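Purely as an illustration of the kind of tuning discussed above, the sketch below sets a few standard Hadoop 2.x configuration keys from Java; the parameter names are real Hadoop keys, but the values are placeholders and are not taken from the study.

```java
// Minimal sketch of Hadoop parameter tuning of the kind the abstract describes.
// The keys are standard Hadoop 2.x configuration names; the values are illustrative
// placeholders, not settings from the work above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobFactory {
    public static Job newTunedJob(String name) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);           // map-side sort buffer
        conf.setInt("mapreduce.map.memory.mb", 2048);            // container size per map task
        conf.setInt("mapreduce.reduce.memory.mb", 4096);         // container size per reduce task
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                     256L * 1024 * 1024);                        // split size matched to input size
        conf.setBoolean("mapreduce.map.output.compress", true);  // reduce shuffle I/O on shared nodes
        return Job.getInstance(conf, name);
    }
}
```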

Relevance:

30.00%

Publisher:

Abstract:

The success of the Human Genome Project (HGP) in 2000 brought "personalized medicine" closer to reality. The HGP's discoveries have simplified sequencing techniques to such an extent that today anyone can obtain their complete DNA sequence. Read mapping technology stands out among these techniques and is characterized by handling very large amounts of data. Hadoop, the Apache framework for data-intensive applications under the MapReduce paradigm, is a perfect ally for this kind of technology and was the option chosen for this project. Throughout the work, the study, analysis, and experimentation needed to arrive at an innovative Genetic Algorithm that exploits the full potential of Hadoop are carried out.
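The abstract does not detail the algorithm itself, so the following is only a generic sketch of how one genetic-algorithm generation can be expressed as a MapReduce step on Hadoop; the candidate encoding and the toy fitness function are assumptions made for illustration, not the project's actual method.

```java
// Generic sketch of one GA generation as a MapReduce step.
// The fitness function and candidate encoding are placeholders.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GaGeneration {
    // Map: each input line encodes one candidate solution; emit its fitness as the key.
    public static class FitnessMapper extends Mapper<LongWritable, Text, DoubleWritable, Text> {
        @Override
        protected void map(LongWritable key, Text candidate, Context ctx)
                throws IOException, InterruptedException {
            double fitness = evaluate(candidate.toString());
            ctx.write(new DoubleWritable(fitness), candidate);
        }
        private double evaluate(String chromosome) {
            // Toy "one-max" fitness: count the 1-bits in a binary-string chromosome.
            return chromosome.chars().filter(c -> c == '1').count();
        }
    }

    // Reduce: candidates arrive grouped by fitness; selection/crossover would happen here.
    public static class SelectionReducer extends Reducer<DoubleWritable, Text, DoubleWritable, Text> {
        @Override
        protected void reduce(DoubleWritable fitness, Iterable<Text> candidates, Context ctx)
                throws IOException, InterruptedException {
            for (Text c : candidates) {
                ctx.write(fitness, c); // pass survivors on to seed the next generation
            }
        }
    }
}
```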

Relevance:

30.00%

Publisher:

Abstract:

A growing number of scientific applications, for example in bioinformatics or the geosciences, are written under the MapReduce model using open-source tools such as Apache Hadoop. This project arises from the need to integrate Hadoop into HPC environments in order to make it possible to run applications developed under the MapReduce paradigm. Two frameworks designed to make this integration easier for developers are analyzed: HoD and myHadoop. The project analyzes both the execution environments these frameworks offer for running MapReduce applications and the performance of Hadoop clusters created with HoD or myHadoop compared with a physical Hadoop cluster.

Relevance:

30.00%

Publisher:

Abstract:

An incredible volume of data of different types and from a multitude of sources is generated nowadays. Distributed storage and processing systems are the technological elements that make it possible to capture this flood of data and extract value from it through various analyses. Hadoop, which integrates a distributed storage and processing system, has become the de facto standard for applications that need a large storage capacity, even on the order of tens of PBs. In this work we study Hadoop, analyze the efficiency of its durability mechanism, and propose an alternative to it.
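As context for the durability analysis mentioned above, the sketch below shows Hadoop's standard durability mechanism, HDFS block replication, being adjusted through the public FileSystem API; the path and replication values are illustrative, and the thesis's proposed alternative is not reproduced here.

```java
// Baseline sketch of HDFS's standard durability knob (block replication),
// the mechanism such a study would compare its alternative against.
// The path and values are illustrative; the API calls are standard Hadoop FileSystem methods.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);               // default replication for newly written files
        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of an existing (hypothetical) dataset path.
        fs.setReplication(new Path("/data/events"), (short) 5);
        fs.close();
    }
}
```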

Relevance:

30.00%

Publisher:

Abstract:

Given the strong current interest in deploying clusters dedicated to data processing with Hadoop, a Linux distribution has been designed that automates all the associated tasks. This distribution makes it possible to deploy over a cluster and perform a basic configuration of it in as unattended a way as possible.

Relevance:

30.00%

Publisher:

Abstract:

The aim of the thesis was to find out whether the decision to outsource part of the Filtronic LK warehouse function has been profitable. A further aim was to describe the current logistics processes between the third-party logistics provider (TPLP) and the company and to identify targets for developing these processes. The decision to outsource part of the logistical functions has been profitable during the first business year. A partnership always involves business risks, and highly asset-specific investments increase that risk. On the other hand, investing in the partnership increases mutual trust and commitment between the parties. By developing the partnership, risks and opportunistic behaviour can be reduced. The potential for managing material and data flows between the logistics service provider and the company was observed. The analysis of inventory efficiency highlighted the need to reduce the capital invested in inventories. Recommendations for managing the outsourced logistical functions were established, such as improving the partnership, process development, performance measurement, and invoice checking.

Relevance:

30.00%

Publisher:

Abstract:

After decades of mergers and acquisitions and successive technology trends such as CRM, ERP and DW, the data in enterprise systems is scattered and inconsistent. Global organizations face the challenge of addressing local uses of shared business entities, such as customer and material, while maintaining a consistent, unique, and consolidated view of financial indicators. In addition, current enterprise systems do not accommodate the pace of organizational change, and immense efforts are required to maintain data. When it comes to systems integration, ERPs are considered “closed” and expensive. Data structures are complex, and the “out-of-the-box” integration options offered are not based on industry standards. Therefore, expensive and time-consuming projects are undertaken in order to have the required data flowing according to business process needs. Master Data Management (MDM) emerges as a discipline focused on ensuring long-term data consistency. Presented as a technology-enabled business discipline, it emphasizes business processes and governance to model and maintain the data related to key business entities. There are immense technical and organizational challenges in accomplishing the “single version of the truth” MDM mantra. Adding one central repository of master data might prove unfeasible in some scenarios, so an incremental approach is recommended, starting from the areas most critically affected by data issues. This research aims at understanding the current literature on MDM and contrasting it with views from professionals. The data collected from interviews revealed details on the complexities of data structures and data management practices in global organizations, reinforcing the call for more in-depth research on the organizational aspects of MDM. The most difficult piece of master data to manage is the “local” part: the attributes related to the sourcing and storing of materials in one particular warehouse in the Netherlands, or a complex set of pricing rules for a subsidiary of a customer in Brazil. From a practical perspective, this research evaluates one MDM solution under development at a Finnish IT solution provider. By applying an existing assessment method, the research attempts to provide the company with one possible tool to evaluate its product from a vendor-agnostic perspective.

Relevance:

30.00%

Publisher:

Abstract:

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as in commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism also does not scale well to large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study showing that Parallel Random Prism scales well with a large number of training examples, a large number of data features, and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.
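The following is a structural sketch, not the authors' code, of how such an ensemble maps onto MapReduce: each mapper induces one base rule set from its portion of the training data and a reducer collects the rule sets into the ensemble. PrismClassifier is a hypothetical placeholder standing in for a real Prism implementation.

```java
// Structural sketch of a Random-Prism-style ensemble on MapReduce.
// "PrismClassifier" is a hypothetical placeholder, not the authors' implementation.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ParallelRandomPrismSketch {

    public static class BaseLearnerMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final StringBuilder sample = new StringBuilder();

        @Override
        protected void map(LongWritable key, Text example, Context ctx) {
            sample.append(example).append('\n'); // buffer this mapper's share of the training data
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            // Hypothetical base learner: induce modular rules from the local sample.
            String ruleSet = PrismClassifier.train(sample.toString());
            ctx.write(NullWritable.get(), new Text(ruleSet));
        }
    }

    public static class EnsembleReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> ruleSets, Context ctx)
                throws IOException, InterruptedException {
            for (Text rules : ruleSets) {
                ctx.write(key, rules); // the ensemble is the collection of base rule sets
            }
        }
    }

    // Placeholder standing in for a real Prism rule-induction implementation.
    static class PrismClassifier {
        static String train(String csvExamples) {
            return "IF ... THEN class = ..."; // rules would be induced here
        }
    }
}
```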

Relevance:

30.00%

Publisher:

Abstract:

The objective of the present article is to assess and compare the performance of electricity generation systems integrated with downdraft biomass gasifiers for distributed power generation. A model for estimating the electric power generation of internal combustion engines and gas turbines powered by syngas was developed. First, the model determines the syngas composition and the lower heating value; second, these data are used to evaluate power generation in Otto, Diesel, and Brayton cycles. Four synthesis gas compositions were tested for gasification with: air; pure oxygen; 60% oxygen with 40% steam; and 60% air with 40% steam. The results show a maximum power ratio of 0.567 kWh/Nm³ for the gas turbine system, 0.647 kWh/Nm³ for the compression ignition engine (CIE), and 0.775 kWh/Nm³ for the spark-ignition engine (SIE) while running on synthesis gas produced using pure oxygen as the gasification agent. When these three systems run on synthesis gas produced using atmospheric air as the gasification agent, the maximum power ratios were 0.274 kWh/Nm³ for the gas turbine system, 0.302 kWh/Nm³ for the CIE, and 0.282 kWh/Nm³ for the SIE. The relationship between power output and synthesis gas flow variations is presented, as is the dependence of efficiency on compression ratio. Since the maximum attainable power ratio of the CIE is higher than that of the SIE for gasification with air, more research should be performed on the utilization of synthesis gas in CIE. (C) 2014 Elsevier Ltd. All rights reserved.
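For orientation, the reported specific power ratios scale linearly with the syngas volume flow; a minimal worked example, in which the 0.775 kWh/Nm³ figure comes from the abstract but the 100 Nm³/h flow is an assumed value:

```latex
% Illustrative only: the 0.775 kWh/Nm^3 power ratio is quoted in the abstract;
% the 100 Nm^3/h syngas flow is an assumed example value, not a result of the study.
P_{\mathrm{el}} \;=\; r \,\dot{V}_{\mathrm{syngas}}
  \;=\; 0.775\,\frac{\mathrm{kWh}}{\mathrm{Nm^3}} \times 100\,\frac{\mathrm{Nm^3}}{\mathrm{h}}
  \;=\; 77.5~\mathrm{kW}
```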

Relevance:

30.00%

Publisher:

Abstract:

In supply chain management there are several risk factors that must be mitigated to increase the flow of production; as a possible solution, the literature cites the implementation of a warehouse management system (WMS), but this subject is little explored. The main objective of this thesis is to study the implementation of a warehouse management system in a company from the automotive sector that produces clutches. As results, the thesis presents data characterizing the items; data and comparisons of production disruptions caused by lack of material before and after the WMS implementation; and the results of a questionnaire applied to those involved in the implementation of the system. These results were related to the risk factors in system implementation identified in the literature review, and the results not associated with any previously studied factor were enumerated. Finally, the study is concluded and future studies related to the theme are recommended.

Relevance:

30.00%

Publisher:

Abstract:

The spark plasma sintering (SPS) technique, using a compacting pressure of 50 MPa, was used to consolidate pre-reacted powders of Bi1.65Pb0.35Sr2Ca2Cu3O10+δ (Bi-2223). The influence of the consolidation temperature, T_D, on the structural and electrical properties has been investigated and compared with those of a reference sample synthesized by the traditional solid-state reaction method and subjected to the same compacting pressure. From the X-ray diffraction patterns, performed on both powder and pellet samples, we have found that the dominant phase is Bi-2223 in all samples, but traces of Bi2Sr2CaCu2O8+x (Bi-2212) were identified. Their relative densities were ~85% of the theoretical density, and the temperature dependence of the electrical resistivity, ρ(T), indicated that increasing T_D results in samples with lower oxygen content because the SPS is performed in vacuum. Features of the ρ(T) data, such as the occurrence of normal-state semiconductor-like behavior and the double resistive superconducting transition, are consistent with samples composed of grains with a shell-core morphology in which the shell is oxygen deficient. The SPS samples also exhibited a superconducting critical current density at 77 K, Jc(77 K), between 2 and 10 A/cm², values much smaller than the ~22 A/cm² measured in the reference sample. Reoxygenation of the SPS samples, post-annealed in air at different temperatures and times, was found to improve their microstructural and transport properties. Besides the suppression of the Bragg peaks belonging to the Bi-2212 phase, the superconducting properties of the post-annealed samples, and particularly Jc(77 K), were comparable to or better than those of the reference sample. Samples post-annealed at 750 °C for 5 min exhibited Jc(77 K) of ~130 A/cm² even when uniaxially pressed at only 50 MPa. (C) 2012 American Institute of Physics. [http://dx.doi.org/10.1063/1.4768257]