861 results for multiple data sources


Relevance: 100.00%

Abstract:

With the explosive growth in the volume and complexity of document data (e.g., news, blogs, web pages), it has become a necessity to semantically understand documents and deliver meaningful information to users. The areas dealing with these problems cut across data mining, information retrieval, and machine learning. For example, document clustering and summarization are two fundamental techniques for understanding document data that have attracted much attention in recent years. Given a collection of documents, document clustering aims to partition them into different groups to provide efficient document browsing and navigation mechanisms. One open problem in document clustering is how to generate a meaningful interpretation for each document cluster resulting from the clustering process. Document summarization is another effective technique for document understanding, which generates a summary by selecting sentences that deliver the major or topic-relevant information in the original documents. How to improve automatic summarization performance and how to apply it to newly emerging problems are two valuable research directions. To help people capture the semantics of documents effectively and efficiently, this dissertation focuses on developing effective data mining and machine learning algorithms and systems for (1) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretations, (2) improving document summarization performance and building document understanding systems to solve real-world applications, and (3) summarizing the differences and evolution of multiple document sources.
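A minimal sketch of the cluster-then-interpret idea in this abstract, assuming a toy corpus: documents are partitioned with k-means over TF-IDF vectors, and each cluster is interpreted by the sentence closest to its centroid. This is an illustrative pipeline, not the dissertation's actual algorithms.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for a real document collection.
docs = [
    "Stocks fell sharply. Markets reacted to the rate decision.",
    "The central bank raised rates. Investors sold stocks.",
    "The team won the final. Fans celebrated the victory.",
    "A late goal decided the match. The team lifted the trophy.",
]
k = 2  # assumed number of clusters

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

for c in range(k):
    # Gather the sentences of this cluster's documents.
    sentences = [s for d, l in zip(docs, labels) if l == c
                 for s in d.split(". ") if s]
    S = vec.transform(sentences)
    centroid = np.asarray(S.mean(axis=0))
    # The sentence most similar to the centroid "interprets" the cluster.
    scores = cosine_similarity(S, centroid).ravel()
    print(f"cluster {c}: {sentences[scores.argmax()]}")
```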

Relevance: 100.00%

Abstract:

The generation of heterogeneous big data sources with ever-increasing volume, velocity, and veracity over the last few years has inspired the data science and research community to address the challenge of extracting knowledge from big data. Such a wealth of generated data across the board can be intelligently exploited to advance our knowledge about our environment, public health, critical infrastructure, and security. In recent years we have developed generic approaches to process such big data at multiple levels to advance decision support. This specifically concerns data processing with semantic harmonisation, low-level fusion, analytics, and knowledge modelling with high-level fusion and reasoning. Such approaches are introduced and presented in the context of the TRIDEC project's results on critical oil and gas industry drilling operations and of the ongoing large eVacuate project on critical crowd behaviour detection in confined spaces.

Relevance: 100.00%

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08

Relevance: 100.00%

Abstract:

The last decades have been characterized by the continuous adoption of IT solutions in the healthcare sector, which has resulted in the proliferation of tremendous amounts of data over heterogeneous systems. Distinct data types are currently generated, manipulated, and stored in the various institutions where patients are treated. Sharing these data and providing integrated access to this information will allow relevant knowledge to be extracted, which can lead to better diagnoses and treatments. This thesis proposes new integration models for gathering information and extracting knowledge from multiple and heterogeneous biomedical sources. The complexity of the scenario led us to split the integration problem according to data type and usage specificity. The first contribution is a cloud-based architecture for exchanging medical imaging services. It offers a simplified registration mechanism for providers and services, promotes remote data access, and facilitates the integration of distributed data sources. Moreover, it is compliant with international standards, ensuring the platform's interoperability with current medical imaging devices. The second proposal is a sensor-based architecture for the integration of electronic health records. It follows a federated integration model and aims to provide a scalable solution for searching and retrieving data from multiple information systems. The last contribution is an open architecture for gathering patient-level data from dispersed and heterogeneous databases. All the proposed solutions were deployed and validated in real-world use cases.

Relevance: 100.00%

Abstract:

This qualitative case study explored three teacher candidates' learning and enactment of discourse-focused mathematics teaching practices. Using audio and video recordings of their teaching practice, this study aimed to identify shifts in the way the teacher candidates enacted the following discourse practices: eliciting and using evidence of student thinking, posing purposeful questions, and facilitating meaningful mathematical discourse. The teacher candidates' written reflections from their practice-based coursework, as well as interviews, were examined to see how two mathematics methods courses influenced their learning and enactment of the three discourse-focused mathematics teaching practices. These data sources were also used to identify tensions the teacher candidates encountered. All three candidates in the study were able to successfully enact and reflect on these discourse-focused mathematics teaching practices at various time points in their preparation programs. Consistency of use and areas of improvement differed, however, depending on the tensions experienced by each candidate. Access to quality curriculum materials, as well as time to formulate and enact thoughtful lesson plans that supported classroom discourse, were tensions for these teacher candidates. This study shows that teacher candidates are capable of enacting discourse-focused teaching practices early in their field placements and that, with the support of practice-based coursework, they can analyze and reflect on their practice for improvement. This study also reveals the importance of assisting teacher candidates in accessing rich mathematical tasks and collaborating during lesson planning. More research is needed to identify how specific aspects of the learning cycle impact individual teachers and how this can be used to improve practice-based teacher education courses.

Relevance: 100.00%

Abstract:

Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands on, and threats to, forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and, as a consequence, spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest-based k-Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for part of the Lake States and species-specific site index maps for the entire Lake States region. Targeting small-area applications of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing-volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.
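A hedged sketch of the RF-kNN idea, with synthetic arrays standing in for spectral predictors and field-plot biomass (the study's actual data and pipeline differ): random forest proximities, i.e. the fraction of trees in which two samples share a leaf, define each target pixel's k nearest reference plots, whose mean attribute is imputed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(100, 5))                  # e.g., spectral bands at field plots
y_ref = X_ref[:, 0] * 10 + rng.normal(size=100)    # e.g., biomass measured at plots
X_tgt = rng.normal(size=(20, 5))                   # pixels to impute

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_ref, y_ref)
leaves_ref = rf.apply(X_ref)   # (n_ref, n_trees) leaf indices per sample
leaves_tgt = rf.apply(X_tgt)

# Proximity = share of trees in which a target pixel and a reference plot
# fall in the same leaf; the k most proximate plots donate their mean value.
k = 5
proximity = (leaves_tgt[:, None, :] == leaves_ref[None, :, :]).mean(axis=2)
nearest = np.argsort(-proximity, axis=1)[:, :k]
y_imputed = y_ref[nearest].mean(axis=1)
print(y_imputed)
```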

Relevance: 90.00%

Abstract:

Objective: To illustrate methodological issues involved in estimating dietary trends in populations using data obtained from various sources in Australia in the 1980s and 1990s. Methods: Estimates of absolute and relative change in consumption of selected food items were calculated using national data published annually on the national food supply for 1982-83 to 1992-93 and responses to food frequency questions in two population-based risk factor surveys in 1983 and 1994 in the Hunter Region of New South Wales, Australia. The validity of estimated food quantities obtained from these inexpensive sources at the beginning of the period was assessed by comparison with data from a national dietary survey conducted in 1983 using 24-hour recall. Results: Trend estimates from the food supply data and the risk factor survey data were in good agreement for increases in consumption of fresh fruit, vegetables, and breakfast food and decreases in butter, margarine, sugar, and alcohol. Estimates for trends in milk, egg, and bread consumption, however, were inconsistent. Conclusions: Both data sources can be used for monitoring progress towards national nutrition goals based on selected food items, provided that some limitations are recognized. While data collection methods should be consistent over time, they also need to allow for changes in the food supply (for example, the introduction of new varieties such as low-fat dairy products). From time to time, the trends derived from these inexpensive data sources should be compared with more detailed and quantitative estimates of dietary intake.
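The trend estimates rest on simple arithmetic: absolute and relative change between two time points, computed separately for each data source and then compared. A sketch with invented consumption figures:

```python
# Absolute and relative change in apparent consumption between two time
# points. All quantities below are invented for illustration only.
baseline = {"fresh fruit": 75.0, "butter": 4.1}   # kg/person/yr, start of period
followup = {"fresh fruit": 89.0, "butter": 2.6}   # kg/person/yr, end of period

for item in baseline:
    absolute = followup[item] - baseline[item]
    relative = 100.0 * absolute / baseline[item]
    print(f"{item}: {absolute:+.1f} kg ({relative:+.0f}%)")
```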

Relevance: 90.00%

Abstract:

There are two main types of data sources of income distributions in China: household survey data and grouped data. Household survey data are typically available for isolated years and individual provinces. In comparison, aggregate or grouped data are typically available more frequently and usually have national coverage. In principle, grouped data allow investigation of the change of inequality over longer, continuous periods of time, and the identification of patterns of inequality across broader regions. Nevertheless, a major limitation of grouped data is that only mean (average) income and income shares of quintile or decile groups of the population are reported. Directly using grouped data reported in this format is equivalent to assuming that all individuals in a quintile or decile group have the same income. This potentially distorts the estimate of inequality within each region. The aim of this paper is to apply an improved econometric method designed to use grouped data to study income inequality in China. A generalized beta distribution is employed to model income inequality in China at various levels and periods of time. The generalized beta distribution is more general and flexible than the lognormal distribution that has been used in past research, and also relaxes the assumption of a uniform distribution of income within quintile and decile groups of populations. The paper studies the nature and extent of inequality in rural and urban China over the period 1978 to 2002. Income inequality in the whole of China is then modeled using a mixture of province-specific distributions. The estimated results are used to study the trends in national inequality, and to discuss the empirical findings in the light of economic reforms, regional policies, and globalization of the Chinese economy.
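To illustrate grouped-data fitting, the sketch below uses the lognormal as a simplified stand-in for the paper's more flexible generalized beta distribution: the lognormal's closed-form Lorenz curve L(p) = Phi(Phi^{-1}(p) - sigma) is fitted to invented quintile income shares, and the Gini coefficient follows analytically as 2*Phi(sigma/sqrt(2)) - 1.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

shares = np.array([0.06, 0.11, 0.16, 0.23, 0.44])   # assumed quintile shares
p = np.array([0.2, 0.4, 0.6, 0.8, 1.0])             # cumulative population shares

def implied_shares(sigma):
    # Lorenz ordinates of a lognormal; differencing gives quintile shares.
    L = norm.cdf(norm.ppf(p) - sigma)
    return np.diff(np.concatenate(([0.0], L)))

res = minimize_scalar(lambda s: ((implied_shares(s) - shares) ** 2).sum(),
                      bounds=(0.01, 3.0), method="bounded")
sigma = res.x
gini = 2 * norm.cdf(sigma / np.sqrt(2)) - 1
print(f"sigma = {sigma:.3f}, Gini = {gini:.3f}")
```

A generalized beta fit would replace `implied_shares` with the GB2 Lorenz curve but follow the same grouped-data logic.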

Relevance: 90.00%

Abstract:

A data warehouse is a data repository that collects and maintains a large amount of data from multiple distributed, autonomous, and possibly heterogeneous data sources. Often the data is stored in the form of materialized views in order to provide fast access to the integrated data. One of the most important decisions in designing a data warehouse is the selection of views for materialization. The objective is to select an appropriate set of views that minimizes the total query response time under the constraint that the total maintenance time for these materialized views is within a given bound. This view selection problem is fundamentally different from the view selection problem under a disk space constraint. In this paper, the view selection problem under the maintenance time constraint is investigated. Two efficient heuristic algorithms for the problem are proposed. The key to devising the proposed algorithms is to define good heuristic functions and to reduce the problem to well-solved optimization problems, so that an approximate solution to the known optimization problem gives a feasible solution to the original problem.
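A minimal greedy sketch in the spirit of such heuristics (the paper's algorithms are more refined): repeatedly pick the view with the best ratio of query-time saving to maintenance cost until the maintenance-time budget is exhausted. All names and numbers are illustrative.

```python
# view -> (query-time saving, maintenance time); figures are invented.
views = {
    "v_sales_by_day":   (120.0, 30.0),
    "v_sales_by_store": ( 90.0, 15.0),
    "v_inventory":      ( 60.0, 40.0),
    "v_returns":        ( 25.0,  5.0),
}
budget = 50.0   # total maintenance time allowed (assumed bound)

selected, remaining = [], dict(views)
while remaining:
    # Greedy choice: best saving per unit of maintenance cost.
    best = max(remaining, key=lambda v: remaining[v][0] / remaining[v][1])
    saving, cost = remaining.pop(best)
    if cost <= budget:
        selected.append(best)
        budget -= cost

print(selected)   # ['v_sales_by_store', 'v_returns', 'v_sales_by_day']
```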

Relevance: 90.00%

Abstract:

The main objective of this dissertation is to estimate the carbon emissions resulting from the activities of Monteiro, Ribas - Embalagens Flexíveis, S.A. Carrying out a greenhouse gas inventory allows Monteiro, Ribas - Embalagens Flexíveis, S.A. to identify its emission sources and quantify its greenhouse gas emissions, making it possible to devise strategies for reducing them. The inventory was based on the Greenhouse Gas Protocol guidelines, following the principles of relevance, completeness, consistency, transparency, and accuracy. The adopted methodology uses documented emission factors to calculate greenhouse gas (GHG) emissions. These factors are ratios relating GHG emissions to activity data specific to each emission source. As direct emissions (scope 1), the emissions from the use of natural gas in boilers, from steam and hot water consumption, and from the company's commercial vehicle were quantified. Scope 2 indirect emissions comprise those resulting from electricity consumption. The estimated scope 3 indirect emissions refer, in this case study, to waste transport, employee commuting, and business travel. Given the types of emissions identified, a calculation tool was created containing all the emission factor values that can be used according to the specific characteristics of the activity data for the company's various emission sources. This tool will make it possible, in the future, to refine the emission calculations through better systematization of the available information. This work also identified the need to collect and organize some information complementary to what already exists. The base year considered was 2011. The results show that, in that year, the activities of Monteiro, Ribas - Embalagens Flexíveis, S.A. were responsible for the emission of 2968.6 tonnes of CO2e (carbon dioxide equivalent). According to Commission Decision 2007/589/EC of 18 July 2007, Monteiro, Ribas - Embalagens Flexíveis, S.A. falls into the category of installations with low emission levels, since its average annual emissions are below 25,000 tonnes of CO2e. The largest share of the estimated emissions (50.7%) comes from electricity consumption (indirect emissions, scope 2), followed by natural gas consumption (direct emissions), which accounts for 39.4% of emissions. Relating these results to the total production of Monteiro, Ribas - Embalagens Flexíveis, S.A. in 2011 yields 0.65 kg of CO2e per kilogram of final product. Some of the identified emission sources were not incorporated into the company's inventory, namely the transport of raw materials and products, because it was not possible to compile the necessary information in time. Although these are scope 3 indirect emissions, considered optional, it is recommended that they be quantified in future work of this kind. The main uncertainties associated with the emission estimates concern the activity data, since this was the first time the company carried out a greenhouse gas inventory. The company holds more specific information on its activity data that it could, in the future, systematize in a way better suited to this purpose.
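A minimal sketch of the emission-factor method the inventory describes, with invented activity data and factors (not the company's actual figures): each source's activity amount is multiplied by a documented emission factor and totals are aggregated by scope.

```python
# (description, scope, activity amount, unit, emission factor kgCO2e/unit)
# All quantities and factors below are illustrative assumptions.
sources = [
    ("natural gas in boilers", 1, 500_000, "kWh", 0.184),
    ("company vehicle",        1,  12_000, "km",  0.171),
    ("purchased electricity",  2, 900_000, "kWh", 0.250),
    ("employee commuting",     3, 150_000, "km",  0.105),
]

totals = {1: 0.0, 2: 0.0, 3: 0.0}
for desc, scope, amount, unit, factor in sources:
    totals[scope] += amount * factor / 1000.0   # convert kg to tonnes CO2e

for scope, t in totals.items():
    print(f"Scope {scope}: {t:,.1f} t CO2e")
print(f"Total: {sum(totals.values()):,.1f} t CO2e")
```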

Relevance: 90.00%

Abstract:

Semantic Web technologies such as RDF, OWL, and SPARQL have seen strong growth and acceptance in recent years. Projects such as DBPedia and Open Street Map are beginning to show the true potential of Linked Open Data. However, semantic search engines still lag behind this surge of semantic technologies. The available solutions rely mostly on natural language processing resources. Powerful Semantic Web tools such as ontologies, inference engines, and semantic query languages are not yet common. In addition to this reality, there are certain difficulties in implementing a semantic search engine. As demonstrated in this dissertation, a federated architecture is necessary to exploit the full potential of Linked Open Data. However, a federated system in this environment presents performance problems that must be solved through cooperation between data sources. The current standard query language for the Semantic Web, SPARQL, offers no mechanism for cooperation between data sources. This dissertation proposes a federated architecture with mechanisms that enable cooperation between data sources. It addresses the performance problem by proposing a centrally managed index, as well as mappings between the data models of each data source. The proposed architecture is modular, allowing repositories and features to grow simply and in a decentralized way, in the spirit of Linked Open Data and the World Wide Web itself. The architecture handles both natural language term searches and formal queries in SPARQL. However, the repositories considered contain only data in RDF format. This dissertation builds on multiple shared and interlinked ontologies.
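A minimal sketch of the cooperation mechanism described here, under assumptions: a centrally managed term index maps search terms to the endpoints likely to hold matching RDF data, so a query is dispatched only to relevant sources. The index contents and the example.org endpoint are hypothetical; only DBpedia's endpoint is real.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Centrally managed index: term -> candidate endpoints (assumed contents).
term_index = {
    "river": ["https://dbpedia.org/sparql"],
    "street": ["https://dbpedia.org/sparql", "http://example.org/osm/sparql"],
}

def federated_search(term, limit=5):
    query = f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?label WHERE {{
            ?s rdfs:label ?label .
            FILTER(CONTAINS(LCASE(STR(?label)), "{term}"))
        }} LIMIT {limit}"""
    results = []
    for endpoint in term_index.get(term, []):   # dispatch to relevant sources only
        sw = SPARQLWrapper(endpoint)
        sw.setQuery(query)
        sw.setReturnFormat(JSON)
        try:
            results += sw.query().convert()["results"]["bindings"]
        except Exception:
            pass   # one failing source should not break the federation
    return results

print(len(federated_search("river")))
```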

Relevance: 90.00%

Abstract:

Global scientific experiments with collaborative scenarios involving multinational teams face big challenges related to data access: moving data to other regions or clouds is precluded by constraints on latency costs, data privacy, and data ownership. Furthermore, each site processes local data sets using specialized algorithms and produces intermediate results that are helpful as inputs to applications running on remote sites. This paper shows how to model such collaborative scenarios as a scientific workflow implemented with AWARD (Autonomic Workflow Activities Reconfigurable and Dynamic), a decentralized framework offering a feasible solution to run the workflow activities on distributed data centers in different regions without the need for large data movements. The AWARD workflow activities are independently monitored and can be dynamically reconfigured and steered by different users, namely by hot-swapping the algorithms to enhance the computation results or by changing the workflow structure to support feedback dependencies, where an activity receives feedback output from a successor activity. A real implementation of one practical scenario and its execution on multiple data centers of the Amazon Cloud is presented, including experimental results with steering by multiple users.
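A minimal sketch of the hot-swapping idea, not the AWARD framework's actual API: a workflow activity holds a replaceable reference to its algorithm, which a steering user can swap between invocations without stopping the workflow.

```python
import threading

class Activity:
    """Hypothetical workflow activity with a hot-swappable algorithm."""

    def __init__(self, algorithm):
        self._algorithm = algorithm
        self._lock = threading.Lock()

    def hot_swap(self, new_algorithm):
        # Steering operation: replace the algorithm without stopping work.
        with self._lock:
            self._algorithm = new_algorithm

    def process(self, data):
        with self._lock:
            return self._algorithm(data)

coarse = lambda xs: sum(xs) / len(xs)          # initial algorithm: mean
refined = lambda xs: sorted(xs)[len(xs) // 2]  # improved replacement: median

activity = Activity(coarse)
print(activity.process([1, 2, 100]))   # 34.33... with the coarse algorithm
activity.hot_swap(refined)             # a user steers the running activity
print(activity.process([1, 2, 100]))   # 2 with the refined algorithm
```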

Relevance: 90.00%

Abstract:

Work presented in the context of the Master's programme in Informatics Engineering, as a partial requirement for obtaining the degree of Master in Informatics Engineering.

Relevance: 90.00%

Abstract:

Feature selection is a central problem in machine learning and pattern recognition. On large datasets (in terms of dimension and/or number of instances), using search-based or wrapper techniques can be computationally prohibitive. Moreover, many filter methods based on relevance/redundancy assessment also take a prohibitively long time on high-dimensional datasets. In this paper, we propose efficient unsupervised and supervised feature selection/ranking filters for high-dimensional datasets. These methods use low-complexity relevance and redundancy criteria, applicable to supervised, semi-supervised, and unsupervised learning, and are able to act as pre-processors for computationally intensive methods, focusing their attention on smaller subsets of promising features. The experimental results, with up to 10^5 features, show the time efficiency of our methods, with lower generalization error than state-of-the-art techniques, while being dramatically simpler and faster.
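A hedged sketch of a low-complexity relevance/redundancy filter in this spirit (not the paper's exact criteria): features are ranked by absolute correlation with the target, and a feature is kept only if its correlation with every already-kept feature stays below a redundancy threshold. Data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)   # near-duplicate of feature 0
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=500)

# Relevance: absolute correlation of each feature with the target.
relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
order = np.argsort(-relevance)                    # most relevant first

# Redundancy: drop a feature too correlated with any already-kept feature.
selected, threshold = [], 0.9
for j in order:
    if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < threshold
           for k in selected):
        selected.append(j)

print(selected[:5])   # one of the near-duplicates (0, 1) is dropped as redundant
```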

Relevance: 90.00%

Abstract:

This dissertation addresses the construction of a data warehouse for AdClick, a company operating in digital marketing. Digital marketing uses digital communication channels for the same purpose as the traditional method: promoting goods, businesses, and services and attracting new customers. There are several digital marketing strategies for achieving these goals, most notably organic traffic and paid traffic. Organic traffic is characterized by marketing actions that involve no costs for promotion and/or the acquisition of potential customers, whereas paid traffic requires investment in campaigns capable of driving and attracting new customers. The dissertation first reviews the state of the art in business intelligence and data warehousing and presents their main advantages for companies. Business intelligence systems are necessary because companies today hold large volumes of information-rich data that can only be properly exploited using the capabilities of these systems. Accordingly, the first step in developing a business intelligence system is to concentrate all data in a single integrated system capable of supporting decision-making, and this is where the data warehouse comes in as the single, ideal system for such requirements. This dissertation surveys the data sources that will feed the data warehouse and contextualizes the company's existing business processes. The data warehouse was then built: the dimensions and fact tables were created, the processes for extracting and loading data into the data warehouse were defined, and the various views were created. Regarding the impact of this dissertation, the partner company gains several business-level advantages from the implementation of the data warehouse and of the ETL processes that load all the information sources, including the centralization of information, greater flexibility for managers in how they access information, and data processing that makes it possible to extract information from the data.
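A minimal sketch of the kind of star-schema load and reporting view described above, with hypothetical table names and columns (not AdClick's actual schema): source records are loaded into a campaign dimension and a visits fact table, and a view joins them for reporting on paid versus organic traffic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_campaign (campaign_id INTEGER PRIMARY KEY,
                               name TEXT, traffic_type TEXT);
    CREATE TABLE fact_visits  (campaign_id INTEGER, day TEXT,
                               visits INTEGER, cost REAL);
    -- Reporting view: cost per visit by campaign and traffic type.
    CREATE VIEW v_cost_per_visit AS
        SELECT d.name, d.traffic_type, SUM(f.cost) / SUM(f.visits) AS cpv
        FROM fact_visits f JOIN dim_campaign d USING (campaign_id)
        GROUP BY d.name, d.traffic_type;
""")

# ETL load step: rows extracted from (hypothetical) source systems.
conn.executemany("INSERT INTO dim_campaign VALUES (?, ?, ?)",
                 [(1, "spring_promo", "paid"), (2, "blog_seo", "organic")])
conn.executemany("INSERT INTO fact_visits VALUES (?, ?, ?, ?)",
                 [(1, "2014-03-01", 1200, 300.0), (2, "2014-03-01", 800, 0.0)])

for row in conn.execute("SELECT * FROM v_cost_per_visit"):
    print(row)   # paid traffic has a cost per visit; organic traffic is 0
```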