971 results for Aggregated data
Abstract:
Since 2006, we have been conducting urban informatics research, which we define as “the study, design, and practice of urban experiences across different urban contexts that are created by new opportunities of real-time, ubiquitous technology and the augmentation that mediates the physical and digital layers of people networks and urban infrastructures” [1]. Various new research initiatives under the label “urban informatics” have since been started by universities (e.g., NYU’s Center for Urban Science and Progress) and industry (e.g., Arup, McKinsey) worldwide. Yet many of these new initiatives are limited to what Townsend calls “data-driven approaches to urban improvement” [2]. One of the key challenges is that no quantity of aggregated data translates directly into quality insights for better understanding cities. In this talk, I will raise questions about the purpose of urban informatics research beyond data, and show examples of media architecture, participatory city making, and citizen activism. I argue for (1) broadening the disciplinary foundations that urban science approaches draw on; (2) maintaining a hybrid perspective that considers both the bird’s eye view and the citizen’s view; and (3) employing design research so as not to be limited to understanding alone, but to bring about actionable knowledge that will drive change for good.
Abstract:
This work compares the forecast efficiency of different methodologies applied to Brazilian consumer price inflation (IPCA). We compare forecasting models using disaggregated and aggregated data over a twelve-month-ahead horizon. The disaggregated models are estimated by SARIMA at different levels of disaggregation. The aggregated models are estimated by time series techniques such as SARIMA, state-space structural models and Markov-switching models. Forecast accuracy is compared using the model selection procedure known as the Model Confidence Set and the Diebold-Mariano procedure. We find evidence of forecast accuracy gains in models using more disaggregated data.
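As an illustration of the Diebold-Mariano comparison mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes squared-error loss and a rectangular-kernel long-run variance with `horizon - 1` autocovariance lags; the function name and the error series are hypothetical.

```python
import math

def diebold_mariano(errors_a, errors_b, horizon=1):
    """Diebold-Mariano statistic under squared-error loss (illustrative sketch).

    Negative values indicate that model A has the smaller average loss.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two error series of equal length >= 2")
    n = len(errors_a)
    # Loss differential between the two competing forecasts.
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    mean_d = sum(d) / n

    def autocov(lag):
        return sum((d[t] - mean_d) * (d[t - lag] - mean_d)
                   for t in range(lag, n)) / n

    # Long-run variance: h-step-ahead errors are assumed correlated up to lag h-1.
    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")
    return mean_d / math.sqrt(long_run_var / n)

# Hypothetical one-step-ahead forecast errors for two models.
errors_disaggregated = [0.1, -0.2, 0.1, 0.3, -0.1]
errors_aggregated = [0.5, -0.6, 0.4, 0.7, -0.5]
stat = diebold_mariano(errors_disaggregated, errors_aggregated)
```

Under the null of equal accuracy the statistic is compared with standard normal critical values; in small samples a corrected version of the test is often preferred.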
Abstract:
The current state of health and biomedicine includes an enormous number of heterogeneous data ‘silos’, collected for different purposes and represented differently, that are presently impossible to share or analyze in toto. The greatest challenge for large-scale and meaningful analyses of health-related data is to achieve a uniform data representation for data extracted from heterogeneous source representations. Based upon an analysis and categorization of heterogeneities, a process for achieving comparable data content by using a uniform terminological representation is developed. This process addresses the types of representational heterogeneities that commonly arise in healthcare data integration problems. Specifically, this process uses a reference terminology and associated “maps” to transform heterogeneous data to a standard representation for comparability and secondary use. Capturing the quality and precision of the “maps” between local terms and reference terminology concepts enhances the meaning of the aggregated data, empowering end users with better-informed queries for subsequent analyses. A data integration case study in the domain of pediatric asthma illustrates the development and use of a reference terminology for creating comparable data from heterogeneous source representations. The contribution of this research is a generalized process for the integration of data from heterogeneous source representations; this process can be applied and extended to other problems where heterogeneous data need to be merged.
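The idea of mapping local terms to a reference terminology while recording map quality can be sketched as follows. This is only an illustration: the source names, local terms, concept identifier and quality labels are all hypothetical and are not taken from the case study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TermMap:
    reference_concept: str  # concept in a hypothetical reference terminology
    match_quality: str      # hypothetical labels: "exact", "broader"

# Hypothetical maps from (source system, local term) to a reference concept.
MAPS = {
    ("clinic_a", "asthma attack"): TermMap("ASTHMA_EXACERBATION", "exact"),
    ("clinic_a", "wheezing episode"): TermMap("ASTHMA_EXACERBATION", "broader"),
    ("clinic_b", "AE-01"): TermMap("ASTHMA_EXACERBATION", "exact"),
}

def to_reference(source, local_term):
    """Return (concept, quality), or (None, 'unmapped') when no map exists."""
    entry = MAPS.get((source, local_term))
    if entry is None:
        return (None, "unmapped")
    return (entry.reference_concept, entry.match_quality)

def aggregate(records, accepted=("exact",)):
    """Count records per reference concept, keeping only accepted map qualities."""
    counts = {}
    for source, term in records:
        concept, quality = to_reference(source, term)
        if concept is not None and quality in accepted:
            counts[concept] = counts.get(concept, 0) + 1
    return counts

records = [
    ("clinic_a", "asthma attack"),
    ("clinic_a", "wheezing episode"),
    ("clinic_b", "AE-01"),
    ("clinic_b", "unknown code"),
]
strict = aggregate(records)
lenient = aggregate(records, accepted=("exact", "broader"))
```

Keeping the map quality alongside each concept lets an end user choose how strict a query over the aggregated data should be.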
Abstract:
Suicide has drawn much attention from both the scientific community and the public. Examining the impact of socio-environmental factors on suicide is essential in developing suicide prevention strategies and interventions, because it will provide health authorities with important information for their decision-making. However, previous studies did not examine the impact of socio-environmental factors on suicide using a spatial analysis approach. The purpose of this study was to identify the patterns of suicide and to examine how socio-environmental factors affect suicide over time and space at the Local Governmental Area (LGA) level in Queensland. The suicide data between 1999 and 2003 were collected from the Australian Bureau of Statistics (ABS). Socio-environmental variables at the LGA level included climate (rainfall, maximum and minimum temperature), Socioeconomic Indexes for Areas (SEIFA) and demographic variables (proportion of Indigenous population, unemployment rate, proportion of population with low income and low education level). Climate data were obtained from the Australian Bureau of Meteorology. SEIFA and demographic variables were acquired from the ABS. A series of statistical and geographical information system (GIS) approaches were applied in the analysis. This study included two stages. The first stage used average annual data to view the spatial pattern of suicide and to examine the association between socio-environmental factors and suicide over space. The second stage examined the spatiotemporal pattern of suicide and assessed the socio-environmental determinants of suicide, using more detailed seasonal data. In this research, 2,445 suicide cases were included, with 1,957 males (80.0%) and 488 females (20.0%). In the first stage, we examined the spatial pattern and the determinants of suicide using 5-year aggregated data. Spearman correlations were used to assess associations between variables.
Then a Poisson regression model was applied in the multivariable analysis, as the occurrence of suicide is a small-probability event and this model fitted the data quite well. Suicide mortality varied across LGAs and was associated with a range of socio-environmental factors. The multivariable analysis showed that maximum temperature was significantly and positively associated with male suicide (relative risk [RR] = 1.03, 95% CI: 1.00 to 1.07). A higher proportion of Indigenous population was associated with more suicide in the male population (RR = 1.02, 95% CI: 1.01 to 1.03). There was a positive association between unemployment rate and suicide in both genders (male: RR = 1.04, 95% CI: 1.02 to 1.06; female: RR = 1.07, 95% CI: 1.00 to 1.16). No significant association was observed for rainfall, minimum temperature, SEIFA, or proportion of population with low individual income and low educational attainment. In the second stage of this study, we undertook a preliminary spatiotemporal analysis of suicide using seasonal data. Firstly, we assessed the interrelations between variables. Secondly, a generalised estimating equations (GEE) model was used to examine the socio-environmental impact on suicide over time and space, as this model is well suited to analyzing repeated longitudinal data (e.g., seasonal suicide mortality in a certain LGA) and it fitted the data better than other models (e.g., the Poisson model). The suicide pattern varied with season and LGA. The north of Queensland had the highest suicide mortality rate in all seasons, while no suicide cases occurred in the southwest. The northwest had consistently higher suicide mortality in spring, autumn and winter. In other areas, suicide mortality varied between seasons. This analysis showed that maximum temperature was positively associated with suicide among the male population (RR = 1.24, 95% CI: 1.04 to 1.47) and the total population (RR = 1.15, 95% CI: 1.00 to 1.32).
A higher proportion of Indigenous population was associated with more suicide among the total population (RR = 1.16, 95% CI: 1.13 to 1.19) and by gender (male: RR = 1.07, 95% CI: 1.01 to 1.13; female: RR = 1.23, 95% CI: 1.03 to 1.48). Unemployment rate was positively associated with total (RR = 1.40, 95% CI: 1.24 to 1.59) and female (RR = 1.09, 95% CI: 1.01 to 1.18) suicide. There was also a positive association between proportion of population with low individual income and suicide in the total (RR = 1.28, 95% CI: 1.10 to 1.48) and male (RR = 1.45, 95% CI: 1.23 to 1.72) populations. Rainfall was positively associated with suicide only in the total population (RR = 1.11, 95% CI: 1.04 to 1.19). There was no significant association for rainfall, minimum temperature, SEIFA, or proportion of population with low educational attainment. The second stage is an extension of the first stage. Different temporal scales of data were used in the two stages (i.e., mean yearly data in the first stage, and seasonal data in the second stage), but the results are generally consistent with each other. Compared with other studies, this research explored how the impact of a wide range of socio-environmental factors on suicide varies across geographical units. Maximum temperature, proportion of Indigenous population, unemployment rate and proportion of population with low individual income were among the major determinants of suicide in Queensland. However, the influence of other factors (e.g., socio-cultural background, alcohol and drug use) on suicide cannot be ignored. An in-depth understanding of these factors is vital in planning and implementing suicide prevention strategies.
Five recommendations for future research are derived from this study: (1) It is vital to acquire detailed personal information on each suicide case and relevant information among the population in assessing the key socio-environmental determinants of suicide; (2) A Bayesian model could be applied to compare mortality rates and their socio-environmental determinants across LGAs in future research; (3) In the LGAs with warm weather, a high proportion of Indigenous population and/or a high unemployment rate, concerted efforts need to be made to control and prevent suicide and other mental health problems; (4) The current surveillance, forecasting and early warning system needs to be strengthened, to trace climate and socioeconomic change over time and space and its impact on population health; (5) It is necessary to evaluate and improve the facilities of mental health care, psychological consultation, and suicide prevention and control programs, especially in areas with low socio-economic status, high unemployment rates, extreme weather events and natural disasters.
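Relative risks such as those reported above are conventionally obtained by exponentiating the coefficients of a log-link count model. The sketch below shows that conversion only; the coefficient and standard error are hypothetical values chosen for illustration, not estimates from the study.

```python
import math

def relative_risk(beta, std_err, z=1.96):
    """Convert a log-link regression coefficient into RR with a Wald 95% CI."""
    rr = math.exp(beta)
    lower = math.exp(beta - z * std_err)
    upper = math.exp(beta + z * std_err)
    return rr, lower, upper

# Hypothetical coefficient for a one-unit rise in unemployment rate.
beta_unemployment = math.log(1.04)
rr, lower, upper = relative_risk(beta_unemployment, std_err=0.01)
```

A confidence interval whose lower bound stays above 1 corresponds to a positive association that is significant at the 5% level.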
Abstract:
This paper investigates the relationship between traffic conditions and crash occurrence likelihood (COL) using the I-880 data. To remedy the data limitations and methodological shortcomings of previous studies, a multiresolution data processing method is proposed and implemented, upon which binary logistic models are developed. The major findings of this paper are: 1) traffic conditions have significant impacts on COL at the study site; specifically, COL in congested (transitioning) traffic flow is about 6 (1.6) times that in free-flow conditions; 2) speed variance alone is not sufficient to capture the impact of traffic dynamics on COL; a traffic chaos indicator that integrates speed, speed variance, and flow is proposed and shows promising performance; 3) models based on aggregated data should be interpreted with caution. Generally, conclusions obtained from such models should not be generalized to individual vehicles (drivers) without further evidence from high-resolution data, and it is dubious to either claim or disclaim that “speed kills” based on aggregated data.
Abstract:
description and analysis of geographically indexed health data with respect to demographic, environmental, behavioural, socioeconomic, genetic, and infectious risk factors (Elliott and Wartenberg 2004). Disease maps can be useful for estimating relative risk; ecological analyses, incorporating area and/or individual-level covariates; or cluster analyses (Lawson 2009). As aggregated data are often more readily available, one common method of mapping disease is to aggregate the counts of disease at some geographical areal level, and present them as choropleth maps (Devesa et al. 1999; Population Health Division 2006). Therefore, this chapter will focus exclusively on methods appropriate for areal data...
Abstract:
In recent years, thanks to developments in information technology, large-dimensional datasets have become increasingly available. Researchers now have access to thousands of economic series, and the information contained in them can be used to create accurate forecasts and to test economic theories. To exploit this large amount of information, researchers and policymakers need an appropriate econometric model. Usual time series models, vector autoregression for example, cannot incorporate more than a few variables. There are two ways to solve this problem: use variable selection procedures or gather the information contained in the series to create an index model. This thesis focuses on one of the most widespread index models, the dynamic factor model (the theory behind this model, based on previous literature, is the core of the first part of this study), and its use in forecasting Finnish macroeconomic indicators (which is the focus of the second part of the thesis). In particular, I forecast economic activity indicators (e.g. GDP) and price indicators (e.g. consumer price index) from three large Finnish datasets. The first dataset contains a large series of aggregated data obtained from the Statistics Finland database. The second dataset is composed of economic indicators from the Bank of Finland. The last dataset is formed of disaggregated data from Statistics Finland, which I call the micro dataset. The forecasts are computed following a two-step procedure: in the first step I estimate a set of common factors from the original dataset. The second step consists in formulating forecasting equations that include the previously extracted factors. The predictions are evaluated using the relative mean squared forecast error, where the benchmark model is a univariate autoregressive model. The results are dataset-dependent.
The forecasts based on factor models are very accurate for the first dataset (the Statistics Finland one), while they are considerably worse for the Bank of Finland dataset. The forecasts derived from the micro dataset are still good, but less accurate than the ones obtained in the first case. This work opens several directions for further research. The results obtained here can be replicated for longer datasets. The non-aggregated data can be represented in an even more disaggregated form (firm level). Finally, the use of the micro data, one of the major contributions of this thesis, can be useful in the imputation of missing values and the creation of flash estimates of macroeconomic indicators (nowcasting).
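The evaluation metric described above, the mean squared forecast error of a model relative to a benchmark, can be sketched as follows. The function names and the numbers are illustrative assumptions, not values from the thesis.

```python
def msfe(actual, forecast):
    """Mean squared forecast error over paired observations."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def relative_msfe(actual, model_forecast, benchmark_forecast):
    """Ratio below 1 means the model beats the benchmark (e.g. a univariate AR)."""
    return msfe(actual, model_forecast) / msfe(actual, benchmark_forecast)

# Hypothetical observed values and two sets of forecasts.
actual = [1.0, 2.0, 3.0]
factor_model = [1.0, 2.0, 2.0]
ar_benchmark = [0.0, 2.0, 1.0]
ratio = relative_msfe(actual, factor_model, ar_benchmark)
```

A ratio below 1 indicates that the factor-based forecast has a lower squared error than the benchmark over the evaluation sample.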
Abstract:
Johnson's SB and the logit-logistic are four-parameter distribution models that may be obtained from the standard normal and logistic distributions by a four-parameter transformation. For relatively small data sets, such as diameter at breast height measurements obtained from typical sample plots, distribution models with four or fewer parameters have been found to be empirically adequate. However, in situations in which the distributions are complex, for example in mixed stands, when the stand has been thinned, or when working with aggregated data, distribution models with more shape parameters may prove to be necessary. By replacing the symmetric standard logistic distribution of the logit-logistic with a one-parameter “standard Richards” distribution and transforming by a five-parameter Richards function, we obtain a new six-parameter distribution model, the “Richit-Richards”. The Richit-Richards includes the “logit-Richards”, the “Richit-logistic”, and the logit-logistic as submodels. Maximum likelihood estimation is used to fit the model, and some problems in the maximum likelihood estimation of bounding parameters are discussed. An empirical case study of the Richit-Richards and its submodels is conducted on pooled diameter at breast height data from 107 sample plots of Chinese fir (Cunninghamia lanceolata (Lamb.) Hook.). It is found that the new models provide significantly better fits than the four-parameter logit-logistic for large data sets.
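To make the four-parameter construction concrete, here is a minimal sketch of a logit-logistic cumulative distribution function. The parameterization is an assumption made for illustration: a lower bound `xi`, a range `lam`, and a logistic location `mu` and scale `sigma` applied to the logit-transformed variable. The abstract does not spell out its notation, so the authors' may differ.

```python
import math

def logit_logistic_cdf(x, xi, lam, mu, sigma):
    """CDF of a bounded variable whose logit-transform is logistic(mu, sigma)."""
    if lam <= 0 or sigma <= 0:
        raise ValueError("lam and sigma must be positive")
    if x <= xi:
        return 0.0
    if x >= xi + lam:
        return 1.0
    u = (x - xi) / lam                # rescale to the unit interval
    y = math.log(u / (1.0 - u))       # logit transform
    z = (y - mu) / sigma
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical diameter bounds of 5 cm to 45 cm.
median_value = logit_logistic_cdf(25.0, xi=5.0, lam=40.0, mu=0.0, sigma=0.5)
```

With `mu = 0` the median sits at the midpoint of the bounded interval; shifting `mu` or changing `sigma` skews and spreads the distribution within the same bounds.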
Abstract:
Wireless Sensor Networks (WSN) are being used for a number of applications involving infrastructure monitoring, building energy monitoring and industrial sensing. The difficulty of programming individual sensor nodes and the associated overhead have encouraged researchers to design macro-programming systems which can help program the network as a whole or as a combination of subnets. Most of the current macro-programming schemes do not support multiple users seamlessly deploying diverse applications on the same shared sensor network. As WSNs are becoming more common, it is important to provide such support, since it enables higher-level optimizations such as code reuse, energy savings, and traffic reduction. In this paper, we propose a macro-programming framework called Nano-CF, which, in addition to supporting in-network programming, allows multiple applications written by different programmers to be executed simultaneously on a sensor networking infrastructure. This framework enables the use of a common sensing infrastructure for a number of applications without the users having to worry about the applications already deployed on the network. The framework also supports timing constraints and resource reservations using the Nano-RK operating system. Nano-CF is efficient at improving WSN performance by (a) combining multiple user programs, (b) aggregating packets for data delivery, and (c) satisfying timing and energy specifications using Rate-Harmonized Scheduling. Using representative applications, we demonstrate that Nano-CF achieves a 90% reduction in Source Lines-of-Code (SLoC) and 50% energy savings from aggregated data delivery.
Abstract:
ABSTRACT - Introduction: Health expenditure has increased considerably over recent decades in most industrialized countries. At the same time, health indicators have improved. The empirical evidence on the relationship between health expenditure and population health has been inconclusive. This study addresses the relationship between health expenditure and population health using aggregated data for 34 countries for the period 1980-2010. Methodology: Pearson's correlation coefficient was used to assess the correlation between the explanatory variables and the health indicators. A multivariate panel-data regression was also carried out for each health indicator used as the dependent variable: life expectancy at birth and at age 65 for women and men, potential years of life lost for women and men, and infant mortality. The main explanatory variable was health expenditure, but several confounding factors were also considered, namely wealth, lifestyle factors, and the supply of care. Results: Per capita expenditure has an impact on the health indicators, but once the GDP per capita variable is added it ceases to be statistically significant. Other factors have a significant impact on almost all the health indicators used: alcohol and tobacco consumption, fat, the number of physicians, and immunization, confirming several results in the literature. Conclusion: The results are in line with some studies that affirm the marginal impact of health expenditure and medical progress on health outcomes since the 1980s in industrialized countries.
Abstract:
The recent surge in the number of centenarians in low-mortality countries is not unrelated to the proliferation of studies on longevity, and more specifically on its determinants and repercussions. While some researchers try to discover the genes that may be responsible for extreme longevity, others examine the social, economic and political impact of population ageing and rising life expectancy, or the existence of a biological limit to human life. In this thesis, we analyze the demographic situation of Quebec centenarians since the beginning of the 20th century using aggregated data (census data, vital statistics, population estimates). Second, we assess the quality of Quebec data at advanced ages using a nominative list of deaths of centenarians from the 1870-1894 cohorts. We are interested, among other things, in mortality trajectories beyond age 100. Finally, we analyze the survival of the brothers, sisters and parents of a sample of semi-supercentenarians (aged 105 and over) born between 1890 and 1900 in order to assess the familial component of longevity. This thesis consists of three articles. In the first, we examine the evolution of the number of centenarians in Quebec since the 1920s. On the basis of demographic indicators such as the centenarian ratio, survival probabilities and the mean maximum age at death, we highlight the remarkable progress made in survival at advanced ages. We also decompose the factors responsible for the increase in the number of centenarians in Quebec. Among the factors identified, the increase in the probability of surviving from age 80 to 100 stands out as the main determinant of the growth in the number of Quebec centenarians.
The second article deals with the validation of ages at death of centenarians from the 1870-1894 cohorts who were of French-Canadian origin and Catholic faith and who were born and died in Quebec. At the end of this validation process, we can state that Quebec data at advanced ages are of excellent quality. The mortality trajectories of centenarians based on the raw data therefore prove representative of reality. The evolution of death probabilities from age 100 onward shows a deceleration of mortality. For men and women alike, death probabilities plateau at around 45%. Finally, in the third article, we focus on the familial component of longevity. We compare the survival of the brothers, sisters and parents of semi-supercentenarians who died between 1995 and 2004 with that of their respective birth cohorts. The differences in survival between the brothers, sisters and parents of the semi-supercentenarians under observation and their “control” generation prove statistically significant at the 0.01% level. Moreover, the brothers, sisters, fathers and mothers of the semi-supercentenarians are between 1.7 (sisters) and 3 times (mothers) more likely to reach age 90 than members of their corresponding birth cohort. Thus, at the end of these analyses, there is no doubt that longevity is concentrated within certain families.
Abstract:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. This work proposes a fully decentralised algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against state-of-the-art distributed K-Means algorithms based on sampling methods. The experimental analysis confirms that the proposed algorithm is a practical and accurate distributed K-Means implementation for networked systems of very large and extreme scale.
Abstract:
The issue of diversification in direct real estate investment portfolios has been widely studied in academic and practitioner literature. Most work, however, has been done using either partially aggregated data or data for small samples of individual properties. This paper reports results from tests of both risk reduction and diversification that use the records of 10,000+ UK properties tracked by Investment Property Databank. It provides, for the first time, robust estimates of the diversification gains attainable given the returns, risks and cross‐correlations across the individual properties available to fund managers. The results quantify the number of assets and amount of money needed to construct both ‘balanced’ and ‘specialist’ property portfolios by direct investment. Target numbers will vary according to the objectives of investors and the degree to which tracking error is tolerated. The top‐level results are consistent with previous work, showing that a large measure of risk reduction can be achieved with portfolios of 30–50 properties, but full diversification of specific risk can only be achieved in very large portfolios. However, the paper extends previous work by demonstrating on a single, large dataset the implications of different methods of calculating risk reduction, and also by showing more disaggregated results relevant to the construction of specialist, sector‐focussed funds.
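The pattern described above, in which most risk reduction arrives with a few dozen properties but specific risk vanishes only in very large portfolios, follows from the standard variance formula for an equally weighted portfolio. The sketch below uses hypothetical average variance and covariance figures, not estimates from the Investment Property Databank records.

```python
def equal_weight_portfolio_variance(n, avg_variance, avg_covariance):
    """Variance of an equally weighted portfolio of n assets.

    sigma_p^2 = avg_variance / n + (1 - 1/n) * avg_covariance
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return avg_variance / n + (1.0 - 1.0 / n) * avg_covariance

# Hypothetical inputs: average return variance 0.04, average covariance 0.01.
single = equal_weight_portfolio_variance(1, 0.04, 0.01)
forty = equal_weight_portfolio_variance(40, 0.04, 0.01)
huge = equal_weight_portfolio_variance(10_000, 0.04, 0.01)
```

The first term (specific risk) shrinks as 1/n, so it falls quickly at first and then slowly, while the covariance term sets the floor that diversification cannot remove.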
Abstract:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against state-of-the-art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
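One way to see how a centroid can be estimated without global communication is pairwise gossip averaging, a common building block of epidemic protocols. The sketch below is not the Epidemic K-Means algorithm itself. It assumes a single one-dimensional cluster, reliable pairwise exchanges, and hypothetical per-node partial sums and counts, and it only shows that every node's local estimate approaches the centroid of the aggregated data.

```python
import random

def gossip_centroid(local_sums, local_counts, rounds=50, seed=0):
    """Estimate a global centroid at every node by pairwise averaging."""
    n = len(local_sums)
    if n < 2 or n != len(local_counts):
        raise ValueError("need matching inputs for at least two nodes")
    if any(c <= 0 for c in local_counts):
        raise ValueError("each node needs a positive local count")
    rng = random.Random(seed)
    sums = [float(s) for s in local_sums]
    counts = [float(c) for c in local_counts]
    for _ in range(rounds):
        for i in range(n):
            # Node i contacts one random peer; both keep the pair average.
            j = rng.choice([k for k in range(n) if k != i])
            sums[i] = sums[j] = (sums[i] + sums[j]) / 2.0
            counts[i] = counts[j] = (counts[i] + counts[j]) / 2.0
    # Each node divides its averaged sum by its averaged count.
    return [s / c for s, c in zip(sums, counts)]

# Hypothetical per-node partial sums and point counts for one cluster.
node_sums = [10.0, 4.0, 30.0, 6.0, 12.0, 8.0]
node_counts = [2, 1, 5, 3, 4, 1]
estimates = gossip_centroid(node_sums, node_counts)
global_centroid = sum(node_sums) / sum(node_counts)
```

Pairwise averaging preserves the network-wide totals, so the ratio held at each node converges to the same value a centralised computation would give. Handling message loss and node failures, which the abstract emphasises, needs additional mechanisms not shown here.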
Abstract:
The domestic (residential) sector accounts for 30% of the world’s energy consumption and hence plays a substantial role in energy management and CO2 emissions reduction efforts. Energy models have generally been developed to mitigate the impact of climate change and to support the sustainable management and planning of energy resources. Although there are different models and model categories, they are generally classified as top-down or bottom-up. Significantly, top-down models are based on aggregated data while bottom-up models are based on disaggregated data. These approaches create fundamental differences which have been the centre of debate since the 1970s. These differences have led to noticeable discrepancies in results, which have led authors to argue that the models are complementary rather than substitutes. As a result, emerging methods suggest a need to integrate the two models (bottom-up and top-down), to combine aspects of two bottom-up models, or to upgrade top-down models to compensate for their documented limitations. Diverse schools of thought argue in favour of these integrations, currently known as hybrid models. In this paper, the complexities of identifying country-specific and/or generic domestic energy models and their applications in different countries are critically reviewed. The review makes evident that most of these methods have been adapted and used in the ‘western world’, with practically no such applications in Africa.