938 results for: alta risoluzione Trentino Alto Adige data-set climatologia temperatura giornaliera orografia complessa


Relevance: 100.00%

Abstract:

We estimate the effect of employment density on wages in Sweden using a large geocoded data set on individuals and workplaces. Employment density is measured in four circular zones around each individual's place of residence. The data set contains a rich set of control variables that we use in an instrumental variables framework. Results show a relatively strong but rather local positive effect of employment density on wages; beyond 5 kilometers the effect becomes negative. This might indicate that the effect of agglomeration economies falls off with distance faster than the effect of congestion does.
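The estimation framework described above can be written schematically. This is my paraphrase of a zone-based density-wage regression, not the authors' exact specification; all symbols are illustrative:

```latex
% Schematic wage equation (notation mine, not the authors' exact model):
%   w_i    : wage of individual i
%   D_{ik} : employment density in the k-th circular zone around
%            individual i's place of residence (k = 1, ..., 4)
%   X_i    : control variables; the density terms are instrumented
\ln w_i \;=\; \alpha \;+\; \sum_{k=1}^{4} \beta_k \ln D_{ik} \;+\; X_i'\gamma \;+\; \varepsilon_i
```

The local-then-negative pattern reported corresponds to positive estimates of the inner-zone coefficients and negative estimates for zones beyond 5 kilometers.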

Relevance: 100.00%

Abstract:

This paper reviews the appropriateness for application to large data sets of standard machine learning algorithms, which were mainly developed in the context of small data sets. Sampling and parallelisation have proved useful means of reducing computation time when learning from large data sets. However, such methods assume that algorithms designed for what are now considered small data sets remain fundamentally suitable for large ones. It is plausible that optimal learning from large data sets requires a different type of algorithm from optimal learning from small data sets. This paper investigates one respect in which data set size may affect the requirements of a learning algorithm: the bias plus variance decomposition of classification error. Experiments show that learning from large data sets may be more effective when using an algorithm that places greater emphasis on bias management rather than variance management.
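The bias plus variance decomposition the paper refers to can be illustrated with a toy experiment. The sketch below is my own minimal example, not the paper's experimental setup: it contrasts a high-bias, zero-variance learner with a low-bias, higher-variance one by retraining each on many random training sets.

```python
import random
from collections import Counter

random.seed(0)

def constant_classifier(train):
    """High-bias, zero-variance learner: ignores the data, predicts class 0."""
    return lambda x: 0

def one_nn_classifier(train):
    """Low-bias, higher-variance learner: 1-nearest-neighbour in one dimension."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def bias_variance(learner, make_train, test, runs=50):
    """Estimate bias and variance of classification error over repeated
    training sets (central-tendency style decomposition)."""
    preds = {x: [] for x, _ in test}
    for _ in range(runs):
        clf = learner(make_train())
        for x, _ in test:
            preds[x].append(clf(x))
    bias = variance = 0.0
    for x, y in test:
        main = Counter(preds[x]).most_common(1)[0][0]   # most frequent prediction
        bias += (main != y)                             # systematic error
        variance += sum(p != main for p in preds[x]) / len(preds[x])
    return bias / len(test), variance / len(test)

# Toy problem: the true label is 1 exactly when x > 0.
def make_train(n=20):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return [(x, int(x > 0)) for x in xs]

test = [(x / 10, int(x > 0)) for x in range(-10, 11)]
b_const, v_const = bias_variance(constant_classifier, make_train, test)
b_nn, v_nn = bias_variance(one_nn_classifier, make_train, test)
```

The constant learner's error is entirely bias; the nearest-neighbour learner trades bias for variance near the decision boundary, which is the kind of trade-off the paper argues shifts in importance as data set size grows.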

Relevance: 100.00%

Abstract:

One common drawback of algorithms for learning linear causal models is that they cannot deal with incomplete data sets. This is unfortunate, since many real problems involve missing data or even hidden variables. In this paper, based on multiple imputation, we propose a three-step process to learn linear causal models from incomplete data sets. Experimental results indicate that this algorithm is better than the single-imputation method (EM algorithm) and the simple list-deletion method, and that for lower missing rates it can even find models better than those obtained by the greedy learning algorithm MLGS working on the complete data set. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
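The three-step, multiple-imputation process can be sketched generically. The example below is a hedged illustration only: a simple regression-based imputer and a pooled slope estimate stand in for the paper's linear-causal-model learner, and all names and data are mine.

```python
import random

random.seed(1)

def fit_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def make_imputer(xs, ys):
    """Fit x ~ y on the complete cases and return a stochastic draw
    for a missing x given its observed y."""
    cc = [(x, y) for x, y in zip(xs, ys) if x is not None]
    cx = [p[0] for p in cc]
    cy = [p[1] for p in cc]
    d = fit_slope(cy, cx)
    c = sum(cx) / len(cx) - d * sum(cy) / len(cy)
    sd = (sum((x - (c + d * y)) ** 2 for x, y in cc) / len(cc)) ** 0.5
    return lambda y: c + d * y + random.gauss(0, sd)

def mi_slope(xs, ys, m=20):
    """Step 1: create m completed data sets by stochastic imputation.
    Step 2: fit the model on each. Step 3: pool by averaging the estimates."""
    draw = make_imputer(xs, ys)
    slopes = []
    for _ in range(m):
        completed = [x if x is not None else draw(y) for x, y in zip(xs, ys)]
        slopes.append(fit_slope(completed, ys))
    return sum(slopes) / m

# Toy data: y is roughly 2x, with about 30% of the x values missing at random.
xs = [i / 10 for i in range(50)]
ys = [2 * x + random.gauss(0, 0.1) for x in xs]
xs = [x if random.random() > 0.3 else None for x in xs]
slope = mi_slope(xs, ys)
```

Because each of the m model fits is independent once the imputer is built, step 2 is where the parallel/distributed processing mentioned in the abstract naturally applies.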

Relevance: 100.00%

Abstract:

We study cross-country differences in rural and urban educational attainment by using a data set comprising 56 countries. We focus on the determinants of rural–urban educational inequality, which is measured by the ratio of rural to urban average years of schooling within each country. We find that riskier human capital investment, less credit availability, a colonial heritage, a legal system of French origin and landlockedness of nations are all associated with relatively lower rural educational levels and greater rural–urban educational inequality. Conversely, larger formal labor markets, better infrastructure and a legal system of British origin are associated with relatively higher rural educational levels and lower rural–urban educational inequality. We also identify an interaction effect between economic development level and some of these factors. In particular, we find that as development level increases, the negative (positive) relationship between French (British) legal systems and rural–urban educational inequality is reversed and becomes positive (negative).


Relevance: 100.00%

Abstract:

Research into the prevalence of hospitalisation among childhood asthma cases is undertaken, using a data set local to the Barwon region of Victoria. Participants were parents/guardians responding on behalf of children aged 5–11 years. Various data mining techniques are used, including segmentation, association and classification, to assist in predicting and exploring the instances of childhood hospitalisation due to asthma. Results from this study indicate that children in inner-city and metropolitan areas may overutilise emergency department services. In addition, this study found that the predicted likelihood of hospitalisation for asthma in children was greater for those with a written asthma management plan.

Relevance: 100.00%

Abstract:

One of the fundamental machine learning tasks is that of predictive classification. Given that organisations collect an ever increasing amount of data, predictive classification methods must be able to effectively and efficiently handle large amounts of data. However, it is understood that present requirements push existing algorithms to, and sometimes beyond, their limits since many classification prediction algorithms were designed when currently common data set sizes were beyond imagination. This has led to a significant amount of research into ways of making classification learning algorithms more effective and efficient. Although substantial progress has been made, a number of key questions have not been answered. This dissertation investigates two of these key questions. The first is whether different types of algorithms to those currently employed are required when using large data sets. This is answered by analysis of the way in which the bias plus variance decomposition of predictive classification error changes as training set size is increased. Experiments find that larger training sets require different types of algorithms to those currently used. Some insight into the characteristics of suitable algorithms is provided, and this may provide some direction for the development of future classification prediction algorithms which are specifically designed for use with large data sets. The second question investigated is that of the role of sampling in machine learning with large data sets. Sampling has long been used as a means of avoiding the need to scale up algorithms to suit the size of the data set by scaling down the size of the data sets to suit the algorithm. However, the costs of performing sampling have not been widely explored. Two popular sampling methods are compared with learning from all available data in terms of predictive accuracy, model complexity, and execution time. 
The comparison shows that sub-sampling generally produces models with accuracy close to, and sometimes greater than, that obtainable from learning with all available data. This result suggests that it may be possible to develop algorithms that take advantage of the sub-sampling methodology to reduce the time required to infer a model while sacrificing little if any accuracy. Methods of improving effective and efficient learning via sampling are also investigated, and new sampling methodologies are proposed. These methodologies include using a varying proportion of instances to determine the next inference step and using a statistical calculation at each inference step to determine sufficient sample size. Experiments show that using a statistical calculation of sample size can not only substantially reduce execution time but can do so with only a small loss, and occasional gain, in accuracy. One of the common uses of sampling is in the construction of learning curves. Learning curves are often used to attempt to determine the optimal training size which will maximally reduce execution time while not being detrimental to accuracy. An analysis of the performance of methods for detection of convergence of learning curves is performed, with the focus of the analysis on methods that calculate the gradient of the tangent to the curve. Given that such methods can be susceptible to local accuracy plateaus, an investigation into the frequency of local plateaus is also performed. It is shown that local accuracy plateaus are a common occurrence, and that ensuring a small loss of accuracy often results in greater computational cost than learning from all available data. These results cast doubt over the applicability of gradient-of-tangent methods for detecting convergence, and over the viability of learning curves for reducing execution time in general.
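The gradient-of-tangent convergence test discussed above can be sketched as follows. The learning curve here is synthetic and deliberately contains a local accuracy plateau, to show how such a test can stop too early; all numbers are illustrative, not the dissertation's data.

```python
def tangent_gradient(sizes, accs, window=3):
    """Least-squares slope of the last `window` points of the learning curve."""
    xs, ys = sizes[-window:], accs[-window:]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def converged(sizes, accs, tol=1e-5, window=3):
    """Declare convergence when the tangent's gradient is (near) flat."""
    return len(sizes) >= window and abs(tangent_gradient(sizes, accs, window)) < tol

# Synthetic learning curve with a local accuracy plateau at sizes 800-1600,
# after which accuracy resumes climbing.
sizes = [100, 200, 400, 800, 1600, 3200, 6400]
accs = [0.70, 0.80, 0.85, 0.851, 0.851, 0.90, 0.92]

first_stop = next(n for i, n in enumerate(sizes)
                  if converged(sizes[:i + 1], accs[:i + 1]))
# The gradient test halts on the local plateau, sacrificing the later gains.
```

Here the test stops at training size 1600, forfeiting the accuracy still available at larger sizes, which is exactly the failure mode the dissertation reports for plateau-prone curves.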

Relevance: 100.00%

Abstract:

The human immunodeficiency virus–acquired immune deficiency syndrome (HIV–AIDS) epidemic in Hong Kong has been under surveillance in the form of voluntary reporting since 1984. However, there has been little discussion or research on the reconstruction of the HIV incidence curve. This paper is the first to use a modified back-projection method to estimate the incidence of HIV in Hong Kong on the basis of the number of positive HIV tests only. The model proposed has several advantages over the original back-projection method based on AIDS data only. First, not all HIV-infected individuals will develop AIDS by the time of analysis, but some of them may undertake an HIV test; therefore, the HIV data set contains more information than the AIDS data set. Second, the HIV diagnosis curve usually has a smoother pattern than the AIDS diagnosis curve, as it is not affected by redefinition of AIDS. Third, the time to positive HIV diagnosis is unlikely to be affected by treatment effects, as it is unlikely that an individual receives medication before the diagnosis of HIV. Fourth, the induction period from HIV infection to the first HIV positive test is usually shorter than the incubation period which is from HIV infection to diagnosis of AIDS. With a shorter induction period, more information becomes available for estimating the HIV incidence curve. Finally, this method requires the number of positive HIV diagnoses only, which is readily available from HIV–AIDS surveillance systems in many countries. It is estimated that, in Hong Kong, the cumulative number of HIV infections during the period 1979–2000 is about 2600, whereas an estimate based only on AIDS data seems to give an underestimate.
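The back-projection relationship underlying the method can be written schematically. The notation below is mine, and this is the generic convolution form; the paper's modification is to drive it with positive HIV tests rather than AIDS diagnoses:

```latex
% Schematic back-projection relation (notation mine):
%   D_t : number of positive HIV diagnoses in period t
%   H_s : number of new HIV infections in period s
%   f_d : probability that the induction period (infection to first
%         positive test) equals d
\mathbb{E}[D_t] \;=\; \sum_{s=1}^{t} H_s \, f_{t-s}
```

The incidence series H is then recovered by deconvolution (for example with an EM-type algorithm), and the shorter induction period, relative to the AIDS incubation period, is what gives the HIV-based estimate its additional information.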

Relevance: 100.00%

Abstract:

Results generated by simulation of computer systems are often presented as a multi-dimensional data set, where the number of dimensions may be greater than 4 if sufficient system parameters are modelled. This paper describes a visualization system intended to assist in understanding how the different values of the system parameters relate to, and affect, system behavior.

The system is applied to data that cannot be represented using a mesh or isosurface representation, and in general can only be represented as a cloud of points. The use of stereoscopic rendering and rapid interaction with the data are compared with regard to their value in providing insight into the nature of the data.

A number of techniques are implemented for displaying projections of the data set with up to 7 dimensions, and for allowing intuitive manipulation of the remaining dimensions. In this way the effect of changes in one variable in the presence of a number of others can be explored.

The use of these techniques, when applied to data from computer system simulation, results in an intuitive understanding of the effects of the system parameters on system behavior.

Relevance: 100.00%

Abstract:

This article investigates the long-run relationship between labour productivity and employment, and between labour productivity and real wages, in the Indian manufacturing sector. The panel data set consists of 17 two-digit manufacturing industries for the period 1973–1974 to 1999–2001. We find that productivity–wages and productivity–employment are panel-cointegrated for all industries, and that both employment and real wages exert a positive effect on labour productivity. We argue that a flexible labour market has a significant influence on manufacturing productivity, employment and real wages in the case of Indian manufacturing.

Relevance: 100.00%

Abstract:

This report presents information on disability services collected from over 9,000 service outlets throughout Australia, which are funded under an agreement between the Australian and state/territory governments. These services aim to improve the quality of life of people with disability by providing support and assistance across a range of life activities. The report profiles the people with disability who use the services, the types of services they use and the supports they need (including information on their informal carers). Most information presented in this report is derived from the 2005–06 Commonwealth State/Territory Disability Agreement National Minimum Data Set (CSTDA NMDS) collection.

Relevance: 100.00%

Abstract:

One of the issues associated with pattern classification using data-based machine learning systems is the “curse of dimensionality”. In this paper, the circle-segments method is proposed as a feature selection method to identify important input features before the entire data set is provided for learning with machine learning systems. Specifically, four machine learning systems are deployed for classification, viz. Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). The integration between the circle-segments method and the machine learning systems has been applied to two case studies comprising one benchmark data set and one real data set. Overall, the results after feature selection using the circle-segments method demonstrate improvements in performance even with more than 50% of the input features eliminated from the original data sets.
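The overall pipeline of selecting features before classification can be sketched as below. The circle-segments method itself is a visual technique and is not reproduced here; a simple correlation-based ranking stands in for it (my own stand-in, clearly not the paper's method), feeding a plain kNN classifier.

```python
import random

random.seed(2)

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def select_features(X, y, keep):
    """Rank features by |correlation with the class label| and keep the top few.
    (A stand-in ranking; circle-segments is a visual relevance-judging method.)"""
    scores = [(abs(correlation([row[j] for row in X], y)), j)
              for j in range(len(X[0]))]
    return sorted(j for _, j in sorted(scores, reverse=True)[:keep])

def knn_predict(train_X, train_y, x, k=3):
    """Plain k-nearest-neighbour vote using squared Euclidean distance."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in nearest[:k]]
    return max(set(votes), key=votes.count)

# Toy data: 2 informative features plus 4 pure-noise features.
X, y = [], []
for _ in range(200):
    label = random.randint(0, 1)
    row = [random.gauss(label, 0.5), random.gauss(-label, 0.5)]
    row += [random.gauss(0, 1) for _ in range(4)]
    X.append(row)
    y.append(label)

kept = select_features(X, y, keep=2)            # expect the informative pair
reduced = [[row[j] for j in kept] for row in X]
acc = sum(knn_predict(reduced[:150], y[:150], reduced[i]) == y[i]
          for i in range(150, 200)) / 50
```

Dropping the four noise features before classification mirrors the abstract's observation that performance can improve even after eliminating more than half of the inputs.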

Relevance: 100.00%

Abstract:

Recent investigations have determined that many Android applications in both official and non-official online markets expose details of the user's mobile phone without user consent. In this paper, for the first time in the research literature, we provide a full investigation of why such applications leak, how they leak and where the data is leaked to. In order to achieve this, we employ a combination of static and dynamic analysis based on examination of Java classes and application behaviour for a data set of 123 samples, all pre-determined as being free from malicious software. Despite the fact that anti-virus vendor software did not flag any of these samples as malware, approximately 10% of them are shown to leak data about the mobile phone to a third party; applications from the official market appear to be just as susceptible to such leaks as applications from the non-official markets.

Relevance: 100.00%

Abstract:

Differential privacy is a strong definition for protecting individual privacy in data releasing and mining. However, it is a rigid definition that introduces a large amount of noise into the original dataset, which significantly decreases the quality of data mining results. How to design a suitable data releasing algorithm for data mining purposes has recently become a hot research area. In this paper, we propose a differentially private data releasing algorithm for decision tree construction. The proposed algorithm provides a non-interactive data releasing method through which a miner can obtain the complete dataset for data mining purposes. With a given privacy budget, the proposed algorithm generalizes the original dataset and then specializes it under a differential privacy constraint to construct decision trees. As the novel scheme-selection operation can fully utilize the allocated privacy budget, the data set released by the proposed algorithm can yield better decision tree models than other methods. Experimental results demonstrate that the proposed algorithm outperforms existing methods for private decision tree construction.
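The privacy-budget accounting described above rests on standard noise mechanisms such as the Laplace mechanism. The sketch below shows generic differential-privacy machinery only, not the paper's specific release algorithm; the record schema and predicate are illustrative.

```python
import math
import random

random.seed(3)

def laplace_noise(scale):
    """Draw from Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Release a count under epsilon-differential privacy.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# A private decision-tree builder spends its overall privacy budget on
# noisy counts like this one at every split it evaluates, which is why
# budget allocation across operations matters so much.
records = [{"age": random.randint(18, 80)} for _ in range(1000)]
noisy = private_count(records, lambda r: r["age"] > 40, epsilon=1.0)
```

Smaller epsilon per query means more noise, so an algorithm that wastes budget on unhelpful operations degrades its released data more than one that spends it carefully.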

Relevance: 100.00%

Abstract:

Recent studies have determined that many Android applications in both official and non-official online markets expose details of the users' smartphones without user consent. In this paper, we explain why such applications leak, how they leak and where the data is leaked to. In order to achieve this, we combine static and dynamic analysis to examine Java classes and application behaviour for a set of popular, clean applications from the Finance and Games categories. We observed that all the applications in our data set which leaked information (10%) had third-party advertising libraries embedded in their respective Java packages.

Relevance: 100.00%

Abstract:

Supervisory Control and Data Acquisition (SCADA) systems control and monitor industrial and critical infrastructure functions, such as electricity, gas, water, waste, railway, and traffic. Recent attacks on SCADA systems highlight the need for stronger SCADA security, and sharing SCADA traffic data has thus become a vital requirement for analyzing security risks and developing appropriate security solutions. However, inappropriate sharing and usage of SCADA data could threaten the privacy of companies and discourage sharing altogether. In this paper, we present a privacy-preserving, strategy-based permutation technique called the PPFSCADA framework, in which data privacy, statistical properties and data mining utility can be controlled at the same time. In particular, our proposed approach involves: (i) vertically partitioning the original data set to improve the performance of perturbation; (ii) developing a framework to deal with various types of network traffic data, including numerical, categorical and hierarchical attributes; (iii) grouping the partitioned sets into a number of clusters based on the proposed framework; and (iv) perturbing the data by replacing each original attribute value with a new value (its cluster's centroid). The effectiveness of the proposed PPFSCADA framework is shown through several experiments on simulated SCADA, intrusion detection and network traffic data sets. Through experimental analysis, we show that PPFSCADA effectively deals with multivariate traffic attributes, produces results comparable to those obtained from the original data, substantially improves the performance of the five supervised approaches, and provides a high level of privacy protection. © 2014 Published by Elsevier B.V. All rights reserved.
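Step (iv), replacing each original attribute value with its cluster's centroid, can be sketched with plain k-means. This is a minimal stand-in on numeric data only; the paper's framework additionally handles categorical and hierarchical attributes, and the toy traffic records below are mine.

```python
import random

random.seed(4)

def kmeans(points, k, iters=20):
    """Plain k-means on lists of numeric records."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def perturb(points, k=3):
    """Replace every record by its cluster's centroid: individual values are
    hidden, while cluster-level statistics survive for later mining."""
    centroids = kmeans(points, k)
    return [min(centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, c)))
            for p in points]

# Toy numeric traffic records (e.g. packet size, duration) in three groups.
points = [[random.gauss(m, 0.3), random.gauss(m, 0.3)]
          for m in (0, 5, 10) for _ in range(50)]
released = perturb(points)
```

After perturbation, at most k distinct records remain, so no individual measurement can be read back, yet aggregate properties such as the overall means stay close to the original data, which is the utility/privacy balance the framework aims for.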