969 results for on-disk data layout
Abstract:
Machine learning comprises a series of techniques for the automatic extraction of meaningful information from large collections of noisy data. In many real-world applications, data is naturally represented in structured form. Since traditional machine learning methods deal with vectorial information, they require an a priori form of preprocessing. Among the learning techniques for dealing with structured data, kernel methods are recognized as having a strong theoretical background and as being effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree-structured data, two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information to make correct predictions on unseen data. In fact, it tends to produce a discriminating function that behaves like the nearest-neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets with node labels belonging to a large domain. A second drawback of tree kernels is the time complexity required in both the learning and classification phases. Such complexity can sometimes prevent the application of kernels in scenarios involving large amounts of data. This thesis proposes three contributions for resolving the above issues of kernels for trees. The first contribution aims at creating kernel functions that adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions.
Specifically, we propose to encode the input trees with an algorithm able to project the data onto a lower-dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower-dimensional representation, we are able to perform inexact matching between different inputs in the original space. The second contribution is the proposal of a novel kernel function based on the convolution kernel framework. A convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. The third contribution is devoted to reducing the computational burden of calculating a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.
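The idea of a kernel that counts shared substructures can be illustrated with a minimal sketch of a plain subtree kernel. This is not the kernel proposed in the thesis; the tree encoding as `(label, children)` pairs, the function names, and the example trees are all assumptions made for illustration, and labels are assumed not to contain parentheses.

```python
from collections import Counter

# Hypothetical encoding: a tree is a (label, children) pair,
# where children is a list of trees in left-to-right order.

def subtree_signatures(tree, counts):
    """Record a canonical string for every complete subtree of `tree`.

    Returns the signature of the root so parents can build on it.
    """
    label, children = tree
    sig = "(" + label + "".join(subtree_signatures(c, counts) for c in children) + ")"
    counts[sig] += 1
    return sig

def subtree_kernel(t1, t2):
    """Count pairs of identical complete subtrees shared by t1 and t2."""
    c1, c2 = Counter(), Counter()
    subtree_signatures(t1, c1)
    subtree_signatures(t2, c2)
    # Each shared signature contributes (occurrences in t1) * (occurrences in t2).
    return sum(n * c2[sig] for sig, n in c1.items())

# Illustrative trees (assumed, not from the thesis).
sentence = ("S", [("NP", []), ("VP", [("V", []), ("NP", [])])])
phrase = ("VP", [("V", []), ("NP", [])])

k_same = subtree_kernel(sentence, sentence)
k_diff = subtree_kernel(sentence, phrase)
```

Because identical subtrees collapse to one signature, the counters play a role loosely similar to the compact shared representation the abstract attributes to Directed Acyclic Graphs: each distinct substructure is stored once with a multiplicity.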
Abstract:
The first part of my work consisted of sampling conducted in nine localities of the Salento peninsula and Apulia (Italy): Costa Merlata (BR), Punta Penne (BR), Santa Cesarea Terme (LE), Santa Caterina (LE), Torre Inserraglio (LE), Torre Guaceto (BR), Porto Cesareo (LE), Otranto (LE), Isole Tremiti (FG). I collected percentage cover data for species from the infralittoral rocky zone, using 50x50 cm squares. We considered 3 sites per locality and 10 replicates per site, taken at random. I then combined these with other data on the same places, collected over several years, to carry out a spatial analysis. I thus started from a data set of 1896 samples, but decided not to consider time as a factor because I have reason to think that anthropogenic stressors and their effects (if present) did not change considerably over this period. The response variable I analysed is the percentage cover of 243 species (subsequently merged into 32 functional groups), including seaweeds, invertebrates, sediment and rock. After the sampling, I spent two months at the Hopkins Marine Station of Stanford University, in Monterey (California, USA), at Fiorenza Micheli's laboratory, where I carried out statistical analyses on my data set using the software PRIMER 6. My exploratory analysis starts with an nMDS in PRIMER 6, considering the original data matrix without, for the moment, the effect of stressors. The result is a good separation between localities, which confirms the result of the ANOSIM analysis conducted on the original data matrix. What can be established is that the separation is not driven by a geographic pattern, so something else must drive the differences.
At least three groups are clearly present: one composed of Porto Cesareo, Torre Guaceto and Isole Tremiti (the only marine protected areas considered in this work); another of Otranto; and the last of the remaining small, impacted localities. Within the localities that include Marine Protected Areas (MPAs), a sort of grouping between protected and controlled areas can also be observed. The SIMPER analysis shows that most of the species driving the differences between populations are not rare species, for example Cystoseira spp., Mytilus sp. and ECR. Moreover, I assigned discrete values (0, 1, 2) for each stressor to all the sites I considered, according to the intensity with which the anthropogenic factor affects the localities. I then tried to establish whether there were significant interactions between stressors: using Spearman rank correlation and Spearman tables of significance, with 17 degrees of freedom, the outcome shows some significant stressor interactions. I then built an nMDS considering the stressors as the response variable. The result was positive: localities are well separated by stressors. Consequently, I related the 'localities and species' matrix to the 'localities and stressors' one. The combination of stressors explains the variability within my populations at a good significance level. I tried all the possible data transformations (none, square root, fourth root, log(X+1), P/A); the fourth root appeared to be the best, with the highest level of significance, meaning that rare species can also influence the result. The challenge will be to better characterize which kinds of stressors (including natural ones) act on the ecosystem, to assign them quantitative and more accurate values, and to understand how they interact (in an additive or non-additive way).
Abstract:
The last decades have seen an unrivaled growth and diffusion of mobile telecommunications. Several standards have been developed for this purpose, from GSM mobile phone communications to WLAN IEEE 802.11, providing different services for the transmission of signals ranging from voice to high data rate digital communications and Digital Video Broadcasting (DVB). Within this wide research and market field, this thesis focuses on Ultra Wideband (UWB) communications, an emerging technology for providing very high data rate transmissions over very short distances. In particular, the presented research deals with the circuit design of enabling blocks for MB-OFDM UWB CMOS single-chip transceivers, namely the frequency synthesizer, the transmission mixer and the power amplifier. First we discuss three different models for the simulation of charge-pump phase-locked loops, namely the continuous-time s-domain and discrete-time z-domain approximations and the exact semi-analytical time-domain model. The limitations of the two approximate models are analyzed in terms of the error in the computed settling time as a function of loop parameters, deriving practical conditions under which the different models are reliable for fast-settling PLLs up to fourth order. In addition, a phase noise analysis method based on the time-domain model is introduced and compared with the results obtained by means of the s-domain model. We compare the three models on the simulation of a fast-switching PLL to be integrated in a frequency synthesizer for WiMedia MB-OFDM UWB systems. In the second part, the theoretical analysis is applied to the design of a 60 mW, 3.4 to 9.2 GHz, 12-band frequency synthesizer for MB-OFDM UWB based on two wide-band PLLs. The design is presented and discussed up to layout level. A test chip has been implemented in TSMC 90 nm CMOS technology, and measured data is provided.
The functionality of the circuit is proved, and specifications are met with state-of-the-art area occupation and power consumption. The last part of the thesis deals with the design of a transmission mixer and a power amplifier for MB-OFDM UWB band group 1. The design has been carried out up to layout level in STMicroelectronics 65 nm CMOS technology. The main characteristics of the systems are the wideband behavior (1.6 GHz of bandwidth) and the constant behavior over process parameters, temperature and supply voltage, thanks to the design of dedicated adaptive biasing circuits.
Abstract:
Fog oases, locally named Lomas, are distributed in a fragmented way along the western coast of Chile and Peru (South America) between ~6°S and 30°S, following an altitudinal gradient determined by a fog layer. This fragmentation has been attributed to the hyper-aridity of the desert. However, periodic climatic events influence the ‘normal seasonality’ of this ecosystem through a higher than average water input that triggers plant responses (e.g. primary productivity and phenology). The impact of the climatic oscillation may vary according to the season (wet/dry). This thesis evaluates the potential effect of climate oscillations, such as El Niño Southern Oscillation (ENSO), through the analysis of the vegetation of this ecosystem following different approaches. Chapters two and three present the analysis of fog oases along the Peruvian and Chilean deserts. The objectives are: 1) to explain the floristic connection of fog oases by analysing the differences in their taxa composition and the phylogenetic affinities among them, and 2) to explore the climate variables related to ENSO which likely affect fog production, and the responses of Lomas vegetation (composition, productivity, distribution) to climate patterns during ENSO events. Chapters four and five describe a fog oasis in southern Peru during the 2008-2010 period. The objectives are: 3) to describe and create a new vegetation map of the Lomas vegetation using remote sensing analysis supported by field survey data, and 4) to identify the vegetation change during the dry season. The first part of our results shows that: 1) there are three significantly different groups of Lomas (Northern Peru, Southern Peru, and Chile) with a significant phylogenetic divergence among them. The species composition reveals a latitudinal gradient of plant assemblages. The species origin, growth-form typologies, and geographic position also reinforce the differences among groups.
2) Contradictory results have emerged from studies of low-cloud anomalies and fog collection during El Niño (EN). EN increases water availability in fog oases when fog should be less frequent due to the reduction in low-cloud amount and stratocumulus. Because a minor role of fog during EN is expected, it is likely that measurements of fog-water collection during EN are considering drizzle and fog at the same time. Although recent studies on fog oases have shown some relationship with ENSO, responses of vegetation have been largely based on descriptive data; the absence of large temporal records limits the establishment of a direct relationship with climatic oscillations. The second part of the results shows that: 3) five classes with different spectral values correspond to the main land cover of Lomas using a Vegetation Index (VI). The case study is characterised by shrubs and trees with variable cover (dense, semi-dense and open). A secondary area is covered by small shrubs where the dominant tree species is not present. The cacti area and the old terraces with open vegetation were not identified with the VI. Agriculture is present in the area. Finally, 4) in contrast to the dry seasons of 2008 and 2009, a higher VI was obtained during the dry season of 2010. The VI increased up to three times its average value, showing a clear spectral signal change, which coincided with the ENSO event of that period.
Abstract:
The aim of the thesis is to propose a Bayesian estimation through Markov chain Monte Carlo of multidimensional item response theory models for graded responses with complex structures and correlated traits. In particular, this work focuses on the multiunidimensional and the additive underlying latent structures, considering that the first one is widely used and represents a classical approach in multidimensional item response analysis, while the second one is able to reflect the complexity of real interactions between items and respondents. A simulation study is conducted to evaluate the parameter recovery for the proposed models under different conditions (sample size, test and subtest length, number of response categories, and correlation structure). The results show that the parameter recovery is particularly sensitive to the sample size, due to the model complexity and the high number of parameters to be estimated. For a sufficiently large sample size the parameters of the multiunidimensional and additive graded response models are well reproduced. The results are also affected by the trade-off between the number of items constituting the test and the number of item categories. An application of the proposed models on response data collected to investigate Romagna and San Marino residents' perceptions and attitudes towards the tourism industry is also presented.
Abstract:
In this work we discuss a project started by the Emilia-Romagna Regional Government regarding the management of public transport. In particular, we perform a data mining analysis on the data set of this project. After introducing the Weka software used for our analysis, we identify the most useful data mining techniques and algorithms, and we show how these results can be used to violate the privacy of the public transport operators themselves. Finally, although it is outside the scope of this work, we also say a few words about how this kind of attack can be prevented.
Abstract:
Whether the use of mobile phones is a risk factor for brain tumors in adolescents is currently being studied. Case-control studies investigating this possible relationship are prone to recall error and selection bias. We assessed the potential impact of random and systematic recall error and selection bias on odds ratios (ORs) by performing simulations based on real data from an ongoing case-control study of mobile phones and brain tumor risk in children and adolescents (CEFALO study). Simulations were conducted for two mobile phone exposure categories: regular and heavy use. Our choice of levels of recall error was guided by a validation study that compared objective network operator data with the self-reported amount of mobile phone use in CEFALO. In our validation study, cases overestimated their number of calls by 9% on average and controls by 34%. Cases also overestimated their duration of calls by 52% on average and controls by 163%. The participation rates in CEFALO were 83% for cases and 71% for controls. In a variety of scenarios, the combined impact of recall error and selection bias on the estimated ORs was complex. These simulations are useful for the interpretation of previous case-control studies on brain tumor and mobile phone use in adults as well as for the interpretation of future studies on adolescents.
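A small deterministic sketch can show how differential recall error alone may shift an odds ratio. Only the 9% and 34% overestimation figures come from the abstract; the call counts, the exposure threshold, and the function names are hypothetical, and this is not the simulation procedure used in the CEFALO study.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Odds ratio from a 2x2 exposure table."""
    return (exp_cases * unexp_controls) / (unexp_cases * exp_controls)

def classify(true_calls, threshold, overestimate):
    """Split subjects into (exposed, unexposed) using their *reported* calls.

    `overestimate` is the relative recall error, e.g. 0.09 for +9%.
    """
    reported = [c * (1 + overestimate) for c in true_calls]
    exposed = sum(r >= threshold for r in reported)
    return exposed, len(true_calls) - exposed

# Hypothetical true weekly call counts, identical in both groups,
# so the true odds ratio is exactly 1.
true_calls = [10, 20, 30, 40, 50, 60]
HEAVY_USE_THRESHOLD = 45  # hypothetical cut-off for "heavy use"

true_or = odds_ratio(*classify(true_calls, HEAVY_USE_THRESHOLD, 0.0),
                     *classify(true_calls, HEAVY_USE_THRESHOLD, 0.0))

# Recall error taken from the abstract: cases +9%, controls +34%.
biased_or = odds_ratio(*classify(true_calls, HEAVY_USE_THRESHOLD, 0.09),
                       *classify(true_calls, HEAVY_USE_THRESHOLD, 0.34))
```

In this toy setting the stronger overestimation among controls pushes one extra control above the threshold, so the estimated OR drops below the true value of 1 even though exposure is identical in both groups.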
Abstract:
In June 2008 the compulsory nationwide vaccination against BTV-8 (Bluetongue virus serotype 8) was started. After a short time, several owners complained about undesirable effects of the vaccination on fertility and milk quality. Data from 47 dairy farms, regularly supervised by herd health practitioners, were analysed in order to clarify a possible connection between vaccination and fertility. Both vaccinations given to each cow for basic immunization were evaluated according to their effects on conception rate and pregnancy. In model calculations the first vaccination had no significant effect on the first service conception rate (FCR), the all service conception rate (ACR) or the abortion rate. The second vaccination led to a significantly reduced FCR when the cow was inseminated within 20 days of being vaccinated and to a significantly worse ACR when inseminated 10 days before or after vaccination. However, these individually established reductions of the insemination rate had only a small influence on overall data.
Abstract:
Data on antimicrobial use play a key role in the development of policies for the containment of antimicrobial resistance. On-farm data could provide a detailed overview of the antimicrobial use, but technical and methodological aspects of data collection and interpretation, as well as data quality need to be further assessed. The aims of this study were (1) to quantify antimicrobial use in the study population using different units of measurement and contrast the results obtained, (2) to evaluate data quality of farm records on antimicrobial use, and (3) to compare data quality of different recording systems. During 1 year, data on antimicrobial use were collected from 97 dairy farms. Antimicrobial consumption was quantified using: (1) the incidence density of antimicrobial treatments; (2) the weight of active substance; (3) the used daily dose and (4) the used course dose for antimicrobials for intestinal, intrauterine and systemic use; and (5) the used unit dose, for antimicrobials for intramammary use. Data quality was evaluated by describing completeness and accuracy of the recorded information, and by comparing farmers' and veterinarians' records. Relative consumption of antimicrobials depended on the unit of measurement: used doses reflected the treatment intensity better than weight of active substance. The use of antimicrobials classified as high priority was low, although under- and overdosing were frequently observed. Electronic recording systems allowed better traceability of the animals treated. Recording drug name or dosage often resulted in incomplete or inaccurate information. Veterinarians tended to record more drugs than farmers. The integration of veterinarian and farm data would improve data quality.
Abstract:
This thesis explores system performance for reconfigurable distributed systems and provides an analytical model for determining throughput of theoretical systems based on the OpenSPARC FPGA Board and the SIRC Communication Framework. This model was developed by studying a small set of variables that together determine a system's throughput. The importance of this model is in assisting system designers to make decisions as to whether or not to commit to designing a reconfigurable distributed system based on the estimated performance and hardware costs. Because custom hardware design and distributed system design are both time consuming and costly, it is important for designers to make decisions regarding system feasibility early in the development cycle. Based on experimental data, the model presented in this paper shows a close fit with less than 10% experimental error on average. The model is limited to a certain range of problems, but it can still be used given those limitations and also provides a foundation for further development of modeling reconfigurable distributed systems.
Abstract:
Additions of nitrogen (N) have been shown to alter species diversity of plant communities, with most experimental studies having been carried out in communities dominated by herbaceous species. We examined seasonal and inter-annual patterns of change in the herbaceous layer of two watersheds of a central Appalachian hardwood forest that differed in experimental treatment. This study was carried out at the Fernow Experimental Forest, West Virginia, using two adjacent watersheds: WS4 (mature, second-growth hardwood stand, untreated reference), and WS3. Seven circular 0.04-ha sample plots were established in each watershed to represent its full range of elevation and slope aspect. The herbaceous layer was sampled by identifying and visually estimating cover (%) of all vascular plants. Sampling was carried out in mid-July of 1991 and repeated at approximately the same time in 1992. In 1994, these same plots were sampled each month from May to October. Seasonal patterns of herb layer dynamics were assessed for the complete 1994 data set, whereas inter-annual variability was based on plot data from 1991, 1992, and the July sample of 1994. There were no significant differences between watersheds for any sample year for any of the other herb layer characteristics measured, including herb layer cover, species richness, evenness, and diversity. Cover on WS4 decreased significantly from 1991 to 1992, followed by no change to 1994. By contrast, herb layer cover did not vary significantly across years on WS3. Cover of the herbaceous layer of both watersheds increased from early in the growing season to the middle of the growing season, decreasing thereafter, with no significant differences between WS3 and WS4 for any of the monthly cover means in 1994. Similar seasonal patterns found for herb layer cover, and lack of significant differences between watersheds, were also evident for species diversity and richness.
By contrast, there was little seasonal change in herb layer species evenness, which was nearly identical between watersheds for all months except October. Seasonal patterns for individual species/species groups were closely similar between watersheds, especially for Viola rotundifolia and Viola spp. Species richness and species diversity were linearly related to herb layer cover for both WS3 and WS4, suggesting that spatial and temporal increases in cover were more related to recruitment of herb layer species than to growth of existing species. Results of this study indicate that there have been negligible responses of the herb layer to 6 yr of additions to WS3.
Abstract:
OBJECT: In this study, 1H magnetic resonance (MR) spectroscopy was prospectively tested as a reliable method for presurgical grading of neuroepithelial brain tumors. METHODS: Using a database of tumor spectra obtained in patients with histologically confirmed diagnoses, 94 consecutive untreated patients were studied using single-voxel 1H spectroscopy (point-resolved spectroscopy; TE 135 msec, TR 1500 msec). A total of 90 tumor spectra obtained in patients with diagnostic 1H MR spectroscopy examinations were analyzed using commercially available software (MRUI/VARPRO) and classified using linear discriminant analysis as World Health Organization (WHO) Grade I/II, WHO Grade III, or WHO Grade IV lesions. In all cases, the classification results were matched with histopathological diagnoses that were made according to the WHO classification criteria after serial stereotactic biopsy procedures or open surgery. Histopathological studies revealed 30 Grade I/II tumors, 29 Grade III tumors, and 31 Grade IV tumors. The reliability of the histological diagnoses was validated considering a minimum postsurgical follow-up period of 12 months (range 12-37 months). Classifications based on spectroscopic data yielded 31 tumors in Grade I/II, 32 in Grade III, and 27 in Grade IV. Incorrect classifications included two Grade II tumors, one of which was identified as Grade III and one as Grade IV; two Grade III tumors identified as Grade II; two Grade III lesions identified as Grade IV; and six Grade IV tumors identified as Grade III. Furthermore, one glioblastoma (WHO Grade IV) was classified as WHO Grade I/II. This represents an overall success rate of 86%, and a 95% success rate in differentiating low-grade from high-grade tumors. CONCLUSIONS: The authors conclude that in vivo 1H MR spectroscopy is a reliable technique for grading neuroepithelial brain tumors.
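The overall success rate can be checked by tallying the misclassifications listed in the abstract. The short sketch below does only that arithmetic; the dictionary keys and variable names are illustrative labels, not terms from the study.

```python
TOTAL_SPECTRA = 90  # tumor spectra analyzed, as stated in the abstract

# Misclassifications as enumerated in the abstract.
misclassified = {
    "Grade II read as Grade III": 1,
    "Grade II read as Grade IV": 1,
    "Grade III read as Grade II": 2,
    "Grade III read as Grade IV": 2,
    "Grade IV read as Grade III": 6,
    "Grade IV read as Grade I/II": 1,
}

errors = sum(misclassified.values())
correct = TOTAL_SPECTRA - errors
success_rate = correct / TOTAL_SPECTRA

print(f"{correct}/{TOTAL_SPECTRA} correct = {success_rate:.1%}")
```

With 13 errors in 90 spectra, 77 are classified correctly, which rounds to the reported overall success rate of 86%.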
Abstract:
The synchronization of dynamic multileaf collimator (DMLC) response with respiratory motion is critical to ensure the accuracy of DMLC-based four dimensional (4D) radiation delivery. In practice, however, a finite time delay (response time) between the acquisition of tumor position and multileaf collimator response necessitates predictive models of respiratory tumor motion to synchronize radiation delivery. Predicting a complex process such as respiratory motion introduces geometric errors, which have been reported in several publications. However, the dosimetric effect of such errors on 4D radiation delivery has not yet been investigated. Thus, our aim in this work was to quantify the dosimetric effects of geometric error due to prediction under several different conditions. Conformal and intensity modulated radiation therapy (IMRT) plans for a lung patient were generated for anterior-posterior/posterior-anterior (AP/PA) beam arrangements at 6 and 18 MV energies to provide planned dose distributions. Respiratory motion data was obtained from 60 diaphragm-motion fluoroscopy recordings from five patients. A linear adaptive filter was employed to predict the tumor position. The geometric error of prediction was defined as the absolute difference between predicted and actual positions at each diaphragm position. Distributions of geometric error of prediction were obtained for all of the respiratory motion data. Planned dose distributions were then convolved with distributions for the geometric error of prediction to obtain convolved dose distributions. The dosimetric effect of such geometric errors was determined as a function of several variables: response time (0-0.6 s), beam energy (6/18 MV), treatment delivery (3D/4D), treatment type (conformal/IMRT), beam direction (AP/PA), and breathing training type (free breathing/audio instruction/visual feedback). Dose difference and distance-to-agreement analysis was employed to quantify results. 
Based on our data, the dosimetric impact of prediction (a) increased with response time, (b) was larger for 3D radiation therapy as compared with 4D radiation therapy, (c) was relatively insensitive to change in beam energy and beam direction, (d) was greater for IMRT distributions as compared with conformal distributions, (e) was smaller than the dosimetric impact of latency, and (f) was greatest for respiration motion with audio instructions, followed by visual feedback and free breathing. Geometric errors of prediction that occur during 4D radiation delivery introduce dosimetric errors that are dependent on several factors, such as response time, treatment-delivery type, and beam energy. Even for relatively small response times of 0.6 s into the future, dosimetric errors due to prediction could approach delivery errors when respiratory motion is not accounted for at all. To reduce the dosimetric impact, better predictive models and/or shorter response times are required.
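The step of convolving a planned dose distribution with a distribution of prediction errors can be sketched in one dimension. The dose profile, the error probabilities, and the function name below are hypothetical values chosen for illustration; the study's actual distributions and dimensionality are not given in the abstract.

```python
def convolve_dose(dose, error_pdf):
    """Blur a 1D dose profile with a discrete positional-error distribution.

    `error_pdf` maps an integer shift (in grid cells) to its probability.
    Dose outside the profile is treated as zero.
    """
    n = len(dose)
    blurred = [0.0] * n
    for i in range(n):
        for shift, prob in error_pdf.items():
            j = i - shift  # source position that lands on i after the shift
            if 0 <= j < n:
                blurred[i] += prob * dose[j]
    return blurred

# Hypothetical planned profile: uniform dose across a 3-cell target.
planned = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]

# Hypothetical prediction-error distribution over shifts of -1, 0, +1 cells.
error_pdf = {-1: 0.25, 0: 0.5, 1: 0.25}

convolved = convolve_dose(planned, error_pdf)
# The target edges lose dose and the neighbouring cells gain it.
```

Comparing `planned` with `convolved` cell by cell gives a simple dose-difference view of the effect: the centre of the target is unchanged while the edges are smeared, and a wider error distribution would smear them further.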
Abstract:
Background Young children are known to be the most frequent hospital users compared to older children and young adults. Therefore, they are an important population from economic and policy perspectives of health care delivery. In Switzerland complete hospitalization discharge records for children [<5 years] of four consecutive years [2002–2005] were evaluated in order to analyze variation in patterns of hospital use. Methods Stationary and outpatient hospitalization rates at the aggregated ZIP code level were calculated based on census data provided by the Swiss federal statistical office (BfS). Thirty-seven hospital service areas for children [HSAP] were created with the method of "small area analysis", reflecting user-based health markets. Descriptive statistics and general linear models were applied to analyze the data. Results The mean stationary hospitalization rate over four years was 66.1 discharges per 1000 children. Hospitalizations for respiratory problems are the most dominant in young children (25.9%), and the highest hospitalization rates are associated with geographical factors of urban areas and specific language regions. Statistical models yielded significant effect estimates for these factors and a significant association between ambulatory/outpatient and stationary hospitalization rates. Conclusion The utilization-based approach, using HSAP as a spatial representation of user-based health markets, is a valid instrument and allows assessing the supply and demand of children's health care services. The study provides for the first time estimates for several factors associated with the large variation in the utilization and provision of paediatric health care resources in Switzerland.
Abstract:
BACKGROUND: In clinical practice a diagnosis is based on a combination of clinical history, physical examination and additional diagnostic tests. At present, studies on diagnostic research often report the accuracy of tests without taking into account the information already known from history and examination. Due to this lack of information, together with variations in design and quality of studies, conventional meta-analyses based on these studies will not show the accuracy of the tests in real practice. By using individual patient data (IPD) to perform meta-analyses, the accuracy of tests can be assessed in relation to other patient characteristics and allows the development or evaluation of diagnostic algorithms for individual patients. In this study we will examine these potential benefits in four clinical diagnostic problems in the field of gynaecology, obstetrics and reproductive medicine. METHODS/DESIGN: Based on earlier systematic reviews for each of the four clinical problems, studies are considered for inclusion. The first authors of the included studies will be invited to participate and share their original data. After assessment of validity and completeness the acquired datasets are merged. Based on these data, a series of analyses will be performed, including a systematic comparison of the results of the IPD meta-analysis with those of a conventional meta-analysis, development of multivariable models for clinical history alone and for the combination of history, physical examination and relevant diagnostic tests and development of clinical prediction rules for the individual patients. These will be made accessible for clinicians. DISCUSSION: The use of IPD meta-analysis will allow evaluating accuracy of diagnostic tests in relation to other relevant information. Ultimately, this could increase the efficiency of the diagnostic work-up, e.g. by reducing the need for invasive tests and/or improving the accuracy of the diagnostic workup. 
This study will assess whether these benefits of IPD meta-analysis over conventional meta-analysis can be exploited and will provide a framework for future IPD meta-analyses in diagnostic and prognostic research.