888 results for Dataset
Abstract:
In this paper we deal with a Bayesian analysis for right-censored survival data suitable for populations with a cure rate. We consider a cure rate model based on the negative binomial distribution, encompassing as a special case the promotion time cure model. Bayesian analysis is based on Markov chain Monte Carlo (MCMC) methods. We also present some discussion on model selection and an illustration with a real dataset.
Abstract:
Credit scoring modelling is one of the leading formal tools for supporting credit-granting decisions. Its core objective is to generate a score by which potential clients can be ranked by their probability of default. A critical factor is whether a credit scoring model is accurate enough to classify a client correctly as a good or bad payer. In this context the concept of bootstrap aggregating (bagging) arises. The basic idea is to generate multiple classifiers by fitting models to several replicated datasets and then to combine their predicted values into a single predictive classification in order to improve classification accuracy. In this paper we propose a new bagging-type variant procedure, which we call poly-bagging, consisting of combining predictors over a succession of resamplings. The study is motivated by credit scoring modelling. The proposed poly-bagging procedure was applied to several artificial datasets and to a real credit-granting dataset, with up to three successions of resamplings. We observed better classification accuracy for the two-bagged and the three-bagged models in all setups considered. These results strongly indicate that the poly-bagging approach may improve modelling performance measures while keeping a flexible and straightforward bagging-type structure that is easy to implement. (C) 2011 Elsevier Ltd. All rights reserved.
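The bagging idea described in this abstract can be illustrated with a minimal sketch. The following is not the authors' implementation: the base learner (`fit_centroid`, a one-dimensional nearest-centroid classifier), the toy data, and the reading of "a succession of resamplings" as bagging applied to an already bagged fitting procedure are all assumptions made for illustration.

```python
import random
from collections import Counter

def fit_centroid(sample):
    """Hypothetical base learner: 1-D nearest-centroid classifier.

    `sample` is a list of (feature, label) pairs; returns a predict function.
    """
    sums, counts = {}, {}
    for x, y in sample:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def bagged(fit, n_models, rng):
    """Turn a fitting procedure into its bagged version.

    The returned procedure fits `n_models` models, each on a bootstrap
    resample (drawn with replacement, same size as the input sample),
    and predicts by majority vote.
    """
    def fit_bagged(sample):
        models = [fit([rng.choice(sample) for _ in sample])
                  for _ in range(n_models)]
        return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]
    return fit_bagged

# Toy data (invented): label 1 = bad payer, label 0 = good payer.
rng = random.Random(0)
data = ([(rng.gauss(2.0, 1.0), 1) for _ in range(30)]
        + [(rng.gauss(8.0, 1.0), 0) for _ in range(30)])

one_bagged = bagged(fit_centroid, 15, rng)   # ordinary bagging
two_bagged = bagged(one_bagged, 5, rng)      # assumed reading of "two-bagged"
model = two_bagged(data)
print(model(1.0), model(9.0))
```

Because `bagged` takes and returns a fitting procedure, a third level is simply `bagged(two_bagged, ...)`; whether this nesting matches the paper's exact definition of poly-bagging is not confirmed by the abstract.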
Abstract:
For the first time, we introduce a class of transformed symmetric models to extend the Box and Cox models to more general symmetric models. The new class of models includes all symmetric continuous distributions with a possible non-linear structure for the mean and enables the fitting of a wide range of models to several data types. The proposed methods offer more flexible alternatives to Box-Cox or other existing procedures. We derive a very simple iterative process for fitting these models by maximum likelihood, whereas a direct unconditional maximization would be more difficult. We give simple formulae to estimate the parameter that indexes the transformation of the response variable and the moments of the original dependent variable, which generalize previously published results. We discuss inference on the model parameters. The usefulness of the new class of models is illustrated in one application to a real dataset.
Abstract:
In the present study, we propose a graph-theoretical procedure to investigate multiple pathways in brain functional networks. By taking into account all the possible paths consisting of h links between the node pairs of the network, we measured the global network redundancy R(h) as the number of parallel paths and the global network permeability P(h) as the probability of getting connected. We used this procedure to investigate the structural and dynamical changes in the cortical networks estimated from a dataset of high-resolution EEG signals in a group of spinal cord injured (SCI) patients during attempted foot movement. In a statistical contrast with a healthy population, the permeability index P(h) of the SCI networks increased significantly (P < 0.01) in the Theta frequency band (3-6 Hz) for distances h ranging from 2 to 4. In contrast, no significant differences were found between the two populations for the redundancy index R(h). The most significant changes in the brain functional network of SCI patients occurred mainly in the lower spectral contents. These changes were related to an improved propagation of communication between the closest cortical areas rather than to a different level of redundancy. This evidence strengthens the hypothesis that a higher functional interaction among the closest ROIs is needed as a mechanism to compensate for the lack of feedback from the peripheral nerves to the sensorimotor areas.
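As a rough illustration of path-based indices of this kind, the sketch below counts walks of exactly h links with powers of the adjacency matrix. The abstract does not give the exact formulas, so two assumptions are made here: walks (which may revisit nodes) stand in for paths, and permeability is taken as the fraction of ordered node pairs joined by at least one walk of length h. The function names and the example network are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def walks_of_length(adj, h):
    """Entry (i, j) of adj**h counts the walks of exactly h links from i to j."""
    result = adj
    for _ in range(h - 1):
        result = mat_mul(result, adj)
    return result

def redundancy_and_permeability(adj, h):
    """Assumed definitions, for illustration only.

    redundancy   = total number of h-link walks between distinct node pairs
    permeability = fraction of ordered distinct pairs joined by >= 1 such walk
    """
    n = len(adj)
    w = walks_of_length(adj, h)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    redundancy = sum(w[i][j] for i, j in pairs)
    permeability = sum(1 for i, j in pairs if w[i][j] > 0) / len(pairs)
    return redundancy, permeability

# Directed chain 0 -> 1 -> 2 -> 3 (invented example network).
chain = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]
for h in (1, 2, 3):
    print(h, redundancy_and_permeability(chain, h))
```

In the chain, only the pairs (0, 2) and (1, 3) are joined by two-link walks, so for h = 2 the sketch reports a redundancy of 2 and a permeability of 2/12.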
Abstract:
This work presents a novel approach to increase the recognition power of Multiscale Fractal Dimension (MFD) techniques when applied to image classification. The proposal uses Functional Data Analysis (FDA) to enhance the precision of the MFD technique, yielding a more representative descriptor vector capable of recognizing and characterizing objects in an image more precisely. FDA is applied to signatures extracted with the Bouligand-Minkowski MFD technique in order to generate a descriptor vector from them. To evaluate the improvement obtained, an experiment using two datasets of objects was carried out: a dataset of character shapes (26 characters of the Latin alphabet) carrying different levels of controlled noise, and a dataset of fish image contours. A comparison with the well-known Fourier and wavelet descriptor methods was performed to verify the performance of the FDA method. The descriptor vectors were submitted to the Linear Discriminant Analysis (LDA) classification method, and we compared the rate of correct classification among the descriptor methods. The results demonstrate that FDA outperforms the methods from the literature (Fourier and wavelets) in processing the information extracted from the MFD signature. The proposed method can therefore be considered an interesting choice for pattern recognition and image classification using fractal analysis.
Abstract:
Thermophilic endo-1,3(4)-beta-glucanase (laminarinase) from Rhodothermus marinus was crystallized by the hanging-drop vapor diffusion method. The needle-like crystals belong to space group P2₁ and contain two protein molecules in the asymmetric unit with a solvent content of 51.75%. Diffraction data were collected to a resolution of 1.95 Å and resulted in a dataset with an overall R-merge of 10.4% and a completeness of 97.8%. Analysis of the structure factors revealed pseudomerohedral twinning of the crystals with a twin fraction of approximately 42%.
Abstract:
An integrated whole-rock petrographic and geochemical study has been carried out on kamafugites and kimberlites of the Late Cretaceous Alto Paranaiba igneous province, in Brazil, and on their main minerals: olivine, clinopyroxene, perovskite, phlogopite, spinels and ilmenite. Perovskite is by far the dominant repository for light lanthanides, Nb, Ta, Th and U, and occasionally other elements, reaching concentrations up to 3.4 × 10^4 chondrite values for light lanthanides and 105 chondrite for Th. A very strong fractionation between light and heavy lanthanides (chondrite-normalized La/Yb from ~175 to ~2000) is also observed. This is likely the first comprehensive dataset on natural perovskite. Clinopyroxene has variable trace-element contents, likely due to the different position of this phase in the crystallization sequence; Sc reaches values as high as 200 ppm, whereas the lanthanides show highly variable enrichment in light over heavy REE and commonly show a negative Eu anomaly. Olivine, phlogopite (and tetra-ferriphlogopite), Cr-Ti oxide and ilmenite are substantially barren of lanthanides and most other trace elements, with the exception of Ba, Cs and Rb in mica, and V, Nb and Ta in ilmenite. Estimated mineral/whole-rock partition coefficients for lanthanides in perovskite are similar to previous determinations, though much higher than those calculated in experiments with synthetic compositions, testifying once more to the complex behavior of these elements in a natural environment. The enormous potential for exploitation of lanthanides, Th, U and high-field-strength elements in the Brazilian kamafugites, kimberlites and related rocks is clearly shown.
Abstract:
The southwestern margin of the Eastern Ghats Belt characteristically exposes mafic dykes intruding massif-type charnockites. Dykes of olivine basalt of alkaline composition have characteristic trace element signatures comparable with Ocean Island Basalt (OIB). Most importantly, a strong positive Nb anomaly and low Zr/Nb ratios are consistent with an OIB source for the mafic dykes. K-Ar isotopic data indicate two cooling ages, at 740 and 530 Ma. The Pan-African thermal event could be related to reactivation of major shear zones and is represented by leuco-granite veins along minor shear bands. The 740 Ma cooling age may reflect the low-grade metamorphic imprints noted in some of the dykes. Although no intrusion age could be determined from the present dataset, it could be constrained by some age data from the host charnockite gneiss and the alkaline rocks of the adjacent Prakasam Province. Assuming an intrusion age of ~1.3 Ga, the Sr-Nd isotopic composition of the dykes indicates that they preserved time-integrated LREE enrichment. In view of the chemical signatures of an OIB source, the mafic dykes could also be related to continental rifting around 1.3 Ga, which may have been initiated by intra-plate volcanism.
Abstract:
The shuttle radar topography mission (SRTM) was flown on the space shuttle Endeavour in February 2000, with the objective of acquiring a digital elevation model of all land between 60 degrees north latitude and 56 degrees south latitude, using interferometric synthetic aperture radar (InSAR) techniques. The SRTM data are distributed at a horizontal resolution of 1 arc-second (~30 m) for areas within the USA and at 3 arc-second (~90 m) resolution for the rest of the world. A resolution of 90 m can be considered suitable for small- or medium-scale analysis, but it is too coarse for more detailed purposes. One alternative is to interpolate the SRTM data at a finer resolution; this will not increase the level of detail of the original digital elevation model (DEM), but it will lead to a surface with coherent angular properties (i.e. slope, aspect) between neighbouring pixels, which is an important characteristic when dealing with terrain analysis. This work intends to show how variogram and kriging parameters, namely the nugget effect and the maximum distance within which values are used in interpolation, can be adjusted to achieve quality results when resampling SRTM data from 3" to 1". We present results for a test area in the western USA, including different adjustment schemes (changes in the nugget effect value and in the interpolation radius) and comparisons with the original 1" model of the area, with the national elevation dataset (NED) DEMs, and with other interpolation methods (splines and inverse distance weighted (IDW)).
The basic concepts for using kriging to resample terrain data are: (i) working only with the immediate neighbourhood of the predicted point, due to the high spatial correlation of the topographic surface and the omnidirectional behaviour of the variogram at short distances; (ii) adding a very small random variation to the coordinates of the points prior to interpolation, to avoid point artifacts generated by predicted points that have the same location as original data points; and (iii) using a small value of the nugget effect, to avoid smoothing that can obliterate terrain features. Drainages derived from the surfaces interpolated by kriging and by splines agree well with streams derived from the 1" NED, with correct identification of watersheds, even though a few differences occur in the positions of some rivers in flat areas. Although the 1" surfaces resampled by kriging and splines are very similar, we consider the results produced by kriging superior, since the spline-interpolated surface still presented some noise and linear artifacts, which were removed by kriging.
Abstract:
In this paper, we present a 3D face photography system based on a facial expression training dataset, composed of both facial range images (3D geometry) and facial texture (2D photography). The proposed system allows one to obtain a 3D geometry representation of a given face provided as a 2D photograph, which undergoes a series of transformations through the estimated texture and geometry spaces. In the training phase of the system, the facial landmarks are obtained by an active shape model (ASM) extracted from the 2D gray-level photograph. Principal components analysis (PCA) is then used to represent the face dataset, thus defining one orthonormal basis for texture and another for geometry. In the reconstruction phase, the input is a face image to which the ASM is matched. The extracted facial landmarks and the face image are fed to the PCA basis transform, and a 3D version of the 2D input image is built. Experimental tests using a new dataset of 70 facial expressions belonging to ten subjects as the training set show rapidly reconstructed 3D faces that maintain spatial coherence consistent with human perception, thus corroborating the efficiency and applicability of the proposed system.
Abstract:
In this article, we study some results related to a specific class of distributions, called skew-curved-symmetric family of distributions that depends on a parameter controlling the skewness and kurtosis at the same time. Special elements of this family which are studied include symmetric and well-known asymmetric distributions. General results are given for the score function and the observed information matrix. It is shown that the observed information matrix is always singular for some special cases. We illustrate the flexibility of this class of distributions with an application to a real dataset on characteristics of Australian athletes.
Abstract:
Some sesquiterpene lactones (SLs) are the active compounds of a great number of traditional medicinal plants from the Asteraceae family and possess considerable cytotoxic activity. Several in vitro studies have shown inhibitory activity against cells derived from human carcinoma of the nasopharynx (KB). Chemical studies showed that the cytotoxic activity is due to the reaction of alpha,beta-unsaturated carbonyl structures of the SLs with thiols, such as cysteine. These studies support the view that SLs inhibit tumour growth by selective alkylation of growth-regulatory biological macromolecules, such as key enzymes that control cell division, thereby inhibiting a variety of cellular functions and directing the cells into apoptosis. In this study we investigated the cytotoxic properties of a set of 55 different sesquiterpene lactones, represented by 5 skeletons (22 germacranolides, 6 elemanolides, 2 eudesmanolides, 16 guaianolides and nor-derivatives, and 9 pseudoguaianolides). The experimental results and 3D molecular descriptors were submitted to a Kohonen self-organizing map (SOM) to classify (training set) and predict (test set) the cytotoxic activity. From the results obtained, it was concluded that only the geometrical descriptors showed satisfactory values. The Kohonen map obtained after training with 25 geometrical descriptors shows a very significant match, mainly among the inactive compounds (~84%). Considering both groups together, the percentage is also high (83%). The test set shows the highest match, with 89% of the substances having their cytotoxic activity correctly predicted. From these results, important properties for the inhibition potency are discussed for the whole dataset and for subsets of the different structural skeletons. (C) 2008 Elsevier Masson SAS. All rights reserved.
Abstract:
Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Because Wikipedia is free and open for everyone to edit, the quality of its articles may be affected. Not all people have the same level of knowledge, and different people hold different opinions about a topic, so the contributions made by different authors may differ. To address this, it is important to classify the articles so that good-quality articles can be separated from poor-quality ones, which should be removed from the database. The aim of this study is to classify Wikipedia articles into two classes, class 0 (poor quality) and class 1 (good quality), using the Adaptive Neuro Fuzzy Inference System (ANFIS) and data mining techniques. Two ANFIS models are built using the Fuzzy Logic Toolbox [1] available in Matlab. The first ANFIS is based on the rules obtained from the J48 classifier in WEKA, while the other was built using expert knowledge. The data used for this research contain 226 article records taken from the German version of Wikipedia. The dataset consists of 19 inputs and one output. The data were preprocessed to remove any similar attributes. The input variables are related to the editors, contributors, length of articles and the lifecycle of articles. Finally, the different methods implemented in this research are analyzed to compare the performance of each classification method used.
Abstract:
The aim of this study is to evaluate the variation of solar radiation data between different data sources that will be free and available at the Solar Energy Research Center (SERC). The comparison between data sources is carried out for two locations: Stockholm, Sweden and Athens, Greece. For these locations, data are gathered for different tilt angles: 0°, 30°, 45° and 60°, facing south. The full dataset is available in two Excel files: “Stockholm annual irradiation” and “Athens annual irradiation”. The World Radiation Data Center (WRDC) is taken as the reference for the comparison with other datasets, because it has the longest recorded time span for Stockholm (1964–2010) and Athens (1964–1986), in the form of average monthly irradiation, expressed in kWh/m². The indicator defined for the data comparison is the estimated standard deviation. The mean bias error (MBE) and the root mean square error (RMSE) were also used as statistical indicators for the horizontal solar irradiation data. The variation in solar irradiation data is divided into three categories: natural or inter-annual variability, variation due to different data sources, and variation due to different calculation models. The inter-annual variation is 140.4 kWh/m² or 14.4% for Stockholm and 124.3 kWh/m² or 8.0% for Athens. The estimated deviation for horizontal solar irradiation is 3.7% for Stockholm and 4.4% for Athens. This estimated deviation is respectively equal to 4.5% and 3.6% for Stockholm and Athens at 30° tilt, 5.2% and 4.5% at 45° tilt, and 5.9% and 7.0% at 60° tilt. NASA’s SSE, SAM and RETScreen (respectively Satel-light) exhibited the highest deviation from WRDC’s data for Stockholm (respectively Athens). The essential source of variation is notably the difference in horizontal solar irradiation. The variation increases by 1-2% per degree of tilt when different calculation models are used, as in PVSYST and Meteonorm.
The location and altitude of the data source did not directly influence the variation with respect to the WRDC data. Further examination is suggested on the following points: improving the methodology for selecting the location; the functional dependence of ground-reflected radiation on ambient temperature; the variation of ambient temperature and its impact on different solar energy systems; and the impact of variation in solar irradiation and ambient temperature on system output.
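The two statistical indicators named in this abstract have standard definitions: MBE is the mean of the differences between a candidate source and the reference, and RMSE is the square root of the mean squared difference. A minimal sketch follows; the function names and the monthly irradiation values are invented for illustration and are not taken from the study.

```python
import math

def mean_bias_error(estimates, reference):
    """MBE: average signed difference (estimate - reference)."""
    return sum(e - r for e, r in zip(estimates, reference)) / len(reference)

def root_mean_square_error(estimates, reference):
    """RMSE: square root of the average squared difference."""
    squared = sum((e - r) ** 2 for e, r in zip(estimates, reference))
    return math.sqrt(squared / len(reference))

# Hypothetical monthly horizontal irradiation values in kWh/m2.
reference = [10.0, 40.0, 90.0, 150.0]   # stands in for the reference source
candidate = [12.0, 38.0, 93.0, 149.0]   # stands in for another data source

mbe = mean_bias_error(candidate, reference)
rmse = root_mean_square_error(candidate, reference)
print(mbe, rmse)
```

A positive MBE means the candidate source overestimates the reference on average, while RMSE penalizes large individual deviations regardless of sign, which is why the two are usually reported together.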
Abstract:
Random effect models have been widely applied in many fields of research. However, models with uncertain design matrices for random effects have been little investigated. In some applications with such problems, an expectation method has been used for simplicity. This method does not include the extra information on the uncertainty in the design matrix. A closed-form solution for this problem is generally difficult to attain. We therefore propose a two-step algorithm for estimating the parameters, especially the variance components in the model. The implementation is based on Monte Carlo approximation and a Newton-Raphson-based EM algorithm. As an example, a simulated genetics dataset was analyzed. The results showed that the proportion of the total variance explained by the random effects was accurately estimated, whereas it was highly underestimated by the expectation method. By introducing heuristic search and optimization methods, the algorithm could possibly be developed to infer the 'model-based' best design matrix and the corresponding best estimates.