867 resultados para Cascaded classifier


Relevância:

10.00% 10.00%

Publicador:

Resumo:

Conventional web search engines are centralised in that a single entity crawls and indexes the documents selected for future retrieval, and the relevance models used to determine which documents are relevant to a given user query. As a result, these search engines suffer from several technical drawbacks such as handling scale, timeliness and reliability, in addition to ethical concerns such as commercial manipulation and information censorship. Alleviating the need to rely entirely on a single entity, Peer-to-Peer (P2P) Information Retrieval (IR) has been proposed as a solution, as it distributes the functional components of a web search engine – from crawling and indexing documents, to query processing – across the network of users (or, peers) who use the search engine. This strategy for constructing an IR system poses several efficiency and effectiveness challenges which have been identified in past work. Accordingly, this thesis makes several contributions towards advancing the state of the art in P2P-IR effectiveness by improving the query processing and relevance scoring aspects of a P2P web search. Federated search systems are a form of distributed information retrieval model that route the user’s information need, formulated as a query, to distributed resources and merge the retrieved result lists into a final list. P2P-IR networks are one form of federated search in routing queries and merging result among participating peers. The query is propagated through disseminated nodes to hit the peers that are most likely to contain relevant documents, then the retrieved result lists are merged at different points along the path from the relevant peers to the query initializer (or namely, customer). However, query routing in P2P-IR networks is considered as one of the major challenges and critical part in P2P-IR networks; as the relevant peers might be lost in low-quality peer selection while executing the query routing, and inevitably lead to less effective retrieval results. This motivates this thesis to study and propose query routing techniques to improve retrieval quality in such networks. Cluster-based semi-structured P2P-IR networks exploit the cluster hypothesis to organise the peers into similar semantic clusters where each such semantic cluster is managed by super-peers. In this thesis, I construct three semi-structured P2P-IR models and examine their retrieval effectiveness. I also leverage the cluster centroids at the super-peer level as content representations gathered from cooperative peers to propose a query routing approach called Inverted PeerCluster Index (IPI) that simulates the conventional inverted index of the centralised corpus to organise the statistics of peers’ terms. The results show a competitive retrieval quality in comparison to baseline approaches. Furthermore, I study the applicability of using the conventional Information Retrieval models as peer selection approaches where each peer can be considered as a big document of documents. The experimental evaluation shows comparative and significant results and explains that document retrieval methods are very effective for peer selection that brings back the analogy between documents and peers. Additionally, Learning to Rank (LtR) algorithms are exploited to build a learned classifier for peer ranking at the super-peer level. The experiments show significant results with state-of-the-art resource selection methods and competitive results to corresponding classification-based approaches. Finally, I propose reputation-based query routing approaches that exploit the idea of providing feedback on a specific item in the social community networks and manage it for future decision-making. The system monitors users’ behaviours when they click or download documents from the final ranked list as implicit feedback and mines the given information to build a reputation-based data structure. The data structure is used to score peers and then rank them for query routing. I conduct a set of experiments to cover various scenarios including noisy feedback information (i.e, providing positive feedback on non-relevant documents) to examine the robustness of reputation-based approaches. The empirical evaluation shows significant results in almost all measurement metrics with approximate improvement more than 56% compared to baseline approaches. Thus, based on the results, if one were to choose one technique, reputation-based approaches are clearly the natural choices which also can be deployed on any P2P network.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Context Seed dispersal is recognized as having profound effects on the distribution, dynamics and structure of plant populations and communities. However, knowledge of how landscape structure shapes carnivore-mediated seed dispersal patterns is still scarce, thereby limiting our understanding of large-scale plant population processes. Objectives We aim to determine how the amount and spatial configuration of forest cover impacted the relative abundance of carnivorous mammals, and how these effects cascaded through the seed dispersal kernels they generated. Methods Camera traps activated by animal movement were used for carnivore sampling. Colour-coded seed mimics embedded in common figs were used to know the exact origin of the dispersed seed mimics later found in carnivore scats. We applied this procedure in two sites differing in landscape structure. Results We did not find between-site differences in the relative abundance of the principal carnivore species contributing to seed dispersal patterns, Martes foina. Mean dispersal distance and the probability of long dispersal events were higher in the site with spatially continuous and abundant forest cover, compared to the site with spatially aggregated and scarcer forest cover. Seed deposition closely matched the spatial patterning of forest cover in both study sites, suggesting behaviour-based mechanisms underpinning seed dispersal patterns generated by individual frugivore species. Conclusions Our results provide the first empirical evidence of the impact of landscape structure on carnivore-mediated seed dispersal kernels. They also indicate that seed dispersal kernels generated strongly depend on the effect that landscape structure exerts on carnivore populations, particularly on habitat-use preferences.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper presents our approach of identifying the profile of an unknown user based on the activities of known users. The aim of author profiling task of PAN@CLEF 2016 is cross-genre identification of the gender and age of an unknown user. This means training the system using the behavior of different users from one social media platform and identifying the profile of other user on some different platform. Instead of using single classifier to build the system we used a combination of different classifiers, also known as stacking. This approach allowed us explore the strength of all the classifiers and minimize the bias or error enforced by a single classifier.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper describes various experiments done to investigate author profiling of tweets in 4 different languages – English, Dutch, Italian, and Spanish. Profiling consists of age and gender classification, as well as regression on 5 different person- ality dimensions – extroversion, stability, agreeableness, open- ness, and conscientiousness. Different sets of features were tested – bag-of-words, word ngrams, POS ngrams, and average of word embeddings. SVM was used as the classifier. Tfidf worked best for most English tasks while for most of the tasks from the other languages, the combination of the best features worked better.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In this article, we describe a novel methodology to extract semantic characteristics from protein structures using linear algebra in order to compose structural signature vectors which may be used efficiently to compare and classify protein structures into fold families. These signatures are built from the pattern of hydrophobic intrachain interactions using Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) techniques. Considering proteins as documents and contacts as terms, we have built a retrieval system which is able to find conserved contacts in samples of myoglobin fold family and to retrieve these proteins among proteins of varied folds with precision of up to 80%. The classifier is a web tool available at our laboratory website. Users can search for similar chains from a specific PDB, view and compare their contact maps and browse their structures using a JMol plug-in.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Collecting ground truth data is an important step to be accomplished before performing a supervised classification. However, its quality depends on human, financial and time ressources. It is then important to apply a validation process to assess the reliability of the acquired data. In this study, agricultural infomation was collected in the Brazilian Amazonian State of Mato Grosso in order to map crop expansion based on MODIS EVI temporal profiles. The field work was carried out through interviews for the years 2005-2006 and 2006-2007. This work presents a methodology to validate the training data quality and determine the optimal sample to be used according to the classifier employed. The technique is based on the detection of outlier pixels for each class and is carried out by computing Mahalanobis distances for each pixel. The higher the distance, the further the pixel is from the class centre. Preliminary observations through variation coefficent validate the efficiency of the technique to detect outliers. Then, various subsamples are defined by applying different thresholds to exclude outlier pixels from the classification process. The classification results prove the robustness of the Maximum Likelihood and Spectral Angle Mapper classifiers. Indeed, those classifiers were insensitive to outlier exclusion. On the contrary, the decision tree classifier showed better results when deleting 7.5% of pixels in the training data. The technique managed to detect outliers for all classes. In this study, few outliers were present in the training data, so that the classification quality was not deeply affected by the outliers.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Crop monitoring and more generally land use change detection are of primary importance in order to analyze spatio-temporal dynamics and its impacts on environment. This aspect is especially true in such a region as the State of Mato Grosso (south of the Brazilian Amazon Basin) which hosts an intensive pioneer front. Deforestation in this region as often been explained by soybean expansion in the last three decades. Remote sensing techniques may now represent an efficient and objective manner to quantify how crops expansion really represents a factor of deforestation through crop mapping studies. Due to the special characteristics of the soybean productions' farms in Mato Grosso (area varying between 1000 hectares and 40000 hectares and individual fields often bigger than 100 hectares), the Moderate Resolution Imaging Spectroradiometer (MODIS) data with a near daily temporal resolution and 250 m spatial resolution can be considered as adequate resources to crop mapping. Especially, multitemporal vegetation indices (VI) studies have been currently used to realize this task [1] [2]. In this study, 16-days compositions of EVI (MODQ13 product) data are used. However, although these data are already processed, multitemporal VI profiles still remain noisy due to cloudiness (which is extremely frequent in a tropical region such as south Amazon Basin), sensor problems, errors in atmospheric corrections or BRDF effect. Thus, many works tried to develop algorithms that could smooth the multitemporal VI profiles in order to improve further classification. The goal of this study is to compare and test different smoothing algorithms in order to select the one which satisfies better to the demand which is classifying crop classes. Those classes correspond to 6 different agricultural managements observed in Mato Grosso through an intensive field work which resulted in mapping more than 1000 individual fields. The agricultural managements above mentioned are based on combination of soy, cotton, corn, millet and sorghum crops sowed in single or double crop systems. Due to the difficulty in separating certain classes because of too similar agricultural calendars, the classification will be reduced to 3 classes : Cotton (single crop), Soy and cotton (double crop), soy (single or double crop with corn, millet or sorghum). The classification will use training data obtained in the 2005-2006 harvest and then be tested on the 2006-2007 harvest. In a first step, four smoothing techniques are presented and criticized. Those techniques are Best Index Slope Extraction (BISE) [3], Mean Value Iteration (MVI) [4], Weighted Least Squares (WLS) [5] and Savitzky-Golay Filter (SG) [6] [7]. These techniques are then implemented and visually compared on a few individual pixels so that it allows doing a first selection between the five studied techniques. The WLS and SG techniques are selected according to criteria proposed by [8]. Those criteria are: ability in eliminating frequent noises, conserving the upper values of the VI profiles and keeping the temporality of the profiles. Those selected algorithms are then programmed and applied to the MODIS/TERRA EVI data (16-days composition periods). Tests of separability are realized based on the Jeffries-Matusita distance in order to see if the algorithms managed in improving the potential of differentiation between the classes. Those tests are realized on the overall profile (comprising 23 MODIS images) as well as on each MODIS sub-period of the profile [1]. This last test is a double interest process because it allows comparing the smoothing techniques and also enables to select a set of images which carries more information on the separability between the classes. Those selected dates can then be used to realize a supervised classification. Here three different classifiers are tested to evaluate if the smoothing techniques as a particular effect on the classification depending on the classifiers used. Those classifiers are Maximum Likelihood classifier, Spectral Angle Mapper (SAM) classifier and CHAID Improved Decision tree. It appears through the separability tests on the overall process that the smoothed profiles don't improve efficiently the potential of discrimination between classes when compared with the original data. However, the same tests realized on the MODIS sub-periods show better results obtained with the smoothed algorithms. The results of the classification confirm this first analyze. The Kappa coefficients are always better with the smoothing techniques and the results obtained with the WLS and SG smoothed profiles are nearly equal. However, the results are different depending on the classifier used. The impact of the smoothing algorithms is much better while using the decision tree model. Indeed, it allows a gain of 0.1 in the Kappa coefficient. While using the Maximum Likelihood end SAM models, the gain remains positive but is much lower (Kappa improved of 0.02 only). Thus, this work's aim is to prove the utility in smoothing the VI profiles in order to improve the final results. However, the choice of the smoothing algorithm has to be made considering the original data used and the classifier models used. In that case the Savitzky-Golay filter gave the better results.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

L'abbandono del cliente, ossia il customer churn, si riferisce a quando un cliente cessa il suo rapporto con l'azienda. In genere, le aziende considerano un cliente come perso quando un determinato periodo di tempo è trascorso dall'ultima interazione del cliente con i servizi dell'azienda. La riduzione del tasso di abbandono è quindi un obiettivo di business chiave per ogni attività. Per riuscire a trattenere i clienti che stanno per abbandonare l'azienda, è necessario: prevedere in anticipo quali clienti abbandoneranno; sapere quali azioni di marketing avranno maggiore impatto sulla fidelizzazione di ogni particolare cliente. L'obiettivo della tesi è lo studio e l'implementazione di un sistema di previsione dell'abbandono dei clienti in una catena di palestre: il sistema è realizzato per conto di Technogym, azienda leader nel mercato del fitness. Technogym offre già un servizio di previsione del rischio di abbandono basato su regole statiche. Tale servizio offre risultati accettabili ma è un sistema che non si adatta automaticamente al variare delle caratteristiche dei clienti nel tempo. Con questa tesi si sono sfruttate le potenzialità offerte dalle tecnologie di apprendimento automatico, per cercare di far fronte ai limiti del sistema storicamente utilizzato dall'azienda. Il lavoro di tesi ha previsto tre macro-fasi: la prima fase è la comprensione e l'analisi del sistema storico, con lo scopo di capire la struttura dei dati, di migliorarne la qualità e di approfondirne tramite analisi statistiche il contenuto informativo in relazione alle features definite dagli algoritmi di apprendimento automatico. La seconda fase ha previsto lo studio, la definizione e la realizzazione di due modelli di ML basati sulle stesse features ma utilizzando due tecnologie differenti: Random Forest Classifier e il servizio AutoML Tables di Google. La terza fase si è concentrata su una valutazione comparativa delle performance dei modelli di ML rispetto al sistema storico.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The aim of this thesis project is to automatically localize HCC tumors in the human liver and subsequently predict if the tumor will undergo microvascular infiltration (MVI), the initial stage of metastasis development. The input data for the work have been partially supplied by Sant'Orsola Hospital and partially downloaded from online medical databases. Two Unet models have been implemented for the automatic segmentation of the livers and the HCC malignancies within it. The segmentation models have been evaluated with the Intersection-over-Union and the Dice Coefficient metrics. The outcomes obtained for the liver automatic segmentation are quite good (IOU = 0.82; DC = 0.35); the outcomes obtained for the tumor automatic segmentation (IOU = 0.35; DC = 0.46) are, instead, affected by some limitations: it can be state that the algorithm is almost always able to detect the location of the tumor, but it tends to underestimate its dimensions. The purpose is to achieve the CT images of the HCC tumors, necessary for features extraction. The 14 Haralick features calculated from the 3D-GLCM, the 120 Radiomic features and the patients' clinical information are collected to build a dataset of 153 features. Now, the goal is to build a model able to discriminate, based on the features given, the tumors that will undergo MVI and those that will not. This task can be seen as a classification problem: each tumor needs to be classified either as “MVI positive” or “MVI negative”. Techniques for features selection are implemented to identify the most descriptive features for the problem at hand and then, a set of classification models are trained and compared. Among all, the models with the best performances (around 80-84% ± 8-15%) result to be the XGBoost Classifier, the SDG Classifier and the Logist Regression models (without penalization and with Lasso, Ridge or Elastic Net penalization).

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Intelligent systems are currently inherent to the society, supporting a synergistic human-machine collaboration. Beyond economical and climate factors, energy consumption is strongly affected by the performance of computing systems. The quality of software functioning may invalidate any improvement attempt. In addition, data-driven machine learning algorithms are the basis for human-centered applications, being their interpretability one of the most important features of computational systems. Software maintenance is a critical discipline to support automatic and life-long system operation. As most software registers its inner events by means of logs, log analysis is an approach to keep system operation. Logs are characterized as Big data assembled in large-flow streams, being unstructured, heterogeneous, imprecise, and uncertain. This thesis addresses fuzzy and neuro-granular methods to provide maintenance solutions applied to anomaly detection (AD) and log parsing (LP), dealing with data uncertainty, identifying ideal time periods for detailed software analyses. LP provides deeper semantics interpretation of the anomalous occurrences. The solutions evolve over time and are general-purpose, being highly applicable, scalable, and maintainable. Granular classification models, namely, Fuzzy set-Based evolving Model (FBeM), evolving Granular Neural Network (eGNN), and evolving Gaussian Fuzzy Classifier (eGFC), are compared considering the AD problem. The evolving Log Parsing (eLP) method is proposed to approach the automatic parsing applied to system logs. All the methods perform recursive mechanisms to create, update, merge, and delete information granules according with the data behavior. For the first time in the evolving intelligent systems literature, the proposed method, eLP, is able to process streams of words and sentences. Essentially, regarding to AD accuracy, FBeM achieved (85.64+-3.69)%; eGNN reached (96.17+-0.78)%; eGFC obtained (92.48+-1.21)%; and eLP reached (96.05+-1.04)%. Besides being competitive, eLP particularly generates a log grammar, and presents a higher level of model interpretability.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In this work, we explore and demonstrate the potential for modeling and classification using quantile-based distributions, which are random variables defined by their quantile function. In the first part we formalize a least squares estimation framework for the class of linear quantile functions, leading to unbiased and asymptotically normal estimators. Among the distributions with a linear quantile function, we focus on the flattened generalized logistic distribution (fgld), which offers a wide range of distributional shapes. A novel naïve-Bayes classifier is proposed that utilizes the fgld estimated via least squares, and through simulations and applications, we demonstrate its competitiveness against state-of-the-art alternatives. In the second part we consider the Bayesian estimation of quantile-based distributions. We introduce a factor model with independent latent variables, which are distributed according to the fgld. Similar to the independent factor analysis model, this approach accommodates flexible factor distributions while using fewer parameters. The model is presented within a Bayesian framework, an MCMC algorithm for its estimation is developed, and its effectiveness is illustrated with data coming from the European Social Survey. The third part focuses on depth functions, which extend the concept of quantiles to multivariate data by imposing a center-outward ordering in the multivariate space. We investigate the recently introduced integrated rank-weighted (IRW) depth function, which is based on the distribution of random spherical projections of the multivariate data. This depth function proves to be computationally efficient and to increase its flexibility we propose different methods to explicitly model the projected univariate distributions. Its usefulness is shown in classification tasks: the maximum depth classifier based on the IRW depth is proven to be asymptotically optimal under certain conditions, and classifiers based on the IRW depth are shown to perform well in simulated and real data experiments.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Even without formal guarantees of their effectiveness, adversarial attacks against Machine Learning models frequently fool new defenses. We identify six key asymmetries that contribute to this phenomenon and formulate four guidelines to build future-proof defenses by preventing such asymmetries. We also prove that attacking a classifier is NP-complete, while defending from such attacks is Sigma_2^P-complete. We then introduce Counter-Attack (CA), an asymmetry-free metadefense that determines whether a model is robust on a given input by estimating its distance from the decision boundary. Under specific assumptions CA can provide theoretical detection guarantees. Additionally, we prove that while CA is NP-complete, fooling CA is Sigma_2^P-complete. Even when using heuristic relaxations, we show that our method can reliably identify non-robust points. As part of our experimental evaluation, we introduce UG100, a new dataset obtained by applying a provably optimal attack to six limited-scale networks (three for MNIST and three for CIFAR10), each trained in three different manners.