998 results for Portable Document Format
Abstract:
In this thesis we investigate the use of quantum probability theory for ranking documents. Quantum probability theory is used to estimate the probability of relevance of a document given a user's query. We posit that quantum probability theory can lead to a better estimation of the probability of a document being relevant to a user's query than the common approach, i.e. the Probability Ranking Principle (PRP), which is based upon Kolmogorovian probability theory. Following our hypothesis, we formulate an analogy between the document retrieval scenario and a physical scenario, that of the double-slit experiment. Through the analogy, we propose a novel ranking approach, the quantum probability ranking principle (qPRP). Key to our proposal is the presence of quantum interference. Mathematically, this is the statistical deviation between empirical observations and the expected values predicted by the Kolmogorovian rule of additivity of probabilities of disjoint events in configurations such as that of the double-slit experiment. We propose an interpretation of quantum interference in the document ranking scenario, and examine how quantum interference can be effectively estimated for document retrieval. To validate our proposal and to gain more insights about approaches for document ranking, we (1) analyse PRP, qPRP and other ranking approaches, exposing the assumptions underlying their ranking criteria and formulating the conditions for the optimality of the two ranking principles, (2) empirically compare three ranking principles (i.e. PRP, interactive PRP, and qPRP) and two state-of-the-art ranking strategies in two retrieval scenarios, those of ad-hoc retrieval and diversity retrieval, (3) analytically contrast the ranking criteria of the examined approaches, exposing similarities and differences, and (4) study the ranking behaviours of approaches alternative to PRP in terms of the kinematics they impose on relevant documents, i.e. by considering the extent and direction of the movements of relevant documents across the ranking recorded when comparing PRP against its alternatives. Our findings show that the effectiveness of the examined ranking approaches strongly depends upon the evaluation context. In the traditional evaluation context of ad-hoc retrieval, PRP is empirically shown to be better than or comparable to alternative ranking approaches. However, when we turn to evaluation contexts that account for interdependent document relevance (i.e. when the relevance of a document is assessed also with respect to other retrieved documents, as is the case in the diversity retrieval scenario), the use of quantum probability theory, and thus of qPRP, is shown to improve retrieval and ranking effectiveness over the traditional PRP and alternative ranking strategies, such as Maximal Marginal Relevance, Portfolio theory, and Interactive PRP. This work represents a significant step forward regarding the use of quantum theory in information retrieval. Indeed, it demonstrates that the application of quantum theory to problems within information retrieval can lead to improvements both in modelling power and retrieval effectiveness, allowing the construction of models that capture the complexity of information retrieval situations. Furthermore, the thesis opens up a number of lines for future research.
These include: (1) investigating estimations and approximations of quantum interference in qPRP; (2) exploiting complex numbers for the representation of documents and queries; and (3) applying the concepts underlying qPRP to tasks other than document ranking.
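To make the interference term concrete, recall the double-slit law that motivates the analogy; this is the standard quantum probability formula, given here in generic notation rather than the thesis's own symbols:

```latex
% For two "paths" (slits) A and B with amplitudes \psi_A, \psi_B,
% and p_A = |\psi_A|^2, p_B = |\psi_B|^2:
p_{AB} = |\psi_A + \psi_B|^2
       = \underbrace{p_A + p_B}_{\text{Kolmogorovian additivity}}
       + \underbrace{2\sqrt{p_A\,p_B}\,\cos\theta}_{\text{quantum interference}}
```

When the interference term vanishes the Kolmogorovian rule is recovered; a non-zero term is exactly the deviation the thesis proposes to interpret and estimate in the document ranking scenario.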
Abstract:
Management of the industrial nations' hazardous waste is a pressing and exponentially growing global threat. Improved environmental information must be obtained and managed concerning the current status, temporal dynamics and potential future status of these critical sites. To test the application of spatial environmental techniques to the problem of hazardous waste sites, a Superfund (CERCLA) test site was chosen in an industrial/urban valley experiencing severe TCE, PCE, and CTC groundwater contamination. A paradigm is presented for investigating the spatial/environmental tools available for the mapping, monitoring and modelling of the environment and its toxic contaminant plumes. This model incorporates a range of technical issues concerning the collection of data as augmented by remote sensing tools, the formatting and storage of data utilizing geographic information systems, and the analysis and modelling of the environment through the use of advanced GIS analysis algorithms and geophysical models of hydrologic transport, including statistical surface generation. This spatially based approach is evaluated against current government/industry standards of operation. Advantages of the spatial approach and lessons learned are discussed.
Abstract:
In the current market, extensive software development is taking place and the software industry is thriving. Major software companies have identified source code theft as a significant threat to revenues. By inserting an identity-establishing watermark in the source code, a company can prove its ownership of the source code. In this paper, we propose a watermarking scheme for C/C++ source code that exploits the language's restrictions. If a function calls another function, the latter needs to be defined in the code before the former, unless one uses function pre-declarations. We embed the watermark in the code by imposing an ordering on otherwise mutually independent functions through the introduction of bogus dependencies. Removing these dependencies to erase the watermark requires extensive manual intervention, thereby making the attack infeasible. The scheme is also secure against subtractive and additive attacks. Using our watermarking scheme, an n-bit watermark can be embedded in a program having n independent functions. The scheme was implemented on several sample programs and the resulting performance changes are analyzed.
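As a hedged illustration of the language restriction being exploited (the function names and the zero-argument bogus call are invented for this sketch, not taken from the paper), the following C fragment makes two otherwise independent functions order-dependent, so that the chosen ordering of such functions can encode watermark bits:

```c
#include <stdio.h>

/* parse_input and render_output are mutually independent: neither
 * needs the other's result. The bogus call below forces parse_input
 * to be defined before render_output, fixing their relative order.
 * A permutation chosen over n such functions encodes an n-bit mark. */

static int parse_input(int x) {
    return x * 2;                      /* real work */
}

static int render_output(int x) {
    /* Bogus dependency: parse_input(0) == 0, so the call changes
     * nothing, but without a pre-declaration it requires parse_input
     * to appear earlier in the file. */
    return x + 1 + parse_input(0);
}

int main(void) {
    printf("%d\n", render_output(parse_input(21)));   /* prints 43 */
    return 0;
}
```

Swapping the two definitions without adding a forward declaration turns the bogus call into an undeclared-function error in modern C and in C++, which is the ordering constraint the scheme relies on.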
Abstract:
Newsletter ACM SIGIR Forum: The Seventeenth Australasian Document Computing Symposium was held in Dunedin, New Zealand on the 5th and 6th of December 2012. In total, twenty-four papers were submitted; eleven were accepted for full presentation and eight for short presentation. A poster session was held jointly with the Australasian Language Technology Workshop.
Abstract:
With the growing size and variety of social media files on the web, it is becoming critical to efficiently organize them into clusters for further processing. This paper presents a novel scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating the distance between a document and all of the clusters' centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory dependent, in-memory and in-database processing are combined in a semi-incremental manner. The method has been extensively tested in the social event detection application. Empirical analysis shows that the proposed method is efficient in both computation and memory usage while producing notable accuracy.
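A minimal sketch of the two-stage assignment idea follows; the toy vectors, the shortlist size, and the use of a plain dot product as the cheap ranking stage are all assumptions of this sketch (the paper's ranking stage is a search engine over large text data):

```c
#include <math.h>
#include <stdio.h>

#define DIM 5        /* toy vocabulary size */
#define K 3          /* number of clusters */
#define SHORTLIST 2  /* size of the candidate neighborhood */

static const double centroid[K][DIM] = {
    {1, 0, 0, 1, 0},
    {0, 1, 1, 0, 0},
    {0, 0, 1, 1, 1},
};

static double dot(const double *a, const double *b) {
    double s = 0;
    for (int i = 0; i < DIM; i++) s += a[i] * b[i];
    return s;
}

static double cosine(const double *a, const double *b) {
    double na = sqrt(dot(a, a)), nb = sqrt(dot(b, b));
    return (na > 0 && nb > 0) ? dot(a, b) / (na * nb) : 0.0;
}

/* Assign a document to a cluster in two stages:
 * 1) rank all clusters with a cheap score (here a raw dot product,
 *    standing in for the search-engine ranking) and keep a shortlist;
 * 2) compute the exact cosine similarity only against the shortlist,
 *    instead of against all K centroids. */
static int assign(const double *doc) {
    int cand[SHORTLIST];
    double cand_score[SHORTLIST];
    for (int j = 0; j < SHORTLIST; j++) { cand[j] = -1; cand_score[j] = -1.0; }

    for (int c = 0; c < K; c++) {                 /* stage 1: cheap ranking */
        double s = dot(doc, centroid[c]);
        for (int j = 0; j < SHORTLIST; j++) {
            if (s > cand_score[j]) {              /* insert into top list */
                for (int m = SHORTLIST - 1; m > j; m--) {
                    cand[m] = cand[m - 1];
                    cand_score[m] = cand_score[m - 1];
                }
                cand[j] = c;
                cand_score[j] = s;
                break;
            }
        }
    }

    int best = cand[0];                           /* stage 2: exact scoring */
    double best_cos = -1.0;
    for (int j = 0; j < SHORTLIST; j++) {
        if (cand[j] < 0) continue;
        double cs = cosine(doc, centroid[cand[j]]);
        if (cs > best_cos) { best_cos = cs; best = cand[j]; }
    }
    return best;
}

int main(void) {
    const double doc[DIM] = {0, 1, 1, 0, 1};
    printf("assigned to cluster %d\n", assign(doc)); /* cluster 1 here */
    return 0;
}
```

The saving comes from stage 2: exact similarity is computed for SHORTLIST candidates rather than all K centroids, which matters when K is large.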
Abstract:
This article presents a study of how humans perceive and judge the relevance of documents. Humans are adept at making reasonably robust and quick decisions about what information is relevant to them, despite the ever increasing complexity and volume of their surrounding information environment. The literature on document relevance has identified various dimensions of relevance (e.g., topicality, novelty); however, little is understood about how these dimensions may interact. We performed a crowdsourced study of how human subjects judge two relevance dimensions in relation to document snippets retrieved from an internet search engine. The order of the judgments was controlled. For those judgments exhibiting an order effect, a q-test was performed to determine whether the order effects can be explained by a quantum decision model based on incompatible decision perspectives. Some evidence of incompatibility was found, which suggests that incompatible decision perspectives are appropriate for explaining interacting dimensions of relevance in such instances.
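For readers unfamiliar with the test, the q-statistic in studies of this kind typically checks the quantum question (QQ) equality, a parameter-free prediction of the quantum model; the notation below is a generic sketch, not necessarily the article's own:

```latex
% Two binary judgments A and B asked in both orders.
% p_{AB}(yy) = probability of "yes" to both when A is judged first.
% The quantum model with incompatible perspectives predicts q = 0
% even when individual response probabilities show order effects.
q = \bigl[p_{AB}(yy) + p_{AB}(nn)\bigr]
  - \bigl[p_{BA}(yy) + p_{BA}(nn)\bigr]
```

Observing an order effect while q stays near zero is the pattern read as evidence for incompatible decision perspectives.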
Impact of child labor on academic performance: evidence from the program "Edúcame Primero Colombia"
Abstract:
In this study, the effects of different child labor variables on academic performance are investigated. To this end, 3302 children participating in the child labor eradication program “Edúcame Primero Colombia” were interviewed. The interview format used for the children's enrollment into the program was a template from which socioeconomic conditions, academic performance, and child labor variables were evaluated. The academic performance factor was determined using the Analytic Hierarchy Process (AHP). The data were analyzed through a logistic regression model restricted to children who engaged in some type of labor (n = 921). The results showed that labor conditions, the number of weekly hours dedicated to work, and the presence of work scheduled in the morning negatively affected the academic performance of child laborers. These results indicate that the relationship between child labor and academic performance stems from the conflict between the two activities, rather than being a simple linear relationship associated with the mere presence or absence of child labor. This study has implications for the formulation of policies, programs, and interventions for preventing, eradicating, and attenuating the negative effects of child labor on the social and educational development of children.
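As a sketch of the analysis, a logistic regression of this kind has the following form; the particular predictors shown are inferred from the variables the abstract names, not the study's exact specification:

```latex
% y_i = 1 if working child i shows low academic performance.
\Pr(y_i = 1) =
  \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{hours}_i
    + \beta_2\,\text{morning}_i + \beta_3\,\text{conditions}_i)}}
```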
Abstract:
This thesis presents new methods for classification and thematic grouping of billions of web pages, at scales previously not achievable. This process is also known as document clustering, where similar documents are automatically associated with clusters that represent various distinct topics. These automatically discovered topics are in turn used to improve search engine performance by searching only the topics that are deemed relevant to particular user queries.
Abstract:
This paper presents the details of an experimental study of a cold-formed steel hollow flange channel beam known as LiteSteel beam (LSB) subject to web crippling under End Two Flange (ETF) and Interior Two Flange (ITF) load cases. The LSB sections with two rectangular hollow flanges are made using a simultaneous cold-forming and electric resistance welding process. Due to the geometry of the LSB, and its unique residual stress characteristics and initial geometric imperfections, much of the existing research for common cold-formed steel sections is not directly applicable to LSB. Experimental and numerical studies have been carried out to evaluate the behaviour and design of LSBs subject to pure bending, predominant shear and combined actions. To date, however, no investigation has been conducted on the web crippling behaviour and strength of LSB sections. Hence an experimental study was conducted to investigate the web crippling behaviour and capacities of LSBs. Twenty-eight web crippling tests were conducted under ETF and ITF load cases, and the ultimate web crippling capacities were compared with the predictions from the design equations in AS/NZS 4600 and AISI S100. This comparison showed that AS/NZS 4600 and AISI S100 web crippling design equations are unconservative for LSB sections under ETF and ITF load cases. Hence new equations were proposed to determine the web crippling capacities of LSBs based on experimental results. Suitable design rules were also developed under the direct strength method (DSM) format.
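For context, "DSM format" refers to the direct strength method's characteristic strength curve, which expresses member capacity in terms of the yield and elastic buckling capacities. The generic curve below (the local buckling form codified in AISI S100) is shown only to illustrate the format; it is not the paper's fitted web crippling equations:

```latex
% \lambda = \sqrt{R_y / R_{cr}}: slenderness,
% R_y: yield capacity, R_{cr}: elastic buckling capacity.
R_n =
\begin{cases}
  R_y, & \lambda \le 0.776,\\[4pt]
  \left[1 - 0.15\left(\frac{R_{cr}}{R_y}\right)^{0.4}\right]
  \left(\frac{R_{cr}}{R_y}\right)^{0.4} R_y, & \lambda > 0.776.
\end{cases}
```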
Abstract:
Cold-formed high strength steel members are increasingly used as primary load bearing components in low rise buildings. Lipped channel beam (LCB) is one of the most commonly used flexural members in these applications. In this research an experimental study was undertaken to investigate the shear behaviour and strengths of LCB sections. Simply supported test specimens of back to back LCBs with aspect ratios of 1.0 and 1.5 were loaded at mid-span until failure. Test specimens were chosen such that all three types of shear failure (shear yielding, inelastic and elastic shear buckling) occurred in the tests. The ultimate shear capacity results obtained from the tests were compared with the predictions from the current design rules in Australian/New Zealand and American cold-formed steel design standards. This comparison showed that these shear design rules are very conservative as they did not include the post-buckling strength observed in the shear tests and the higher shear buckling coefficient due to the additional fixity along the web-flange juncture. Improved shear design equations are proposed in this paper by including the above beneficial effects. Suitable lower bound design rules were also developed under the direct strength method format. This paper presents the details of this experimental study and the results including the improved design rules for the shear capacity of LCBs. It also includes the details of tests of LCBs subject to combined shear and flange distortion, and combined bending and shear actions, and proposes suitable design rules to predict the capacities in these cases.
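The shear buckling coefficient mentioned above enters through the classical elastic shear buckling stress of the web plate, a standard result (d_1 is the clear web depth, t_w the web thickness; greater fixity at the web-flange juncture raises k_v above its simply supported value):

```latex
\tau_{cr} = \frac{k_v\,\pi^2 E}{12\,(1 - \nu^2)\,(d_1 / t_w)^2}
```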
Abstract:
The use of ‘topic’ concepts has been shown to improve search performance, given a query, by bringing together relevant documents which use different terms to describe a higher-level concept. In this paper, we propose a method for discovering and utilizing concepts in indexing and search for a domain-specific document collection utilized in industry. This approach differs from others in that we only collect focused concepts to build the concept space, and in that, instead of turning a user’s query into a concept-based query, we experiment with different techniques for combining the original query with a concept query. We apply the proposed approach to a real-world document collection and the results show that, in this scenario, the use of concept knowledge at indexing and search time can improve the relevance of results.
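One simple way to picture the combination techniques is a weighted interpolation of the two retrieval scores; this linear form is an illustrative sketch (the mixing weight \lambda and the score functions are generic assumptions, not the paper's exact formulation):

```latex
% d: document, q: original query, q_c: concept query derived from q
\mathrm{score}(d) = \lambda\,\mathrm{sim}(q, d)
                  + (1 - \lambda)\,\mathrm{sim}(q_c, d),
\qquad \lambda \in [0, 1]
```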
Abstract:
Clustering is an important technique for organising and categorising web-scale document collections. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets involved. More importantly, it is nigh impossible to generate ground-truth labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. This solution is based on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1,000 clusters. It takes longer to execute a solution with finer-granularity clusterings, such as 10,000 or 50,000 clusters. These results were evaluated using a set of external data.
Abstract:
For traditional information filtering (IF) models, it is often assumed that the documents in one collection are related to only one topic. However, in reality users’ interests can be diverse and the documents in the collection often involve multiple topics. Topic modelling was proposed to generate statistical models that represent multiple topics in a collection of documents, but in a topic model topics are represented by distributions over words, which have limited power to distinctively represent the semantics of topics. Patterns are generally considered more discriminative than single terms and are able to reveal the inner relations between words. This paper proposes a novel information filtering model, the Significant matched Pattern-based Topic Model (SPBTM). The SPBTM represents user information needs in terms of multiple topics, with each topic represented by patterns. More importantly, the patterns are organized into groups based on their statistical and taxonomic features, from which the more representative patterns, called Significant Matched Patterns, can be identified and used to estimate document relevance. Experiments on benchmark data sets demonstrate that the SPBTM significantly outperforms the state-of-the-art models.
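In outline, relevance estimation in such a model scores an incoming document by the significant matched patterns it contains under each topic of the user model; the expression below is a hedged reconstruction from the abstract (the weighting function w and the topic mixture are assumptions), not the SPBTM's published equation:

```latex
% d: incoming document, V: number of topics in the user model,
% SMP_j: significant matched patterns of topic j, w(p): pattern weight.
\mathrm{rank}(d) = \sum_{j=1}^{V} \Pr(\text{topic}_j \mid \text{user})
  \sum_{\substack{p \in SMP_j \\ p \subseteq d}} w(p)
```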
Abstract:
QUT Library Research Support has simplified and streamlined the process of research data management planning, storage, discovery and reuse through collaboration, the use of integrated and tailored online tools, and a simplification of the metadata schema. This poster presents the integrated data management services at QUT, including QUT’s Data Management Planning Tool, Research Data Finder, Spatial Data Finder and Software Finder, together with information on the simplified Registry Interchange Format – Collections and Services (RIF-CS) Schema. The QUT Data Management Planning (DMP) Tool was built using the Digital Curation Centre’s DMP Online Tool and modified to suit QUT’s needs and policies. The tool allows researchers and Higher Degree Research students to plan how to handle research data throughout the active phase of their research. The plan is promoted as a ‘live’ document and researchers are encouraged to update it as required. The information entered into the plan can be kept private or shared with supervisors, project members and external examiners. A plan is mandatory when requesting storage space on the QUT Research Data Storage Service. QUT’s Research Data Finder is integrated with QUT’s Academic Profiles and the Data Management Planning Tool to create a seamless data management process. This process aims to encourage the creation of high-quality, rich records which facilitate the discovery and reuse of quality data. The Registry Interchange Format – Collections and Services (RIF-CS) Schema used in the QUT Research Data Finder was simplified to “RIF-CS lite” to reflect mandatory and optional metadata requirements. RIF-CS lite removed schema fields that were underused or surplus to the needs of users and systems. This has reduced the number of metadata fields required from users and made the integration of systems far simpler: field content is easily shared across services, making the process of collecting metadata as transparent as possible.