973 results for Compressed text search
Abstract:
XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which have received little attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers.
Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
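As a rough illustration of separating full-text from data by relating character content to tags, the sketch below counts text characters per tag in each subtree and keeps the fragments above a cut-off. The function names `text_density` and `full_text_fragments`, the threshold value, and the sample document are all hypothetical; the abstract does not state the authors' actual measure, so this is only one plausible reading.

```python
import xml.etree.ElementTree as ET

def text_density(elem):
    """Characters of (stripped) text per tag in the subtree rooted at elem."""
    chars = sum(len(t.strip()) for t in elem.itertext())
    tags = sum(1 for _ in elem.iter())  # counts elem itself plus all descendants
    return chars / tags

def full_text_fragments(root, threshold=10.0):
    """Return the tags of root's children whose density suggests narrative text.

    The threshold is an arbitrary illustrative value, not taken from the source.
    """
    return [child.tag for child in root if text_density(child) >= threshold]

# Invented sample: one data-like record and one narrative section.
doc = ET.fromstring(
    "<doc>"
    "<record><id>42</id><year>2008</year></record>"
    "<section><p>Indexing methods keep <em>growing</em> XML collections accessible.</p></section>"
    "</doc>"
)
print(full_text_fragments(doc))
```

In this toy document the `record` element averages two characters per tag and is treated as data, while the `section` element is dense enough to be selected for full-text indexing.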
Abstract:
Search engines sometimes apply the search on the full text of documents or web-pages; but sometimes they can apply the search on selected parts of the documents only, e.g. their titles. Full-text search may consume a lot of computing resources and time. It may be possible to save resources by applying the search on the titles of documents only, assuming that a title of a document provides a concise representation of its content. We tested this assumption using the Google search engine. We ran search queries that have been defined by users, distinguishing between two types of queries/users: queries of users who are familiar with the area of the search, and queries of users who are not familiar with the area of the search. We found that searches which use titles provide similar and sometimes even (slightly) better results compared to searches which use the full-text. These results hold for both types of queries/users. Moreover, we found an advantage in title-search when searching in unfamiliar areas because the general terms used in queries in unfamiliar areas match better with general terms which tend to be used in document titles.
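Abstract:
This paper introduces the FM-index compressed search technique, describes the FM-index workflow in detail, and presents the algorithm for counting the occurrences of a string in compressed text. The FM-index source code was tested on a Linux platform, and the advantages and disadvantages of using the FM-index for compressed search are analysed on the basis of the test results. Finally, a preliminary design is given for a parallel algorithm that speeds up FM-index compression.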
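The counting step of an FM-index is usually done by backward search over the Burrows-Wheeler transform. The sketch below shows that idea in a minimal, uncompressed form: the helper names `bwt` and `count_occurrences` are chosen for illustration, the rank function is a naive scan, and the text is assumed to contain only characters that sort after the `$` terminator. It is not the source code tested in the paper.

```python
def bwt(text):
    """Burrows-Wheeler transform of text + '$' (the terminator is assumed absent from text)."""
    s = text + "$"
    # Naive suffix array: sort suffix start positions by the suffix they begin.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character is the one preceding each sorted suffix (wrapping around).
    return "".join(s[i - 1] for i in suffix_array)

def count_occurrences(text, pattern):
    """Count occurrences of pattern in text via FM-index style backward search."""
    last = bwt(text)
    # C[c] = number of characters in the text that are strictly smaller than c.
    C = {}
    for i, c in enumerate(sorted(last)):
        C.setdefault(c, i)

    def occ(c, i):
        # Rank query: occurrences of c in last[:i]. A real FM-index answers
        # this from compressed auxiliary tables instead of scanning.
        return last[:i].count(c)

    lo, hi = 0, len(last)  # half-open range of matching rows in the sorted suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences("abracadabra", "abra"))
```

Each pattern character narrows the row range by one step, so the number of rank queries depends on the pattern length rather than on the text length.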
Abstract:
A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
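The tree representation described above (a bit array of opening and closing brackets plus a label sequence) can be sketched as follows. The function names `encode`, `find_close`, and `subtree_size` are illustrative and not taken from SXSI, and the linear scan in `find_close` stands in for the constant-time operations that a succinct implementation would provide.

```python
import xml.etree.ElementTree as ET

def encode(elem, bits, labels):
    """Append the bracket encoding of elem's subtree: 1 = open, 0 = close."""
    bits.append(1)
    labels.append(elem.tag)  # labels are stored in document (preorder) order
    for child in elem:
        encode(child, bits, labels)
    bits.append(0)

def find_close(bits, i):
    """Index of the closing bracket that matches the opening bracket at i."""
    depth = 0
    for j in range(i, len(bits)):
        depth += 1 if bits[j] else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced bracket sequence")

def subtree_size(bits, i):
    """Number of nodes in the subtree whose opening bracket is at position i."""
    return (find_close(bits, i) - i + 1) // 2

root = ET.fromstring("<a><b><c/></b><d/></a>")
bits, labels = [], []
encode(root, bits, labels)
print(bits, labels, subtree_size(bits, 1))
```

Navigation then reduces to arithmetic on bracket positions, which is why such encodings take only about two bits per node plus the labels.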
Abstract:
Objective: To summarise the extent to which narrative text fields in administrative health data are used to gather information about the event resulting in presentation to a health care provider for treatment of an injury, and to highlight best practice approaches to conducting narrative text interrogation for injury surveillance purposes. Design: Systematic review. Data sources: Electronic databases searched included CINAHL, Google Scholar, Medline, Proquest, PubMed and PubMed Central. Snowballing strategies were employed by searching the bibliographies of retrieved references to identify relevant associated articles. Selection criteria: Papers were selected if the study used a health-related database and if the study objectives were to a) use text fields to identify injury cases or use text fields to extract additional information on injury circumstances not available from coded data or b) use text fields to assess accuracy of coded data fields for injury-related cases or c) describe methods/approaches for extracting injury information from text fields. Methods: The papers identified through the search were independently screened by two authors for inclusion, resulting in 41 papers selected for review. Due to heterogeneity between studies, meta-analysis was not performed. Results: The majority of papers reviewed focused on describing injury epidemiology trends using coded data and text fields to supplement coded data (28 papers), with these studies demonstrating the value of text data for providing more specific information beyond what had been coded to enable case selection or provide circumstantial information. Caveats were expressed in terms of the consistency and completeness of recording of text information resulting in underestimates when using these data. Four coding validation papers were reviewed with these studies showing the utility of text data for validating and checking the accuracy of coded data.
Seven studies (9 papers) described methods for interrogating injury text fields for systematic extraction of information, with a combination of manual and semi-automated methods used to refine and develop algorithms for extraction and classification of coded data from text. Quality assurance approaches to assessing the robustness of the methods for extracting text data were only discussed in 8 of the epidemiology papers, and 1 of the coding validation papers. All of the text interrogation methodology papers described systematic approaches to ensuring the quality of the approach. Conclusions: Manual review and coding approaches, text search methods, and statistical tools have been utilised to extract data from narrative text and translate it into useable, detailed injury event information. These techniques can and have been applied to administrative datasets to identify specific injury types and add value to previously coded injury datasets. Only a few studies thoroughly described the methods which were used for text mining and less than half of the studies which were reviewed used/described quality assurance methods for ensuring the robustness of the approach. New techniques utilising semi-automated computerised approaches and Bayesian/clustering statistical methods offer the potential to further develop and standardise the analysis of narrative text for injury surveillance.
Abstract:
Purpose - There are many library automation packages available as open-source software, comprising two modules: staff-client module and online public access catalogue (OPAC). Although the OPAC of these library automation packages provides advanced features of searching and retrieval of bibliographic records, none of them facilitate full-text searching. Most of the available open-source digital library software facilitates indexing and searching of full-text documents in different formats. This paper makes an effort to enable full-text search features in the widely used open-source library automation package Koha, by integrating it with two open-source digital library software packages, Greenstone Digital Library Software (GSDL) and Fedora Generic Search Service (FGSS), independently. Design/methodology/approach - The implementation is done by making use of the Search and Retrieval by URL (SRU) feature available in Koha, GSDL and FGSS. The full-text documents are indexed in Koha as well as in GSDL and FGSS. Findings - Full-text searching capability in Koha is achieved by integrating either GSDL or FGSS into Koha and by passing an SRU request to GSDL or FGSS from Koha. The full-text documents are indexed both in the library automation package (Koha) and digital library software (GSDL, FGSS). Originality/value - This is the first implementation enabling the full-text search feature in library automation software by integrating it into digital library software.
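An SRU request is an ordinary URL carrying a CQL query, so passing a search from one system to another amounts to building such a URL. The sketch below assembles a `searchRetrieve` request using standard SRU 1.1 parameter names; the base URL and the helper name `build_sru_url` are hypothetical, and the actual endpoints exposed by Koha, GSDL or FGSS are not given in the abstract.

```python
from urllib.parse import urlencode

def build_sru_url(base_url, cql_query, max_records=10):
    """Build an SRU searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",  # standard SRU operation name
        "version": "1.1",               # SRU protocol version assumed here
        "query": cql_query,             # CQL query string
        "maximumRecords": max_records,  # upper bound on returned records
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint; a real deployment would use the server's own SRU address.
url = build_sru_url("http://localhost:8080/sru", '"digital library"')
print(url)
```

The response to such a request is an XML document of matching records, which the calling system can parse and merge into its own result display.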
Abstract:
ACM Computing Classification System (1998): H3.3, H.5.5, J5.
Abstract:
Agile ridesharing aims to utilise the capability of social networks and mobile phones to help people share vehicles and travel in real time. However, the application of social networking technologies in local communities to address issues of personal transport faces significant design challenges. In this paper we describe an iterative design-based approach to exploring this problem and discuss findings from the use of an early prototype. The findings focus upon interaction, privacy and profiling. Our early results suggest that explicitly entering information such as ride data and personal profile data into formal fields for explicit computation of matches, as is done in many systems, may not be the best strategy. It might be preferable to support informal communication and negotiation with text search techniques.
Abstract:
Background: International data on child maltreatment are largely derived from child protection agencies, and predominantly report only substantiated cases of child maltreatment. This approach underestimates the incidence of maltreatment and makes inter-jurisdictional comparisons difficult. There has been a growing recognition of the importance of health professionals in identifying, documenting and reporting suspected child maltreatment. This study aimed to describe the issues around case identification using coded morbidity data, outline methods for selecting and grouping relevant codes, and illustrate patterns of maltreatment identified. Methods: A comprehensive review of the ICD-10-AM classification system was undertaken, including review of index terms, a free text search of tabular volumes, and a review of coding standards pertaining to child maltreatment coding. Identified codes were further categorised into maltreatment types including physical abuse, sexual abuse, emotional or psychological abuse, and neglect. Using these code groupings, one year of Australian hospitalisation data for children under 18 years of age was examined to quantify the proportion of patients identified and to explore the characteristics of cases assigned maltreatment-related codes. Results: Less than 0.5% of children hospitalised in Australia between 2005 and 2006 had a maltreatment code assigned. Almost 4% of children with a principal diagnosis of a mental and behavioural disorder and over 1% of children with an injury or poisoning as the principal diagnosis had a maltreatment code assigned. The patterns of children assigned with definitive T74 codes varied by sex and age group. For males selected as having a maltreatment-related presentation, physical abuse was most commonly coded (62.6% of maltreatment cases) while for females selected as having a maltreatment-related presentation, sexual abuse was the most commonly assigned form of maltreatment (52.9% of maltreatment cases).
Conclusion: This study has demonstrated that hospital data could provide valuable information for routine monitoring and surveillance of child maltreatment, even in the absence of population-based linked data sources. With national and international calls for a public health response to child maltreatment, better understanding of, investment in and utilisation of our core national routinely collected data sources will enhance the evidence-base needed to support an appropriate response to children at risk.
Abstract:
- Objective To explore the potential for using a basic text search of routine emergency department data to identify product-related injury in infants and to compare the patterns from routine ED data and specialised injury surveillance data. - Methods Data was sourced from the Emergency Department Information System (EDIS) and the Queensland Injury Surveillance Unit (QISU) for all injured infants between 2009 and 2011. A basic text search was developed to identify the top five infant products in QISU. Sensitivity, specificity, and positive predictive value were calculated and a refined search was used with EDIS. Results were manually reviewed to assess validity. Descriptive analysis was conducted to examine patterns between datasets. - Results The basic text search for all products showed high sensitivity and specificity, and most searches showed high positive predictive value. EDIS patterns were similar to QISU patterns with strikingly similar month-of-age injury peaks, admission proportions and types of injuries. - Conclusions This study demonstrated a capacity to identify a sample of valid cases of product-related injuries for specified products using simple text searching of routine ED data. - Implications As the capacity for large datasets grows and the capability to reliably mine text improves, opportunities for expanded sources of injury surveillance data increase. This will ultimately assist stakeholders such as consumer product safety regulators and child safety advocates to appropriately target prevention initiatives.
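The evaluation described above, a keyword search over free-text narratives scored against manually validated cases, can be sketched as below. The keywords, the sample narratives and their labels are invented for illustration and do not come from EDIS or QISU; the function names `text_search` and `evaluate` are likewise hypothetical.

```python
def text_search(narrative, keywords):
    """Return True if any keyword occurs in the narrative (case-insensitive)."""
    text = narrative.lower()
    return any(keyword in text for keyword in keywords)

def evaluate(records, keywords):
    """Compare search hits with reference labels and return the three measures.

    Assumes each denominator is non-zero, which holds for the sample below.
    """
    tp = fp = fn = tn = 0
    for narrative, is_case in records:
        hit = text_search(narrative, keywords)
        if hit and is_case:
            tp += 1
        elif hit and not is_case:
            fp += 1
        elif not hit and is_case:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases the search finds
        "specificity": tn / (tn + fp),  # share of non-cases the search rejects
        "ppv": tp / (tp + fp),          # share of hits that are true cases
    }

# Invented narratives paired with a manually assigned "product-related" label.
records = [
    ("Fell from high chair onto floor", True),
    ("fell off highchair at home", True),
    ("fell from hi chair", True),  # misspelling that the search misses
    ("rolled off bed", False),
    ("hit head on chair leg", False),
    ("mother sitting near high chair, infant fell from couch", False),
]
metrics = evaluate(records, ["high chair", "highchair"])
print(metrics)
```

The missed misspelling and the incidental mention show why manual review of results and refinement of the search terms matter for this kind of case identification.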
Abstract:
Query-focused summarization is the task of producing a compressed text from an original set of documents based on a query. Documents can be viewed as a graph with sentences as nodes, and edges can be added based on sentence similarity. Graph-based ranking algorithms that use a 'biased random surfer model', such as topic-sensitive LexRank, have been successfully applied to query-focused summarization. In these algorithms, the random walk is biased towards the sentences that contain query-relevant words. Specifically, it is assumed that the random surfer knows the query relevance score of the sentence to which he jumps. However, the neighbourhood information of that sentence is completely ignored. In this paper, we propose a look-ahead version of topic-sensitive LexRank. We assume that the random surfer not only knows the query relevance of the sentence to which he jumps but can also look N steps ahead from that sentence to find the query relevance scores of future sentences. Using this look-ahead information, we identify the sentences that are indirectly related to the query by looking at the number of hops needed to reach a sentence that has query-relevant words. We then bias the random walk towards these indirectly query-relevant sentences as well as the sentences that contain query-relevant words. Experimental results show a 20.2% increase in ROUGE-2 score compared to topic-sensitive LexRank on the DUC 2007 data set. Further, our system outperforms the best systems in DUC 2006, and the results are comparable to state-of-the-art systems.
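For orientation, the baseline biased random walk can be written as a short power iteration. The sketch below follows one common formulation of topic-sensitive LexRank, in which a sentence's score mixes its normalised query relevance with similarity-weighted scores of its neighbours; it does not implement the look-ahead extension proposed in the paper. The function name `biased_lexrank`, the bias weight `d`, and the toy similarity and relevance values are assumptions made for illustration.

```python
def biased_lexrank(sim, relevance, d=0.85, iterations=100):
    """Score sentences with a random walk biased towards query-relevant ones.

    sim[s][v] is the similarity between sentences s and v (non-negative, with a
    non-zero diagonal assumed), relevance[s] is the query relevance of sentence
    s (at least one value must be positive), and d is the weight of the bias.
    """
    n = len(sim)
    total_relevance = sum(relevance)
    bias = [r / total_relevance for r in relevance]
    # Column sums normalise the similarity mass that each sentence gives away.
    column_sums = [sum(sim[z][v] for z in range(n)) for v in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            d * bias[s]
            + (1 - d) * sum(sim[s][v] / column_sums[v] * scores[v] for v in range(n))
            for s in range(n)
        ]
    return scores

# Toy example: sentence 0 matches the query, sentence 2 is only similar to it.
sim = [
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.1],
    [0.6, 0.1, 1.0],
]
relevance = [0.9, 0.1, 0.0]
scores = biased_lexrank(sim, relevance)
print(scores)
```

Sentence 2 has no query relevance of its own yet still receives a positive score through its similarity to sentence 0, which is the kind of indirect relevance the look-ahead variant aims to exploit more directly.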
Abstract:
The dissertation deals with Investment Law. It seeks to identify the most important principles of Investment Law and to examine how the discipline has been applied to the energy sector. As a case study, the Brazilian biofuels market is analysed. The first part covers the history and development of Investment Law, showing the changes that have occurred over time and presenting the main questions and trends that emerged along the way. The second chapter deals with the principles of Investment Law, their applications, and some controversies about their application. The third chapter addresses Investment Law in energy matters, highlighting definitions, current trends, and cases relevant to International Energy Law that are closely related to the activities of foreign investors. Finally, the last chapter analyses the Brazilian biofuels market through the concepts developed in the previous chapters, focusing on issues related to investors and on the extent to which government regulation can be considered adequate in light of Investment Law.
Abstract:
This work analyses, in the context of regionalisation and management of tourist space, two processes inherent to the Tourism Regionalisation Programme (Programa de Regionalização do Turismo, PRT): itinerary design (roteirização) and participatory management, with the Marajó hub (pólo Marajó) as the research site. The text sets out the theoretical foundations of tourist space, based on the thinking of BOULLÓN (1985), and links this approach to the formulations found in growth pole theory (PERROUX, 1967), in aménagement du territoire (ANDRADE, 1971, 1987), in innovation theory (PORTER, 1998) and in local development (Almeida, 2002, Barqueiro, 2001, Buarque, 2002, Rodrigues, 1977). It analyses recent national public policies, namely the National Programme for the Municipalisation of Tourism (Programa Nacional de Municipalização do Turismo, PNMT) and the PRT, as well as the Tourism Development Plan of the State of Pará, identifying how these technical documents orient the tourism development process. To achieve the proposed objectives, guiding questions were formulated for the analysis of the object of study with regard to itinerary design and participatory management. Next, a stage of bibliographic and documentary research was carried out, which allowed the theoretical framework to be built. This was complemented by fieldwork with interviews, observations, and information gathering in the study area. The results of this process were systematised on the basis of an interpretation of the information obtained in desk and field work. Once consolidated, this information was organised as an analytical text, supported by maps, photos, a chart and tables. Considering the two main objectives (analysis of itinerary design and of participatory management in the Marajó hub), it was observed that the itinerary design work in the village of Pesqueiro concentrated basically on two of the four components of tourist space: the raw material and the superstructure.
The other components (tourist facilities and infrastructure) did not receive adequate attention. These facts, combined with factors external to the local community, such as the issue of transport to Marajó and the role of the inbound tour operators in Belém, who have great influence over the marketing of the itineraries, hindered the sale of the product. As for participatory management, two management committees were set up, one in Soure and the other in Salvaterra, dissociated from the process of constituting the Regional Tourism Forum of the Marajó Hub. In addition, the structure of the Forum did not follow the principle of equity in terms of the representation of actors, and the design of regional tourism governance took place after the itinerary design process of the Amazônia do Marajó. These elements indicate that participatory management is not yet practised, in concrete terms, in the Marajó hub.
Abstract:
The paper presents the main results of an ongoing project aimed at the development of technologies for digitization of Bulgarian folk music and building a heterogeneous digital library with Bulgarian folk songs presented with their music, notes and text. The initial digitization and preservation of this Bulgarian cultural heritage begins with the digitization and insertion into the library of over 1000 songs that were recorded and written down during the 1960s and 1970s. We also present a full-text search engine for a collection of lyrics (text of songs) and coded notes (symbolic melody). Some perspectives for future projects are also discussed.