SIGIR is the major international forum for the presentation of new research results and the demonstration of new systems and techniques in the field of information retrieval.
The following articles are from "Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval":
Query-expansion is an effective Relevance Feedback technique for improving performance in Information Retrieval. In general query-expansion methods select terms from the complete contents of relevant documents. One problem with this approach is that expansion terms unrelated to document relevance can be introduced into the modified query due to their presence in the relevant documents and distribution in the document collection. Motivated by the hypothesis that query-expansion terms should only be sought from the most relevant areas of a document, this investigation explores the use of document summaries in query-expansion. The investigation explores the use of both context-independent standard summaries and query-biased summaries. Experimental results using the Okapi BM25 probabilistic retrieval model with the TREC-8 ad hoc retrieval task show that query-expansion using document summaries can be considerably more effective than using full-document expansion. The paper also presents a novel approach to term-selection that separates the choice of relevant documents from the selection of a pool of potential expansion terms. Again, this technique is shown to be more effective that standard methods.
We discuss technology to help a person monitor changes in news coverage over time. We define temporal summaries of news stories as extracting a single sentence from each event within a news topic, where the stories are presented one at a time and sentences from a story must be ranked before the next story can be considered. We explain a method for evaluation, and describe an evaluation corpus that we have built. We also propose several methods for constructing temporal summaries and evaluate their effectiveness in comparison to degenerate cases. We show that simple approaches are effective, but that the problem is far from solved.
This work proposes and evaluates a probabilistic cross-lingual retrieval system. The system uses a generative model to estimate the probability that a document in one language is relevant, given a query in another language. An important component of the model is translation probabilities from terms in documents to terms in a query. Our approach is evaluated when 1) the only resource is a manually generated bilingual word list, 2) the only resource is a parallel corpus, and 3) both resources are combined in a mixture model. The combined resources produce about 90% of monolingual performance in retrieving Chinese documents. For Spanish the system achieves 85% of monolingual performance using only a pseudo-parallel Spanish-English corpus. Retrieval results are comparable with those of the structural query translation technique (Pirkola, 1998) when bilingual lexicons are used for query translation. When parallel texts in addition to conventional lexicons are used, it achieves better retrieval results but requires more computation than the structural query translation technique. It also produces slightly better results than using a machine translation system for CLIR, but the improvement over the MT system is not significant.
We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonomy and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.
We explore the relation between classical probabilistic models of information retrieval and the emerging language modeling approaches. It has long been recognized that the primary obstacle to effective performance of classical models is the need to estimate a relevance model: probabilities of words in the relevant class. We propose a novel technique for estimating these probabilities using the query alone. We demonstrate that our technique can produce highly accurate relevance models, addressing important notions of synonymy and polysemy. Our experiments show relevance models outperforming baseline language modeling systems on TREC retrieval and TDT tracking tasks. The main contribution of this work is an effective formal method for estimating a relevance model with no training data.
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of a SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?
Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the test best, three common thresholding methods were investigated, including rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut); in addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the ``optimal'' strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly.
We describe a text categorization approach that is based on a combination of feature distributional clusters with a support vector machine (SVM) classifier. Our feature selection approach employs distributional clustering of words via the recently introduced information bottleneck method, which generates a more efficient word-cluster representation of documents. Combined with the classification power of an SVM, this method yields high performance text categorization that can outperform other recent methods in terms of categorization accuracy and representation efficiency. Comparing the accuracy of our method with other techniques, we observe significant dependency of the results on the data set. We discuss the potential reasons for this dependency.
We consider the problem of creating document representations in which inter-document similarity measurements correspond to semantic similarity. We first present a novel subspace-based framework for formalizing this task. Using this framework, we derive a new analysis of Latent Semantic Indexing(LSI), showing a precise relationship between its performance and the uniformityof the underlying distribution of documents over topics. This analysis helps explain the improvements gained by Ando's (2000) Iterative Residual Rescaling (IRR) algorithm: IRR can compensate for distributional non-uniformity. A further benefit of our framework is that it provides a well-motivated, effective method for automatically determining the rescaling factor IRR depends on, leading to further improvements. A series of experiments over various settings and with several evaluation metrics validates our claims.
The emergence of XML as a standard interchange format for structured documents/data has given rise to many XML query language proposals. However, some of these languages do not support information retrieval-style ranked queries based on textual similarity. There have been several extensions to these query languages to support keyword search, but the resulting query languages cannot express queries such as``find books and CDs with similar titles''. Either these extensions use keywords as mere boolean filters, or similarities can be calculated only between data values and constants rather than two data values. We propose ELIXIR, an expressive and efficient language for XML information retrieval that extends the query language XML-QL [6],[7] with a textual similarity operator. ELIXIR is a general-purpose XML information retrieval language, sufficiently expressive to handle the above query. Our algorithm for answering ELIXIR queries rewrites the original ELIXIR query into a series of XML-QL queries that generate intermediate relational data, and uses relational database techniques to efficiently evaluate the similarity operators on this intermediate data, yielding an XML document with nodes ranked by similarity. Our experiments demonstrate that our prototype scales well with the size of the XML data and complexity of the query.
In this paper we report on a series of experiments designed to investigate query modification techniques motivated by the area of abductive reasoning. In particular we use the notion of abductive explanation, explanations being a description of data that highlight important features of the data. We describe several methods of creating abductive explanations, exploring term reweighting and query reformulation techniques and demonstrate their suitability for relevance feedback.
In this paper, we propose two generic text summarization methods that create text summaries by ranking and extracting sentences from the original documents. The first method uses standard IR methods to rank sentence relevances, while the second method uses the latent semantic analysis technique to identify semantically important sentences, for summary creations. Both methods strive to select sentences that are highly ranked and different from each other. This is an attempt to create a summary with a wider coverage of the document's main content and less redundancy. Performance evaluations on the two summarization methods are conducted by comparing their summarization outputs with the manual summaries generated by three independent human evaluators. The evaluations also study the influence of different VSM weighting schemes on the text summarization performances. Finally, the causes of the large disparities in the evaluators' manual summarization results are investigated, and discussions on human text summarization patterns are presented.
This paper examines the use of generic summaries for indexing in information retrieval. Our main observations are that: (1) With or without pseudo-relevance feedback, a summary index may be as effective as the corresponding fulltext index for precision-oriented search of highly relevant documents. But a
Automatic summarization of open domain spoken dialogues is a new research area. This paper introduces the task, the challenges involved, and presents an approach to obtain automatic extract summaries for multi-party dialogues of four different genres, without any restriction on domain. We address the following issues which are intrinsic to spoken dialogue summarization and typically can be ignored when summarizing written text such as newswire data: (i) detection and removal of speech disfluencies; (ii) detection and insertion of sentence boundaries; (iii) detection and linking of cross-speaker information units (question-answer pairs). A global system evaluation using a corpus of 23 relevance annotated dialogues containing 80 topical segments shows that for the two more informal genres, our summarization system using dialogue specific components significantly outperforms a baseline using TFIDF term weighting with maximum marginal relevance ranking (MMR).
Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem gets especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
Typically, commercial Web search engines provide very little feedback to the user concerning how a particular query is processed and interpreted. Specifically, they apply key query transformations without the users knowledge. Although these transformations have a pronounced effect on query results, users have very few resources for recognizing their existence and understanding their practical importance. We conducted a user study to gain a better understanding of users knowledge of and reactions to the operation of several query transformations that web search engines automatically employ. Additionally, we developed and evaluated Transparent Queries, a software system designed to provide users with lightweight feedback about opaque query transformations. The results of the study suggest that users do indeed have difficulties understanding the operation of query transformations without additional assistance. Finally, although transparency is helpful and valuable, interfaces that allow direct control of query transformations might ultimately be more helpful for end-users.
Much system-oriented evaluation of information retrieval systems has used the Cranfield approach based upon queries run against test collections in a batch mode. Some researchers have questioned whether this approach can be applied to the real world, but little data exists for or against that assertion. We have studied this question in the context of the TREC Interactive Track. Previous results demonstrated that improved performance as measured by relevance-based metrics in batch studies did not correspond with the results of outcomes based on real user searching tasks. The experiments in this paper analyzed those results to determine why this occurred. Our assessment showed that while the queries entered by real users into systems yielding better results in batch studies gave comparable gains in ranking of relevant documents for those users, they did not translate into better performance on specific tasks. This was most likely due to users being able to adequately find and utilize relevant documents ranked further down the output list.
Content Based Image Retrieval (CBIR) presents special challenges in terms of how image data is indexed, accessed, and how end systems are evaluated. This paper discusses the design of a CBIR system that uses global colour as the primary indexing key, and a user centered evaluation of the systems visual search tools. The results indicate that users are able to make use of a range of visual search tools, and that different tools are used at different points in the search process. The results also show that the provision of a structured navigation and browsing tool can support image retrieval, particularly in situations in which the user does not have a target image in mind. The results are discussed in terms of their implications for the design of visual search tools, and their implications for the use of user-centered evaluation for CBIR systems.
Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
Link-based ranking methods have been described in the literature and applied in commercial Web search engines. However, according to recent TREC experiments, they are no better than traditional content-based methods. We conduct a different type of experiment, in which the task is to find the main entry point of a specific Web site. In our experiments, ranking based on link anchor text is twice as effective as ranking based on document content, even though both methods used the same BM25 formula. We obtained these results using two sets of 100 queries on a 18.5 million document set and another set of 100 on a 0.4 million document set. This site finding effectiveness begins to explain why many search engines have adopted link methods. It also opens a rich new area for effectiveness improvement, where traditional methods fail.
The Kleinberg HITS and the Google PageRank algorithms are eigenvector methods for identifying ``authoritative'' or ``influential'' articles, given hyperlink or citation information. That such algorithms should give reliable or consistent answers is surely a desideratum, and in [10], we analyzed when they can be expected to give stable rankings under small perturbations to the linkage patterns. In this paper, we extend the analysis and show how it gives insight into ways of designing stable link analysis methods. This in turn motivates two new algorithms, whose performance we study empirically using citation data and web hyperlink data.
The paper presents a novel approach to unsupervised text summarization. The novelty lies in exploiting the diversity of concepts in text for summarization, which has not received much attention in the summarization literature. A diversity-based approach here is a principled generalization of Maximal Marginal Relevance criterion by Carbonell and Goldstein. We propose, in addition, an information-centric approach to evaluation, where the quality of summaries is judged not in terms of how well they match human-created summaries but in terms of how well they represent their source documents in IR tasks such document retrieval and text categorization. To find the effectiveness of our approach under the proposed evaluation scheme, we set out to examine how a system with the diversity functionality performs against one without, using the BMIR-J2 corpus, a test data developed by a Japanese research consortium. The results demonstrate a clear superiority of a diversity based approach to a non-diversity based approach.
In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that the score distributions on a per query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data for not only probabilistic search engines like INQUERY but also vector space search engines like SMART for English. We have also used this model to fit the output of other search engines like LSI search engines and search engines indexing other languages like Chinese. It is then shown that given a query for which relevance information is not available, a mixture model consisting of an exponential and a normal distribution can be fitted to the score distribution. These distributions can be used to map the scores of a search engine to probabilities. We also discuss how the shape of the score distributions arise given certain assumptions about word distributions in documents. We hypothesize that all 'good' text search engines operating on any language have similar characteristics. This model has many possible applications. For example, the outputs of different search engines can be combined by averaging the probabilities (optimal if the search engines are independent) or by using the probabilities to select the best engine for each query. Results show that the technique performs as well as the best current combination techniques.
Given the ranked lists of documents returned by multiple search engines in response to a given query, the problem of metasearch is to combine these lists in a way which optimizes the performance of the combination. This paper makes three contributions to the problem of metasearch: (1) We describe and investigate a metasearch model based on an optimal democratic voting procedure, the Borda Count; (2) we describe and investigate a metasearch model based on Bayesian inference; and (3) we describe and investigate a model for obtaining upper bounds on the performance of metasearch algorithms. Our experimental results show that metasearch algorithms based on the Borda and Bayesian models usually outperform the best input system and are competitive with, and often outperform, existing metasearch strategies. Finally, our initial upper bounds demonstrate that there is much to learn about the limits of the performance of metasearch.
The thresholding of document scores has proved critical for the effectiveness of classification tasks. We review the most important approaches to thresholding, and introduce thescore-distributional (S-D) threshold optimization method. The method is based on score distributions and is capable of optimizing any effectiveness measure defined in terms of the traditional contingency table. As a byproduct, we provide a model for score distributions, and demonstrate its high accuracy in describing empirical data. The estimation method can be performed incrementally, a highly desirable feature for adaptive environments. Our work in modeling score distributions is useful beyond threshold optimization problems. It directly applies to other retrieval environments that make use of score distributions, e.g., distributed retrieval, or topic detection and tracking. The most accurate version of S-D thresholding -- although incremental -- can be computationally heavy. Therefore, we also investigate more practical solutions. We suggest practical approximations and discuss adaptivity, threshold initialization, and incrementality issues. The practical version of S-D thresholding has been tested in the context of the TREC-9 Filtering Track and found to be very effective [2].
Information filtering systems based on statistical retrieval models usually compute a numeric score indicating how well each document matches each profile. Documents with scores above profile-specific dissemination threshold sare delivered. An optimal dissemination threshold is one that maximizes a given utility function based on the distributions of the scores of relevant and non-relevant documents. The parameters of the distribution can be estimated using relevance information, but relevance information obtained while filtering is biased. This paper presents a new method of adjusting dissemination thresholds that explicitly models and compensates for this bias. The new algorithm, which is based on the Maximum Likelihood principle, jointly estimates the parameters of the density distributions for relevant and non-relevant documents and the ratio of the relevant document in the corpus. Experiments with TREC-8 and TREC-9 Filtering Track data demonstrate the effectiveness of the algorithm.
We investigate a meta-model approach, called Meta-learning Using Document Feature characteristics (MUDOF), for the task of automatic textual document categorization. It employs a meta-learning phase using document feature characteristics. Document feature characteristics, derived from the training document set, capture some inherent category-specific properties of a particular category. Different from existing categorization methods, MUDOF can automatically recommend a suitable algorithm for each category based on the category-specific statistical characteristics. Hence, different algorithms may be employed for different categories. Experiments have been conducted on a real-world document collection demonstrating the effectiveness of our approach. The results confirm that our meta-model approach can exploit the advantage of its component algorithms, and demonstrate a better performance than existing algorithms.
We investigate important differences between two styles of document clustering in the context of Topic Detection and Tracking. Converting a Topic Detection system into a Topic Tracking system exposes fundamental differences between these two tasks that are important to consider in both the design and the evaluation of TDT systems. We also identify features that can be used in systems for both tasks.
In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective indications of the importance of a time-sensitive document, for the purpose of producing better document filtering or ranking. By prospective, we mean importance that could be assessed by actions that occur in the future. For example, a news story may be assessed (retrospectively) as being important, based on events that occurred after the story appeared, such as a stock price plummeting or the issuance of many follow-up stories. If a system could anticipate (prospectively) such occurrences, it could provide a timely indication of importance. Clearly, perfect prescience is impossible. However, sometimes there is sufficient correlation between the content of an information item and the events that occur subsequently. We describe a process for creating and evaluating approximate information-triage procedures that are based on prospective indications. Unlike many information-retrieval applications for which document labeling is a laborious, manual process, for many prospective criteria it is possible to build very large, labeled, training corpora automatically. Such corpora can be used to train text classification procedures that will predict the (prospective) importance of each document. This paper illustrates the process with two case studies, demonstrating the ability to predict whether a news story will be followed by many, very similar news stories, and also whether the stock price of one or more companies associated with a news story will move significantly following the appearance of that story. We conclude by discussing how the comprehensibility of the learned classifiers can be critical to success.
This paper presents an informational inference mechanism realized via the use of a high dimensional conceptual space. More specifically, we claim to have operationalized important aspects of Gardenforss recent three-level cognitive model. The connectionist level is primed with the Hyperspace Analogue to Language (HAL) algorithm which produces vector representations for use at the conceptual level. We show how inference at the symbolic level can be implemented by employing Barwise and Seligmans theory of information flow. This article also features heuristics for enhancing HAL-based representations via the use of quality properties, determining concept inclusion and computing concept composition. The worth of these heuristics in underpinning informational inference are demonstrated via a series of experiments. These experiments, though small in scale, show that informational inference proposed in this article has a very different character to the semantic associations produced by the Minkowski distance metric and concept similarity computed via the cosine coefficient. In short, informational inference generally uncovers concepts that are carried, or, in some cases, implied by another concept, (or combination of concepts).
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
We present a novel probabilistic method for topic segmentation on unstructured text. One previous approach to this problem utilizes the hidden Markov model (HMM) method for probabilistically modeling sequence data [7]. The HMM treats a document as mutually independent sets of words generated by a latent topic variable in a time series. We extend this idea by embedding Hofmann's aspect model for text [5] into the segmenting HMM to form an aspect HMM (AHMM). In doing so, we provide an intuitive topical dependency between words and a cohesive segmentation model. We apply this method to segment unbroken streams of New York Times articles as well as noisy transcripts of radio programs on SpeechBot, an online audio archive indexed by an automatic speech recognition engine. We provide experimental comparisons which show that the AHMM outperforms the HMM for this task.
Hierarchies have long been used for organization, summarization, and access to information. In this paper we define summarization in terms of a probabilistic language model and use the definition to explore a new technique for automatically generating topic hierarchies by applying a graph-theoretic algorithm, which is an approximation of the Dominating Set Problem. The algorithm efficiently chooses terms according to a language model. We compare the new technique to previous methods proposed for constructing topic hierarchies including subsumption and lexical hierarchies, as well as the top TF.IDF terms. Our results show that the new technique consistently performs as well as or better than these other techniques. They also show the usefulness of hierarchies compared with a list of terms.
In this paper we present the features of a Question/Answering (Q/A) system that had unparalleled performance in the TREC-9 evaluations. We explain the accuracy of our system through the unique characteristics of its architecture: (1) usage of a wide-coverage answer type taxonomy; (2) repeated passage retrieval; (3) lexico-semantic feedback loops; (4) extraction of the answers based on machine learning techniques; and (5) answer caching. Experimental results show the effects of each feature on the overall performance of the Q/A system and lead to general conclusions about Q/A from large text collections.
There are many tasks that require information finding. Some can be largely automated, and others greatly benefit from successful interaction between system and searcher. We are interested in the task of answering questions where some synthesis of information is required-the answer would not generally be given from a single passage of a single document. We investigate whether variation in the way a list of documents is delivered affected searcher performance in the question answering task. We will show that there is a significant difference in performance using a list customized to the task type, compared with a standard web-engine list. This indicates that paying attention to the task and the searcher interaction may provide substantial improvement in task performance.
This paper presents a novel information retrieval system that includes 1) the addition of concepts to facilitate the identification of the correct word sense, 2) a natural language query interface, 3) the inclusion of weights and penalties for proper nouns that build upon the Okapi weighting scheme, and 4) a term clustering technique that exploits the spatial proximity of search terms in a document to further improve the performance. The effectiveness of the system is validated by experimental results.
We investigate the performance of metasearch algorithms in terms of how much they improve consistency. We find that three different metasearch algorithms, each over three datasets, usually improve the consistency of search results; sometimes the improvement is dramatic. Furthermore, consistency tends to improve when performance improves.
This paper presents an approach to automatically extracting the bilingual translations of many Web query terms through mining the Web anchor texts. Some preliminary experiments are conducted on using 109,416 Web pages containing both Chinese and English anchor texts in their in-links to extract Chinese translations of 200 English queries selected from popular query terms in Taiwan. It is found that the effective translations of 75% of the popular query terms can be extracted, in which 87.2% cannot be obtained in common translation dictionaries.
Many problems are difficult to adequately explore until a prototype exists in order to elicit user feedback. One such problem is a system that automatically categorizes and manages email. Due to a myriad of user interface issues, a prototype is necessary to determine what techniques and technologies are effective in the email domain. This paper describes the implementation of an add-in for Microsoft Outlook 2000 TM that intends to address two problems with email: 1) help manage the inbox by automatically classifying email based on user folders, and 2) to aid in search and retrieval by providing a list of email relevant to the selected item. This add-in represents a first step in an experimental system for the study of other issues related to information management. The system has been set up to allow experimentation with other classification algorithms and the source code is available online in an effort to promote further experimentation.
Previous research has shown that citations and hypertext links can be usefully combined with document content to improve retrieval. Links can be used in many ways, e.g., link topology can be used to identify important pages, anchor text can be used to augment the text of cited pages, and activation can be spread to linked pages. This paper introduces a probabilistic model that integrates content matching and these three uses of link information in a single unified framework. Experiments with a web collection show benefits for link information especially for general queries.
algorithm for the segmentation of an audio/video source into topically cohesive segments based on automatic speech recognition (ASR) transcriptions is presented. A novel two-pass algorithm is described that combines a boundary-based method with a content-based method. In the first pass, the temporal proximity and the rate of arrival of ngram features is analyzed in order to compute an initial segmentation. In the content- based second pass, changes in content-bearing words are detected by using the ngram features as queries in an information-retrieval system. The second pass validates the initial segments and merges them as needed. Feasibility of the segmentation task can vary enormously depending on the structure of the audio content, and the accuracy of ASR. For real-world corporate training data our method identifies, at worst, a single salient segment of the audio and, at best, a high-level table-of-contents. We illustrate the algorithm in detail with some examples and validate the results with segmentation boundaries generated manually.
A sentence extract summary of a document is a subset of the document's sentences that contains the main ideas in the document. We present an approach to generating such summaries, a hidden Markov model that judges the likelihood that each sentence should be contained in the summary. We compare the results of this method with summaries generated by humans, showing that we obtain significantly higher agreement than do earlier methods.
We present performance measurement results for a parallel SQL based information retrieval system implemented on a PC cluster system. We used the Web-TREC dataset under a left-deep query execution plan. We achieved satisfactory speed up.
Topic segmentation is an important initial step in many text-based tasks. A hierarchical representation of a texts topics is useful in retrieval and allows judging relevancy at different levels of detail. This short paper describes research on generic algorithms for topic detection and segmentation that are applicable on texts of heterogeneous types and domains.
A method of assisting a user in finding the required documents effectively is proposed. A user being informed which documents are worth examining can browse in a digital library (DL) in a linear fashion. Computational evaluations were carried out, and a DL and its navigator are designed and constructed.
We introduce static index pruning methods that significantly reduce the index size in information retrieval systems. We investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. In uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. In term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. We give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. We present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index.
Many web pages have implicit structure. In this paper, we show the feasibility of automatically extracting data from web pages by using approximate matching techniques. This can be applied to generate automatic wrappers or to notify/display web page differences, web page change monitoring, etc.
The Web is a valuable source of language specific resources but collecting, organizing and utilizing this information is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries to collect documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant and a subset of documents is used to generate new queries. We experiment with various query-generation methods and query-lengths to find inclusion/exclusion terms that are helpful for finding documents in the target language and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Applying the same approach to multiple languages show that our system generalizes to a variety of languages.
An ideal retrieval system should retrieve images that satisfy the user's need, and should, therefore, measure image similarity in a manner consistent with human's perception. However, existing computational similarity measures are not perceptually consistent. This paper proposes an approach of improving retrieval performance by improving the perceptual consistency of computational similarity measures for textures based on relevance feedback judgments.
This paper introduces a personalization method for image retrieval based on the learning of personal preferences. The proposed system indexes objects based on shape and groups them into a set of clusters, or prototypes. Our personalization method refines corresponding prototypes from objects provided by the user in the foreground, and simultaneously adapts the database index in the background.
Web document hierarchical classification approaches often rely on textual features alone even though web pages include multimedia data. We propose a new hierarchical integrated web classification approach that combines image-based and text-based approaches. Instead of using a flat classifier to combine text and image classification, we perform classification on a hierarchy differently on different levels of the tree, using text for branches and images only at leaves. The results of our experiments show that the use of the hierarchical structure improved web document classification performance significantly.
Query clustering is crucial for automatically discovering frequently asked queries (FAQs) or most popular topics on a question-answering search engine. Due to the short length of queries, the traditional approaches based on keywords are not suitable for query clustering. This paper describes our attempt to cluster similar queries according to their contents as well as the document click information in the user logs.
In this demonstration, we present interoperable and personalized search services for the Networked Digital Library of Theses and Dissertations (NDLTD). Using standard protocols and software, including those specified by the Open Archives Initiative (OAI), distributed sites can share metadata easily. On top of these harvesting protocols, we implement a union collection of theses managed by the MARIAN digital library system. Our demonstration covers aspects of NDLTD, OAI, and MARIAN.
This demonstration will show describe the construction and application of Cross-Domain Information Servers using features of the standard Z39.50 information retrieval protocol[Z39.50]. The system is currently being used to build and search distributed indexes for databases with disparate structured data (SGML and XML). We use the Z39.50 Explain Database to determine the databases and indexes of a given server, then use the Z39.50 SCAN facility to extract the contents of the indexes. This information is used to build collection documents that can be retrieved using probabilistic retrieval algorithms.
In 1999 a directed query distributed search engine was integrated into a new Department of Energy Virtual Library of Energy Science and Technology. Millions of pages of government information across multiple agencies were made immediately searchable via one query, setting the stage for the development of a variety of interagency initiatives and applications.
MS Read is a prototype application implemented as an extension of the Web Browser that creates an evolving model of the users topic of interest. It uses that model to analyze documents that are accessed while searching and browsing the Web. In the presented version of MS Read the model is used to highlight topic related terminology in the documents. MS Read model of the user need is created by applying natural language processing to search queries captured within the Browser and to topic descriptions explicitly provided by the user while browsing and reading documents. It is semantically enhanced using linguistic and custom knowledge resources.
Automatic albuming -- the automatic organization of photographs, either as an end in itself or for use in other applications -- is an application that promises to be of great assistance to photographers. Relatively sophisticated image content analysis techniques have been used for image indexing, organization and retrieval. In this paper, we describe a method of organizing photographs into events using spoken photograph captions. The results of this process can be used to improve image indexing and retrieval.
The most prevalent experimental methodology for comparing the effectiveness of information retrieval systems requires a test collection, composed of a set of documents, a set of query topics, and a set of relevance judgments indicating which documents are relevant to which topics. It is well known that relevance judgments are not infallible, but recent retrospective investigation into results from the Text REtrieval Conference (TREC) has shown that differences in human judgments of relevance do not affect the relative measured performance of retrieval systems. Based on this result, we propose and describe the initial results of a new evaluation methodology which replaces human relevance judgments with a randomly selected mapping of documents to topics which we refer to as pseudo-relevance judgments. Rankings of systems with our methodology correlate positively with official TREC rankings, although the performance of the top systems is not predicted well. The correlations are stable over a variety of pool depths and sampling techniques. With improvements, such a methodology could be useful in evaluating systems such as World-Wide Web search engines, where the set of documents changes too often to make traditional collection construction techniques practical.
In this paper, we present a method that can automatically evaluate performance of different term weighting schemes in information retrieval without resorting to precision-recall based on human relevance judgments. Specifically, the problem is: given two document-term matrixes generated from two different term weighting schemes, can we tell which term weighting scheme will performance better than the other? We propose a meta-scoring function, which takes as input the document-term matrix generated by some term weighting scheme and computes a goodness score from the document-term matrix. In our experiments, we found out that this score is highly correlated with the precision-recall measurement for all the collections and term weighting schema we tried. Thus, we conclude that our meta-scoring function can be a substitute for the precision-recall measurement that needs relevance judgments of human subject. Furthermore, this meta-scoring function is not limited only to text information retrieval can be applied to fields such as image and DNA retrieval.
Most approaches to cross language information retrieval assume that resources providing a direct translation between the query and document languages exist. This paper presents research examining the situation where such an assumption is false. Here, an intermediate (or pivot) language provides a means of transitive translation of the query language to that of the document via the pivot, at the cost, however, of introducing much error. The paper reports the novel approach of translating in parallel across multiple intermediate languages and fusing the results. Such a technique removes the error, raising the effectiveness of the tested retrieval system, up to and possibly above the level expected, had a direct translation route existed. Across a number of retrieval situations and combinations of languages, the approach proves to be highly effective.
Dictionaries have often been used for query translation in cross-language information retrieval (CLIR). However, we are faced with the problem of translation ambiguity, i.e. multiple translations are stored in a dictionary for a word. In addition, a word-by-word query translation is not precise enough. In this paper, we explore several methods to improve the previous dictionary-based query translation. First, as many as possible, noun phrases are recognized and translated as a whole by using statistical models and phrase translation patterns. Second, the best word translations are selected based on the cohesion of the translation words. Our experimental results on TREC English-Chinese CLIR collection show that these techniques result in significant improvements over the simple dictionary approaches, and achieve even better performance than a high-quality machine translation system.
Give us your opinion! Do you have any comments/additions that you would like other visitors to see?
Yousay:
Mar 20th, 2010
#1
Be the first to add a thoughtful note to this page !
Changes to this page (conference)
24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was added to the bibliography
Computer programs emerge as the outcome of complex human processes of cognition, communication and negotiation, which serve to establish the meaningful embedding of the computer system in its intended use context.
Page maintainer: The Editorial Team How to cite/reference this page
URL: http://www.interaction-design.org/references/conferences/proceedings_of_the_24th_annual_international_acm_sigir_conference_on_research_and_development_in_information_retrieval.html