SIGIR is the major international forum for the presentation of new research results and the demonstration of new systems and techniques in the field of information retrieval.
The following articles are from "Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval":
This acceptance talk is a curious mixture of personal history and developing ideas in the context of the growing field of IR covering several decades. I want to concentrate on models and theories, interpreted loosely, and try and give an insight into where I have got to in my thinking, where the ideas came from, and where I believe we are going. In the last few years I have been working on the development of what might be coined as a design language for IR. It takes its inspiration from Quantum Mechanics, but by analogy only. The mathematical objects represent documents; these objects might be vectors (or density operators) in an n-dimensional vector space (usually a Hilbert space). A request for information, or a query, is taken as an observable and is represented as a linear operator on the space. Linear operators can be expressed as matrices. Such an operator, Hermitian, has a set of eigenvectors forming a basis for the space; which we interpret as a point of view or perspective from which to understand the space. Thus any document-vector can be located with respect to the basis, and we can calculate an inner product between such a vector and any basis vector, which may be interpreted as a probability of relevance. The probability of observing any given eigenvector is now given by the square of that inner product assuming all vectors are normalised. Hence we connect the probability of observation to the geometry of the space. Furthermore, the subspaces of the space make up a lattice structure which is equivalent to a logic. This makes up the entire mathematical structure, and the language for handling this structure is linear algebra: vectors, matrices, projections, inner-products, neatly captured by the Dirac notation used in quantum mechanics. Our probability is slightly different from classical probability, the same for logic; we end up with quantum logic and quantum probability. A commitment to this kind of mathematical structure, with which to model objects and processes in IR, depends on two critical assumptions.
Several recent studies have demonstrated that the type of improvements in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not translate into a benefit for users. Two of the studies used an instance recall task, and a third used a question answering task, so perhaps it is unsurprising that the precision based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on these tasks. In this study, we evaluate two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time that users need to find a single document that is relevant to a TREC topic; and, a simple recall-based task, represented by the total number of relevant documents that users can identify within five minutes. Users employ search engines with controlled mean average precision (MAP) of between 55% and
A common limitation of many retrieval models, including the recently proposed axiomatic approaches, is that retrieval scores are solely based on exact (i.e., syntactic) matching of terms in the queries and documents, without allowing distinct but semantically related terms to match each other and contribute to the retrieval score. In this paper, we show that semantic term matching can be naturally incorporated into the axiomatic retrieval model through defining the primitive weighting function based on a semantic similarity function of terms. We define several desirable retrieval constraints for semantic term matching and use such constraints to extend the axiomatic model to directly support semantic term matching based on the mutual information of terms computed on some document set. We show that such extension can be efficiently implemented as query expansion. Experiment results on several representative data sets show that, with mutual information computed over the documents in either the target collection for retrieval or an external collection such as the Web, our semantic expansion consistently and substantially improves retrieval accuracy over the baseline axiomatic retrieval model. As a pseudo feedback method, our method also outperforms a state-of-the-art language modeling feedback method.
We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
Web query classification (QC) aims to classify Web users' queries, which are often short and ambiguous, into a set of target categories. QC has many applications including page ranking in Web search, targeted advertisement in response to queries, and personalization. In this paper, we present a novel approach for QC that outperforms the winning solution of the ACM KDDCUP 2005 competition, whose objective is to classify 800,000 real user queries. In our approach, we first build a bridging classifier on an intermediate taxonomy in an offline mode. This classifier is then used in an online mode to map user queries to the target categories via the above intermediate taxonomy. A major innovation is that by leveraging the similarity distribution over the intermediate taxonomy, we do not need to retrain a new classifier for each new set of target categories, and therefore the bridging classifier needs to be trained only once. In addition, we introduce category selection as a new method for narrowing down the scope of the intermediate taxonomy based on which we classify the queries. Category selection can improve both efficiency and effectiveness of the online classification. By combining our algorithm with the winning solution of KDDCUP 2005, we made an improvement by 9.7% and 3.8% in terms of precision and F1 respectively compared with the best results of KDDCUP 2005.
Data fusion is the combination of the results of independent searches on a document collection into one single output result set. It has been shown in the past that this can greatly improve retrieval effectiveness over that of the individual results. This paper presents probFuse, a probabilistic approach to data fusion. ProbFuse assumes that the performance of the individual input systems on a number of training queries is indicative of their future performance. The fused result set is based on probabilities of relevance calculated during this training process. Retrieval experiments using data from the TREC ad hoc collection demonstrate that probFuse achieves results superior to that of the popular CombMNZ fusion algorithm.
We study the effect of user supplied relevance feedback in improving web search results. Rather than using query refinement or document similarity measures to rerank results, we show that the web-graph distance between two documents is a robust measure of their relative relevancy. We demonstrate how the use of this metric can improve the rankings of result URLs, even when the user only rates one document in the dataset. Our research suggests that such interactive systems can significantly improve search results.
Information retrieval algorithms leverage various collection statistics to improve performance. Because these statistics are often computed on a relatively small evaluation corpus, we believe using larger, non-evaluation corpora should improve performance. Specifically, we advocate incorporating external corpora based on language modeling. We refer to this process as external expansion. When compared to traditional pseudo-relevance feedback techniques, external expansion is more stable across topics and up to 10% more effective in terms of mean average precision. Our results show that using a high quality corpus that is comparable to the evaluation corpus can be as, if not more, effective than using the web. Our results also show that external expansion outperforms simulated relevance feedback. In addition, we propose a method for predicting the extent to which external expansion will improve retrieval performance. Our new measure demonstrates positive correlation with improvements in mean average precision.
Pseudo-relevance feedback has proven to be an effective strategy for improving retrieval accuracy in all retrieval models. However the performance of existing pseudo feedback methods is often affected significantly by some parameters, such as the number of feedback documents to use and the relative weight of original query terms; these parameters generally have to be set by trial-and-error without any guidance. In this paper, we present a more robust method for pseudo feedback based on statistical language models. Our main idea is to integrate the original query with feedback documents in a single probabilistic mixture model and regularize the estimation of the language model parameters in the model so that the information in the feedback documents can be gradually added to the original query. Unlike most existing feedback methods, our new method has no parameter to tune. Experiment results on two representative data sets show that the new method is significantly more robust than a state-of-the-art baseline language modeling approach for feedback with comparable or better retrieval accuracy.
Semantic smoothing, which incorporates synonym and sense information into the language models, is effective and potentially significant to improve retrieval performance. The implemented semantic smoothing models, such as the translation model which statistically maps document terms to query terms, and a number of works that have followed have shown good experimental results. However, these models are unable to incorporate contextual information. Thus, the resulting translation might be mixed and fairly general. To overcome this limitation, we propose a novel context-sensitive semantic smoothing method that decomposes a document or a query into a set of weighted context-sensitive topic signatures and then translate those topic signatures into query terms. In detail, we solve this problem through (1) choosing concept pairs as topic signatures and adopting an ontology-based approach to extract concept pairs; (2) estimating the translation model for each topic signature using the EM algorithm; and (3) expanding document and query models based on topic signature translations. The new smoothing method is evaluated on TREC 2004/05 Genomics Track collections and significant improvements are obtained. The MAP (mean average precision) achieves a 33.6% maximal gain over the simple language model, as well as a 7.8% gain over the language model with context-insensitive semantic smoothing.
The paper is concerned with applying learning to rank to document retrieval. Ranking SVM is a typical method of learning to rank. We point out that there are two factors one must consider when applying Ranking SVM, in general a "learning to rank" method, to document retrieval. First, correctly ranking documents on the top of the result list is crucial for an Information Retrieval system. One must conduct training in a way that such ranked results are accurate. Second, the number of relevant documents can vary from query to query. One must avoid training a model biased toward queries with a large number of relevant documents. Previously, when existing methods that include Ranking SVM were applied to document retrieval, none of the two factors was taken into consideration. We show it is possible to make modifications in conventional Ranking SVM, so it can be better used for document retrieval. Specifically, we modify the "Hinge Loss" function in Ranking SVM to deal with the problems described above. We employ two methods to conduct optimization on the loss function: gradient descent and quadratic programming. Experimental results show that our method, referred to as Ranking SVM for IR, can outperform the conventional Ranking SVM and other existing methods for document retrieval on two datasets.
We show that incorporating user behavior data can significantly improve ordering of top results in real web search setting. We examine alternatives for incorporating feedback into the ranking process and explore the contributions of user feedback compared to other common web search features. We report results of a large scale evaluation over 3,000 queries and 12 million user interactions with a popular web search engine. We show that incorporating implicit feedback can augment other features, improving the accuracy of a competitive web search ranking algorithms by as much as 31% relative to the original performance.
This paper presents a study of three statistical query translation models that use different units of translation. We begin with a review of a word-based translation model that uses co-occurrence statistics for resolving translation ambiguities. The translation selection problem is then formulated under the framework of graphic model resorting to which the modeling assumptions and limitations of the co-occurrence model are discussed, and the research of finding better translation units is motivated. Then, two other models that use larger, linguistically motivated translation units (i.e., noun phrase and dependency triple) are presented. For each model, the modeling and training methods are described in detail. All query translation models are evaluated using TREC collections. Results show that larger translation units lead to more specific models that usually achieve better translation and cross-language information retrieval results.
This paper introduces a general framework for the use of translation probabilities in cross-language information retrieval based on the notion that information retrieval fundamentally requires matching what the searcher means with what the author of a document meant. That perspective yields a computational formulation that provides a natural way of combining what have been known as query and document translation. Two well-recognized techniques are shown to be a special case of this model under restrictive assumptions. Cross-language search results are reported that are statistically indistinguishable from strong monolingual baselines for both French and Chinese documents.
The role of network structure has grown in significance over the past ten years in the field of information retrieval, stimulated to a great extent by the importance of link analysis in the development of Web search techniques [4]. This body of work has focused primarily on the network that is most clearly visible on the Web: the network of hyperlinks connecting documents to documents. But the Web has always contained a second network, less explicit but equally important, and this is the social network on its users, with latent person-to-person links encoding a variety of relationships including friendship, information exchange, and influence. Developments over the past few years -- including the emergence of social networking systems and rich social media, as well as the availability of large-scale e-mail and instant messenging datasets -- have highlighted the crucial role played by on-line social networks, and at the same time have made them much easier to uncover and analyze. There is now a considerable opportunity to exploit the information content inherent in these networks, and this prospect raises a number of interesting research challenge. Within this context, we focus on some recent efforts to formalize the problem of searching a social network. The goal is to capture the issues underlying a variety of related scenarios: a member of a social networking system such as MySpace seeks a piece of information that may be held by a friend of a friend [27, 28]; an employee in a large company searches his or her network of colleagues for expertise in a particular subject [9]; a node in a decentralized peer-to-peer file-sharing system queries for a file that is likely to be a small number of hops away [2, 6, 16, 17]; or a user in a distributed IR or federated search setting traverses a network of distributed resources connected by links that may not just be informational but also economic or contractual [3, 5, 7, 8, 13, 18, 21]. In their most basic forms, these scenarios have some essential features in common: a node in a network, without global knowledge, must find a short path to a desired "target" node (or to one of several possible target nodes). To frame the underlying problem, we go back to one of the most well-known pieces of empirical social network analysis -- Stanley Milgram's research into the small-world phenomenon, also known as the "six degrees of separation" [19, 24, 25]. The form of Milgram's experiments, in which randomly chosen starters had to forward a letter to a designated target individual, established not just that short chains connecting far-flung pairs of people are abundant in large social networks, but also that the individuals in these networks, operating with purely local information about their own friends and acquaintances, are able to actually find these chains [10]. The Milgram experiments thus constituted perhaps the earliest indication that large-scale social networks are structured to support this type of decentralized search. Within a family of random-graph models proposed by Watts and Strogatz [26], we have shown that the ability of a network to support this type of decentralized search depends in subtle ways on how its "long-range" connections are correlated with the underlying spatial or organizational structure in which it is embedded [10, 11]. Recent studies using data on communication within organizations [1] and the friendships within large on-line communities [15] have established the striking fact that real social networks closely match some of the structural features predicted by these mathematical models. If one looks further at the on-line settings that provide the initial motivation for these issues, there is clearly interest from many directions in their long-term economic implications -- essentially, the consequences that follow from viewing distributed information retrieval applications, peer-to-peer systems, or social-networking sites as providing marketplaces for information and services. How does the problem of decentralized search in a network change when the participants are not simply agents following a fixed algorithm, but strategic actors who make decisions in their own self-interest, and may demand compensation for taking part in a protocol? Such considerations bring us into the realm of algorithmic game theory, an active area of current research that uses game-theoretic notions to quantify the performance of systems in which the participants follow their own self-interest [20, 23] In a simple model for decentralized search in the presence of incentives, we find that performance depends crucially on both the rarity of the information and the richness of the network topology [12] -- if the network is too structurally impoverished, an enormous investment may be required to produce a path from a query to an answer.
We present a novel framework for answering complex questions that relies on question decomposition. Complex questions are decomposed by a procedure that operates on a Markov chain, by following a random walk on a bipartite graph of relations established between concepts related to the topic of a complex question and subquestions derived from topic-relevant passages that manifest these relations. Decomposed questions discovered during this random walk are then submitted to a state-of-the-art Question Answering (Q/A) system in order to retrieve a set of passages that can later be merged into a comprehensive answer by a Multi-Document Summarization (MDS) system. In our evaluations, we show that access to the decompositions generated using this method can significantly enhance the relevance and comprehensiveness of summary-length answers to complex questions.
New types of document collections are being developed by various web services. The service providers keep track of non-textual features such as click counts. In this paper, we present a framework to use non-textual features to predict the quality of documents. We also show our quality measure can be successfully incorporated into the language modeling-based retrieval model. We test our approach on a collection of question and answer pairs gathered from a community based question answering service where people ask and answer questions. Experimental results using our quality measure show a significant improvement over our baseline.
Co-occurrence data is quite common in many real applications. Latent Semantic Analysis (LSA) has been successfully used to identify semantic relations in such data. However, LSA can only handle a single co-occurrence relationship between two types of objects. In practical applications, there are many cases where multiple types of objects exist and any pair of these objects could have a pairwise co-occurrence relation. All these co-occurrence relations can be exploited to alleviate data sparseness or to represent objects more meaningfully. In this paper, we propose a novel algorithm, M-LSA, which conducts latent semantic analysis by incorporating all pairwise co-occurrences among multiple types of objects. Based on the mutual reinforcement principle, M-LSA identifies the most salient concepts among the co-occurrence data and represents all the objects in a unified semantic space. M-LSA is general and we show that several variants of LSA are special cases of our algorithm. Experiment results show that M-LSA outperforms LSA on multiple applications, including collaborative filtering, text clustering, and text categorization.
This paper studies the problem of identifying comparative sentences in text documents. The problem is related to but quite different from sentiment/opinion sentence identification or classification. Sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. An important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers' opinions on its products. Comparisons on the other hand can be subjective or objective. Furthermore, a comparison is not concerned with an object in isolation. Instead, it compares the object with others. An example opinion sentence is "the sound quality of CD player X is poor". An example comparative sentence is "the sound quality of CD player X is not as good as that of CD player Y". Clearly, these two sentences give different information. Their language constructs are quite different too. Identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. This paper proposes to study the comparative sentence identification problem. It first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. Experiment results using three types of documents, news articles, consumer reviews of products, and Internet forum postings, show a
Machine learning is the mainstay for text classification. However, even the most successful techniques are defeated by many real-world applications that have a strong time-varying component. To advance research on this challenging but important problem, we promote a natural, experimental framework-the Daily Classification Task-which can be applied to large time-based datasets, such as Reuters RCV1. In this paper we dissect concept drift into three main subtypes. We demonstrate via a novel visualization that the recurrent themes subtype is present in RCV1. This understanding led us to develop a new learning model that transfers induced knowledge through time to benefit future classifier learning tasks. The method avoids two main problems with existing work in inductive transfer: scalability and the risk of negative transfer. In empirical tests, it consistently showed more than 10 points F-measure improvement for each of four Reuters categories tested.
Standard Information Retrieval (IR) metrics assume a simple model where documents are understood as independent units. Such an assumption is not adapted to new paradigms like XML or Web IR where retrievable informations are parts of documents or sets of related documents. Moreover, classical hypotheses assumes that the user ignores the structural or logical context of document elements and hence the possibility of navigation between units. EPRUM is a generalisation of Precision-Recall (PR) that aims at allowing the user to navigate or browse in the corpus structure. Like the Cumulated Gain metrics, it is able to handle continuous valued relevance. We apply and compare EPRUM in the context of XML Retrieval -- a very active field for evaluation metrics. We also explain how EPRUM can be used in other IR paradigms.
Accurate estimation of information retrieval evaluation metrics such as average precision require large sets of relevance judgments. Building sets large enough for evaluation of real-world implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation. A new way of looking at average precision leads to a natural algorithm for selecting documents to judge and allows us to estimate the degree of confidence by defining a distribution over possible document judgments. A study with annotators shows that this method can be used by a small group of researchers to rank a set of systems in under three hours with 95% confidence. Information retrieval metrics such as average precision require large sets of relevance judgments to be accurately estimated. Building these sets is infeasible and often inefficient for many real-world retrieval implementations. We present a new way of looking at average precision that allows us to estimate the confidence in an evaluation based on the size of the test collection. We use this to build an algorithm for selecting the best documents to judge to have maximum confidence in an evaluation with a minimal number of relevance judgments. A study with annotators shows how the algorithm can be used by a small group of researchers to quickly rank a set of systems with 95% confidence.
Similarity measures for text have historically been an important tool for solving information retrieval problems. In many interesting settings, however, documents are often closely connected to other documents, as well as other non-textual objects: for instance, email messages are connected to other messages via header information. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk. We provide a detailed instantiation of this framework for email data, where content, social networks and a timeline are integrated in a structural graph. The suggested framework is evaluated for two email-related problems: disambiguating names in email documents, and threading. We show that reranking schemes based on the graph-walk similarity measures often outperform baseline methods, and that further improvements can be obtained by use of appropriate learning methods.
Existing methods for measuring the quality of search algorithms use a static collection of documents. A set of queries and a mapping from the queries to the relevant documents allow the experimenter to see how well different search engines or engine configurations retrieve the correct answers. This methodology assumes that the document set and thus the set of relevant documents are unchanging. In this paper, we abandon the static collection requirement. We begin with a recent TREC collection created from a web crawl and analyze how the documents in that collection have changed over time. We determine how decay of the document collection affects TREC systems, and present the results of an experiment using the decayed collection to measure a live web search system. We employ novel measures of search effectiveness that are robust despite incomplete relevance information. Lastly, we propose a methodology of "collection maintenance" which supports measuring search performance both for a single system and between systems run at different points in time.
Evaluating user preferences of web search results is crucial for search engine development, deployment, and maintenance. We present a real-world study of modeling the behavior of web search users to predict web search result preferences. Accurate modeling and interpretation of user behavior has important applications to ranking, click spam detection, web search personalization, and other tasks. Our key insight to improving robustness of interpreting implicit feedback is to model query-dependent deviations from the expected "noisy" user behavior. We show that our model of clickthrough interpretation improves prediction accuracy over state-of-the-art clickthrough methods. We generalize our approach to model user behavior beyond clickthrough, which results in higher preference prediction accuracy than models based on clickthrough information alone. We report results of a large-scale experimental evaluation that show substantial improvements over published implicit feedback interpretation methods.
This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques has some interest per se, and how different choices for the damping function impact on rank quality and on convergence speed. Even though our results suggest that PageRank can be approximated with other simpler forms of rankings that may be computed more efficiently, our focus is of more speculative nature, in that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our presentation includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Among other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to PageRank's using a fixed small number of iterations; comparisons were performed using Kendall's t on large domain datasets.
Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate that previous methods, and are more efficient in use of resources required to perform the estimation.
Combining the output from multiple retrieval sources over the same document collection is of great importance to a number of retrieval tasks such as multimedia retrieval, web retrieval and meta-search. To merge retrieval sources adaptively according to query topics, we propose a series of new approaches called probabilistic latent query analysis (pLQA), which can associate non-identical combination weights with latent classes underlying the query space. Compared with previous query independent and query-class based combination methods, the proposed approaches have the advantage of being able to discover latent query classes automatically without using prior human knowledge, to assign one query to a mixture of query classes, and to determine the number of query classes under a model selection principle. Experimental results on two retrieval tasks, i.e., multimedia retrieval and meta-search, demonstrate that the proposed methods can uncover sensible latent classes from training data, and can achieve considerable performance gains.
User modeling for information retrieval has mostly been studied to improve the effectiveness of information access in centralized repositories. In this paper we explore user modeling in the context of full-text federated search in peer-to-peer networks. Our approach models a user's persistent, long-term interests based on past queries, and uses the model to improve search efficiency for future queries that represent interests similar to past queries. Our approach also enables queries representing a user's transient, ad-hoc interests to be automatically recognized so that search for these queries can rely on a relatively large search radius to avoid sacrificing effectiveness for efficiency. Experimental results demonstrate that our approach can significantly improve the efficiency of full-text federated search without degrading its accuracy. Furthermore, the proposed approach does not require a large amount of training data, and is robust to a range of parameter values.
We present an adaptive distributed query-sampling framework that is quality-conscious for extracting high-quality text database samples. The framework divides the query-based sampling process into an initial seed sampling phase and a quality-aware iterative sampling phase. In the second phase the sampling process is dynamically scheduled based on estimated database size and quality parameters derived during the previous sampling process. The unique characteristic of our adaptive query-based sampling framework is its self-learning and self-configuring ability based on the overall quality of all text databases under consideration. We introduce three quality-conscious sampling schemes for estimating database quality, and our initial results show that the proposed framework supports higher-quality document sampling than existing approaches.
Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query throughput rates, parallel systems are used, in which the document and index data are split across tightly-clustered distributed computing systems. The index data can be distributed either by document or by term. In this paper we examine methods for load balancing in term-distributed parallel architectures, and propose a suite of techniques for reducing net querying costs. In combination, the techniques we describe allow a 30% improvement in query throughput when tested on an eight-node parallel computer system.
Text message stream is a newly emerging type of Web data which is produced in enormous quantities with the popularity of Instant Messaging and Internet Relay Chat. It is beneficial for detecting the threads contained in the text stream for various applications, including information retrieval, expert recognition and even crime prevention. Despite its importance, not much research has been conducted so far on this problem due to the characteristics of the data in which the messages are usually very short and incomplete. In this paper, we present a stringent definition of the thread detection task and our preliminary solution to it. We propose three variations of a single-pass clustering algorithm for exploiting the temporal information in the streams. An algorithm based on linguistic features is also put forward to exploit the discourse structure information. We conducted several experiments to compare our approaches with some existing algorithms on a real dataset. The results show that all three variations of the single-pass algorithm outperform the basic single-pass algorithm. Our proposed algorithm based on linguistic features improves the performance relatively by 69.5% and 9.7% when compared with the basic single-pass algorithm and the best variation algorithm in terms of F1 respectively.
We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: While short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided. We discuss the shortcomings of previous hybrid methods and give an experimental evaluation of the new technique, showing that its index maintenance performance is superior to that of the earlier methods, especially when the amount of main memory available to the indexing system is small. We also present a complexity analysis which proves that, under a Zipfian term distribution, the asymptotical number of disk accesses performed by the best hybrid maintenance strategy is linear in the size of the text collection, implying the asymptotical optimality of the proposed strategy.
Exhaustive evaluation of ranked queries can be expensive, particularly when only a small subset of the overall ranking is required, or when queries contain common terms. This concern gives rise to techniques for dynamic query pruning, that is, methods for eliminating redundant parts of the usual exhaustive evaluation, yet still generating a demonstrably "good enough" set of answers to the query. In this work we propose new pruning methods that make use of impact-sorted indexes. Compared to exhaustive evaluation, the new methods reduce the amount of computation performed, reduce the amount of memory required for accumulators, reduce the amount of data transferred from disk, and at the same time allow performance guarantees in terms of precision and mean average precision. These strong claims are backed by experiments using the TREC Terabyte collection and queries.
Classical query expansion techniques such as the local context analysis (LCA) make use of term co-occurrence statistics to incorporate additional contextual terms for enhancing passage retrieval. However, relevant contextual terms do not always co-occur frequently with the query terms and vice versa. Hence the use of such methods often brings in noise, which leads to reduced precision. Previous studies have demonstrated the importance of relationship analysis for natural language queries in passage retrieval. However, they found that without query expansion, the performance is not satisfactory for short queries. In this paper, we present two novel query expansion techniques that make use of dependency relation analysis to extract contextual terms and relations from external corpuses. The techniques are used to enhance the performance of density based and relation based passage retrieval frameworks respectively. We compare the performance of the resulting systems with LCA in a density based passage retrieval system (DBS) and a relation based system without any query expansion (RBS) using the factoid questions from the TREC-12 QA task. The results show that in terms of MRR scores, our relation based term
There is a growing interest in estimating the effectiveness of search. Two approaches are typically considered: examining the search queries and examining the retrieved document sets. In this paper, we take the latter approach. We use four measures to characterize the retrieved document sets and estimate the quality of search. These measures are (i) the clustering tendency as measured by the Cox-Lewis statistic, (ii) the sensitivity to document perturbation, (iii) the sensitivity to query perturbation and (iv) the local intrinsic dimensionality. We present experimental results for the task of ranking 200 queries according to the search effectiveness over the TREC (discs 4 and 5) dataset. Our ranking of queries is compared with the ranking based on the average precision using the Kendall t statistic. The best individual estimator is the sensitivity to document perturbation and yields Kendall t of 0.521. When combined with the clustering tendency based on the Cox-Lewis statistic and the query perturbation measure, it results in Kendall t of 0.562 which to our knowledge is the highest correlation with the average precision reported to date.
Document clustering is an important tool for text analysis and is used in many different applications. We propose to incorporate prior knowledge of cluster membership for document cluster analysis and develop a novel semi-supervised document clustering model. The method models a set of documents with weighted graph in which each document is represented as a vertex, and each edge connecting a pair of vertices is weighted with the similarity value of the two corresponding documents. The prior knowledge indicates pairs of documents that known to belong to the same cluster. Then, the prior knowledge is transformed into a set of constraints. The document clustering task is accomplished by finding the best cuts of the graph under the constraints. We apply the model to the Normalized Cut method to demonstrate the idea and concept. Our experimental evaluations show that the proposed document clustering model reveals remarkable performance improvements with very limited training samples, and hence is a very effective semi-supervised classification tool.
Text clustering is most commonly treated as a fully automated task without user feedback. However, a variety of researchers have explored mixed-initiative clustering methods which allow a user to interact with and advise the clustering algorithm. This mixed-initiative approach is especially attractive for text clustering tasks where the user is trying to organize a corpus of documents into clusters for some particular purpose (e.g., clustering their email into folders that reflect various activities in which they are involved). This paper introduces a new approach to mixed-initiative clustering that handles several natural types of user feedback. We first introduce a new probabilistic generative model for text clustering (the SpeClustering model) and show that it outperforms the commonly used mixture of multinomials clustering model, even when used in fully autonomous mode with no user input. We then describe how to incorporate four distinct types of user feedback into the clustering algorithm, and provide experimental evidence showing substantial improvements in text clustering when this user feedback is incorporated.
For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are not sufficiently accurate. This is due to the fact that the characteristics of near-duplicated documents are different from that of both "almost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Gathered from several collections of public comments sent to U.S. government agencies on proposed new regulations, the experimental results demonstrate that our approach outperforms other near-duplicate detection algorithms and as about as effective as human assessors.
Traditionally, information retrieval systems aim to maximize the number of relevant documents returned to a user within some window of the top. For that goal, the probability ranking principle, which ranks documents in decreasing order of probability of relevance, is provably optimal. However, there are many scenarios in which that ranking does not optimize for the users information need. One example is when the user would be satisfied with some limited number of relevant documents, rather than needing all relevant documents. We show that in such a scenario, an attempt to return many relevant documents can actually reduce the chances of finding any relevant documents.
Searching an organization's document repositories for experts provides a cost effective solution for the task of expert finding. We present two general strategies to expert searching given a document collection which are formalized using generative probabilistic models. The first of these directly models an expert's knowledge based on the documents that they are associated with, whilst the second locates documents on topic, and then finds the associated expert. Forming reliable associations is crucial to the performance of expert finding systems. Consequently, in our evaluation we compare the different approaches, exploring a variety of associations along with other operational parameters (such as topicality). Using the TREC Enterprise corpora, we show that the second strategy consistently outperforms the first. A comparison against other unsupervised techniques, reveals that our second model delivers excellent performance.
High precision at the top ranks has become a new focus of research in information retrieval. This paper presents the multiple nested ranker approach that improves the accuracy at the top ranks by iteratively re-ranking the top scoring documents. At each iteration, this approach uses the RankNet learning algorithm to re-rank a subset of the results. This splits the problem into smaller and easier tasks and generates a new distribution of the results to be learned by the algorithm. We evaluate this approach using different settings on a data set labeled with several degrees of relevance. We use the normalized discounted cumulative gain (NDCG) to measure the performance because it depends not only on the position but also on the relevance score of the document in the ranked list. Our experiments show that making the learning algorithm concentrate on the top scoring results improves precision at the top ten documents in terms of the NDCG score.
Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with possibly very large number of examples and features. At their core, our algorithms employ recently developed modified finite Newton techniques. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a Deterministic Annealing (DA) approach. This algorithm alleviates the problem of local minimum in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.
Automatic classification of data items, based on training samples, can be boosted by considering the neighborhood of data items in a graph structure (e.g., neighboring documents in a hyperlink environment or co-authors and their publications for bibliographic data entries). This paper presents a new method for graph-based classification, with particular emphasis on hyperlinked text documents but broader applicability. Our approach is based on iterative relaxation labeling and can be combined with either Bayesian or SVM classifiers on the feature spaces of the given data items. The graph neighborhood is taken into consideration to exploit locality patterns while at the same time avoiding overfitting. In contrast to prior work along these lines, our approach employs a number of novel techniques: dynamically inferring the link/class pattern in the graph in the run of the iterative relaxation labeling, judicious pruning of edges from the neighborhood graph based on node dissimilarities and node degrees, weighting the influence of edges based on a distance metric between the classification labels of interest and weighting edges by content similarity measures. Our techniques considerably improve the robustness and accuracy of the classification outcome, as shown in systematic experimental comparisons with previously published methods on three different real-world datasets.
Supervised learning approaches to text classification are in practice often required to work with small and unsystematically collected training sets. The alternative to supervised learning is usually viewed to be building classifiers by hand, using a domain expert's understanding of which features of the text are related to the class of interest. This is expensive, requires a degree of sophistication about linguistics and classification, and makes it difficult to use combinations of weak predictors. We propose instead combining domain knowledge with training examples in a Bayesian framework. Domain knowledge is used to specify a prior distribution for the parameters of a logistic regression model, and labeled training data is used to produce a posterior distribution, whose mode we take as the final classifier. We show on three text categorization data sets that this approach can rescue what would otherwise be disastrously bad training situations, producing much more effective classifiers.
We propose that the information access behavior of a group of people can be modeled as an information flow issue, in which people intentionally or unintentionally influence and inspire each other, thus creating an interest in retrieving or getting a specific kind of information or product. Information flow models how information is propagated in a social network. It can be a real social network where interactions between people reside; it can be, moreover, a virtual social network in that people only influence each other unintentionally, for instance, through collaborative filtering. We leverage users' access patterns to model information flow and generate effective personalized recommendations. First, an early adoption based information flow (EABIF) network describes the influential relationships between people. Second, based on the fact that adoption is typically category specific, we propose a topic-sensitive EABIF (TEABIF) network, in which access patterns are clustered with respect to the categories. Once an item has been accessed by early adopters, personalized recommendations are achieved by estimating whom the information will be propagated to with high probabilities. In our experiments with an online document recommendation system, the results demonstrate that the EABIF and the TEABIF can respectively achieve an improved (precision, recall)
We are interested in retrieving information from conversational speech corpora, such as call-center data. This data comprises spontaneous speech conversations with low recording quality, which makes automatic speech recognition (ASR) a highly difficult task. For typical call-center data, even state-of-the-art large vocabulary continuous speech recognition systems produce a transcript with word error rate of 30% or higher. In addition to the output transcript, advanced systems provide word confusion networks (WCNs), a compact representation of word lattices associating each word hypothesis with its posterior probability. Our work exploits the information provided by WCNs in order to improve retrieval performance. In this paper, we show that the mean average precision (MAP) is improved using WCNs compared to the raw word transcripts. Finally, we analyze the effect of increasing ASR word error rate on search effectiveness. We show that MAP is still reasonable even under extremely high error rate.
Collaborative filtering techniques have become popular in the past decade as an effective way to help people deal with information overload. Recent research has identified significant vulnerabilities in collaborative filtering techniques. Shilling attacks, in which attackers introduce biased ratings to influence recommendation systems, have been shown to be effective against memory-based collaborative filtering algorithms. We examine the effectiveness of two popular shilling attacks (the random attack and the average attack) on a model-based algorithm that uses Singular Value Decomposition (SVD) to learn a low-dimensional linear model. Our results show that the SVD-based algorithm is much more resistant to shilling attacks than memory-based algorithms. Furthermore, we develop an attack detection method directly built on the SVD-based algorithm and show that this method detects random shilling attacks with high detection rates and very low false alarm rates.
This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based "swap" method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.
We introduce and validate bootstrap techniques to compute confidence intervals that quantify the effect of test-collection variability on average precision (AP) and mean average precision (MAP) IR effectiveness measures. We consider the test collection in IR evaluation to be a representative of a population of materially similar collections, whose documents are drawn from an infinite pool with similar characteristics. Our model accurately predicts the degree of concordance between system results on randomly selected halves of the TREC-6 ad hoc corpus. We advance a framework for statistical evaluation that uses the same general framework to model other sources of chance variation as a source of input for meta-analysis techniques.
We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves. Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance is dependent on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.
Content-targeted advertising, the task of automatically associating ads to a Web page, constitutes a key Web monetization strategy nowadays. Further, it introduces new challenging technical problems and raises interesting questions. For instance, how to design ranking functions able to satisfy conflicting goals such as selecting advertisements (ads) that are relevant to the users and suitable and profitable to the publishers and advertisers? In this paper we propose a new framework for associating ads with web pages based on Genetic Programming (GP). Our GP method aims at learning functions that select the most appropriate ads, given the contents of a Web page. These ranking functions are designed to optimize overall precision and minimize the number of misplacements. By using a real ad collection and web pages from a newspaper, we obtained a gain over a state-of-the-art baseline method of 61.7% in average precision. Further, by evolving individuals to provide good ranking estimations, GP was able to discover ranking functions that are very effective in placing ads in web pages while avoiding irrelevant ones.
Many searches on the web have a transactional intent. We argue that pages satisfying transactional needs can be distinguished from the more common pages that have some information and links, but cannot be used to execute a transaction. Based on this hypothesis, we provide a recipe for constructing a transaction annotator. By constructing an annotator with one corpus and then demonstrating its classification performance on another, we establish its robustness. Finally, we show experimentally that a search procedure that exploits such pre-annotation greatly outperforms traditional search for retrieving transactional pages.
Information graphics are non-pictorial graphics such as bar charts and line graphs that depict attributes of entities and relations among entities. Most information graphics appearing in popular media have a communicative goal or intended message; consequently, information graphics constitute a form of language. This paper argues that information graphics are a valuable knowledge resource that should be retrievable from a digital library and that such graphics should be taken into account when summarizing a multimodal document for subsequent indexing and retrieval. But to accomplish this, the information graphic must be understood and its message recognized. The paper presents our Bayesian system for recognizing the primary message of one kind of information graphic (simple bar charts) and discusses the potential role of an information graphic's message in indexing graphics and summarizing multimodal documents.
We present an evaluation of a novel hierarchical text summarization method that allows users to view summaries of Web documents from small, mobile devices. Unlike previous approaches, ours does not require the documents to be in HTML since it infers a hierarchical structure automatically. Currently, the method is used to summarize news articles sent to a Web mail account in plain text format. Subjects used a Web-enabled mobile phone emulator to access the account's inbox and view the summarized news articles. They then used the summaries to complete several information-seeking tasks, which involved answering factual questions about the stories. In comparing the hierarchical text summary setting to that in which subjects were given the full text articles, there was no significant difference in task accuracy or the time taken to complete the task. However, in the hierarchical summarization setting, the number of bytes transferred per user request is less than half that of the full text case. Finally, in comparing the new method to three other summarization methods, subjects achieved significantly better accuracy on the tasks when using hierarchical summaries.
Automated singer identification is important in organising, browsing and retrieving data in large music databases. In this paper, we propose a novel scheme, called Hybrid Singer Identifier (HSI), for automated singer recognition. HSI can effectively use multiple low-level features extracted from both vocal and non-vocal music segments to enhance the identification process with a hybrid architecture and build profiles of individual singer characteristics based on statistical mixture models. Extensive experimental results conducted on a large music database demonstrate the superiority of our method over state-of-the-art approaches.
Clustering of search results is an important feature in many of today's information retrieval applications. The notion of hit list clustering appears in Web search engines and enterprise search engines as a mechanism that allows users to further explore the coverage of a query. However, there has been little work on exposing temporal attributes for constructing and presentation of clusters. These attributes appear in documents as part of the textual content, e.g., as a date and time token or as a temporal reference in a sentence. In this paper, we outline a model and describe a prototype that shows the main ideas.
We developed a prototype for integrated retrieval and aggregation of diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing together with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. This is the first system to attempt integrated retrieval from complex documents; we report its current capabilities.
We consider the problem of evaluating retrieval systems using a limited number of relevance judgments. Recent work has demonstrated that one can accurately estimate average precision via a judged pool corresponding to a relatively small random sample of documents. In this work, we demonstrate that given values or estimates of average precision, one can accurately infer the relevances of unjudged documents. Combined, we thus show how one can efficiently and accurately infer a large judged pool from a relatively small number of judged documents, thus permitting accurate and efficient retrieval evaluation on a large scale.
The primary aim of XML element retrieval is to return to users XML elements, rather than whole documents. This poster describes a small study, in which we elicited users' expectations, i.e. their anticipated experience, when interacting with an XML retrieval system, as compared to a traditional 'flat' document retrieval system.
We propose an integration of term proximity scoring into Okapi BM25. The relative retrieval effectiveness of our retrieval method, compared to pure BM25, varies from collection to collection. We present an experimental evaluation of our method and show that the gains achieved over BM25 as the size of the underlying text collection increases. We also show that for stemmed queries the impact of term proximity scoring is larger than for unstemmed queries.
The Web search multitasking study based on automatic task session detection procedure is described. The results of the study: 1) multitasking is very rare, 2) it usually covers only 2 task sessions, 3) it is frequently formed into a temporal inclusion of an interrupting task session into the interrupted session, 4) the quantitative characteristics of multitasking greatly differ from the characteristics of sequential execution of one and several tasks. A searcher minimizes task switching costs: he avoids multitasking and while multitasking he uses cheapest manner of task switching.
Vector Space Model (VSM) has been at the core of information retrieval for the past decades. VSM considers the documents as vectors in high dimensional space. In such a vector space, techniques like Latent Semantic Indexing (LSI), Support Vector Machines (SVM), Naive Bayes, etc., can be then applied for indexing and classification. However, in some cases, the dimensionality of the document space might be extremely large, which makes these techniques infeasible due to the curse of dimensionality. In this paper, we propose a novel Tensor Space Model for document analysis. We represent documents as the second order tensors, or matrices. Correspondingly, a novel indexing algorithm called Tensor Latent Semantic Indexing (TensorLSI) is developed in the tensor space. Our theoretical analysis shows that TensorLSI is much more computationally efficient than the conventional Latent Semantic Indexing, which makes it applicable for extremely large scale data set. Several experimental results on standard document data sets demonstrate the efficiency and effectiveness of our algorithm.
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, 72 ad hoc queries and has a size of about 800MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching functions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
We introduce a novel approach to combining rankings from multiple retrieval systems. We use a logistic regression model or an SVM to learn a ranking from pairwise document preferences. Our approach requires no training data or relevance scores, and outperforms a popular voting algorithm.
In interactive question answering (QA), users and systems take turns to ask questions and provide answers. In such an interactive setting, user questions largely depend on the answers provided by the system. One question is whether user follow-up questions can provide feedback for the system to automatically assess its performance (e.g., assess whether a correct answer is delivered). This self-awareness can make QA systems more intelligent for information seeking, for example, by adapting better strategies to cope with problematic situations. Therefore, this paper describes our initial investigation in addressing this problem. Our results indicate that interaction context can provide useful cues for automated performance assessment in interactive QA.
Web catalog integration is an interesting problem in current digital content management. Past studies have shown that using a flattened structure with auxiliary information extracted from the source catalog can improve the integration results. However, the nature of a flattened structure ignores the hierarchical relationships, and thus the performance improvement of catalog integration may be reduced. In this paper, we propose an enhanced hierarchical catalog integration (EHCI) approach with conceptual thesauri extracted from the source catalog. The results show that our enhanced hierarchical integration approach effectively boosts the accuracy of hierarchical catalog integration.
We present rpref, our generalization of the bpref evaluation metric for assessing the quality of search engine results, given graded rather than binary user relevance judgments.
In this paper, we present a novel multi-webpage summarization algorithm. It adds the graph based ranking algorithm into the framework of Maximum Marginal Relevance (MMR) method, to not only capture the main topic of the web pages but also eliminate the redundancy existing in the sentences of the summary result. The experiment result indicates that the new approach has the better performance than the previous methods.
In this paper, we show that PLSI and NMF optimize the same objective function, although PLSI and NMF are different algorithms as verified by experiments. In addition, we also propose a new hybrid method that runs PLSI and NMF alternatively to achieve better solutions.
This paper employs ConceptNet, which covers a rich set of commonsense concepts, to retrieve images with text descriptions by focusing on spatial relationships. Evaluation on test data of the 2005 ImageCLEF shows that integrating commonsense knowledge in information retrieval is feasible.
Practical constrains of user interfaces make the user's judgment (during the feedback loop) deviate from real thoughts (when the full document is read). This is often overlooked in evaluation of relevance feedback. This paper quantitatively analyze the impact of judging inconsistency on the performance of relevance feedback.
Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have encountered some success, however they are restricted by their ability to identify a limited number of affixes and are language dependent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the possible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can be simply impractical. We propose an unsupervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, possible morphemes are identified. We look at two possible methods for identifying candidate morphemes, one using joint probabilities between two ngrams, and the second based on log odds between prefix probabilities. Initial results indicate the joint probability approach to be better for English while the prefix ratio approach is better for Finnish and Turkish.
In this paper, we use a blog corpus to demonstrate that we can often identify the author of an anonymous text even where there are many thousands of candidate authors. Our approach combines standard information retrieval methods with a text categorization meta-learning scheme that determines when to even venture a guess.
We explore interactive methods to further improve the performance of pseudo-relevance feedback. Studies [4] suggest that new methods for tackling difficult queries are required. Our approach is to gather more information about the query from the user by asking her simple questions. The equally simple responses are used to modify the original query. Our experiments using the TREC Robust Track queries show that we can obtain a significant improvement in mean average precision averaging around 5% over pseudo-relevance feedback. This improvement is also spread across more queries compared to ordinary pseudo-relevance feedback, as suggested by geometric mean average precision.
The aim of this study is to investigate whether element retrieval (as opposed to full-text retrieval) is meaningful and useful for searchers when carrying out information-seeking tasks. Our results suggest that searchers find the structural breakdown of documents useful when browsing within retrieved documents, and provide support for the usefulness of element retrieval in interactive settings.
Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.
In this paper, we propose a new strategy with time granularity reasoning for utilizing temporal information in topic tracking. Compared with previous ones, our work has four distinguished characteristics. Firstly, we try to determine a set of topic times for a target topic from the given on-topic stories. It helps to avoid the negative influence from other irrelevant times. Secondly, we take into account time granularity variance when deciding whether a coreference relationship exists between two times. Thirdly, both publication time and times presented in texts are considered. Finally, as time is only one attribute of a topic, we increase the similarity between a story and a target topic only when they are related not only temporally but also semantically. Experiments on two TDT corpora show that our method makes good use of temporal information in news stories.
This study investigates the impact of different search feature designs in DLs on user search experience. The results indicate that the impact is significant in terms of the number of queries issued, search steps, zero-hits pages returned, and search errors.
This paper proposes a novel framework for music content indexing and retrieval. The music structure information, i.e., timing, harmony and music region content, is represented by the layers of the music structure pyramid. We begin by extracting this layered structure information. We analyze the rhythm of the music and then segment the signal proportional to the inter-beat intervals. Thus, the timing information is incorporated in the segmentation process, which we call Beat Space Segmentation. To describe Harmony Events, we propose a two-layer hierarchical approach to model the music chords. We also model the progression of instrumental and vocal content as Acoustic Events. After information extraction, we propose a vector space modeling approach which uses these events as the indexing terms. In query-by-example music retrieval, a query is represented by a vector of the statistics of the n-gram events. We then propose two effective retrieval models, a hard-indexing scheme and a soft-indexing scheme. Experiments show that the vector space modeling is effective in representing the layered music information, achieving 82.5% top-5 retrieval accuracy using 15-sec music clips as the queries. The soft-indexing outperforms hard-indexing in general.
Early speech retrieval experiments focused on news broadcasts, for which adequate Automatic Speech Recognition (ASR) accuracy could be obtained. Like newspapers, news broadcasts are a manually selected and arranged set of stories. Evaluation designs reflected that, using known story boundaries as a basis for evaluation. Substantial advances in ASR accuracy now make it possible to build search systems for some types of spontaneous conversational speech, but present evaluation designs continue to rely on known topic boundaries that are no longer well matched to the nature of the materials. We propose a new class of measures for speech retrieval based on manual annotation of points at which a user with specific topical interests would wish replay to begin.
Emails are examples of structured documents with various fields. These fields can be exploited to enhance the retrieval effectiveness of an Information Retrieval (IR) system that mailing list archives. In recent experiments of the TREC2005 Enterprise track, various fields were applied to varying degrees of success by the participants. In his work, using a field-based weighting model, we investigate the retrieval performance attainable by each field, and examine when fields evidence should be combined or not.
We present a simple way to improve document retrieval for question answering systems. The method biases the retrieval system toward documents that contain words that have appeared in other documents containing answers to the same type of question. The method works with virtually any retrieval system, and exhibits a statistically significant performance improvement over a strong baseline.
We consider the relationship between training set size and the parameter k for the k-Nearest Neighbors (kNN) classifier. When few examples are available, we observe that accuracy is sensitive to k and that best k tends to increase with training size. We explore the subsequent risk that k tuned on partitions will be suboptimal after aggregation and re-training. This risk is found to be most severe when little data is available. For larger training sizes, accuracy becomes increasingly stable with respect to k and the risk decreases.
Methods for detecting sentences in an input document set, which are both relevant and novel with respect to an information need, would be of direct benefit to many systems, such as extractive text summarizers. However, satisfactory levels of agreement between judges performing this task manually have yet to demonstrated, leaving researchers to conclude that the task is too subjective. In previous experiments, judges were asked to first identify sentences that are relevant to a general topic, and then to eliminate sentences from the list that do not contain new information. Currently, a new task is proposed, in which annotators perform the same procedure, but within the context of a specific, factual information need. In the experiment, satisfactory levels of agreement between independent annotators were achieved on the first step of identifying sentences containing relevant information relevant. However, the results indicate that judges do not agree on which sentences contain novel information.
The exponential growth of the Web and the increasing ability of web search engines to index data have led to a problem of plenty. The number of results returned per query is typically in the order of millions of documents for many common queries. Although there is the benefit of added coverage for every query, the problem of ranking these documents and giving the best results gets worse. The problem is even more difficult in case of temporal and ambiguous queries. We try to address this problem using feedback from user query logs. We leverage a technology called Units for generating query refinements which are shown as Also try queries on Yahoo! Search. We consider these refinements as sub-concepts which help define user intent and use them to improve search relevance. The results obtained via live testing on Yahoo! Search are encouraging.
We present and evaluate methods for diversifying search results to improve personalized web search. A common personalization approach involves reranking the top N search results such that documents likely to be preferred by the user are presented higher. The usefulness of reranking is limited in part by the number and diversity of results considered. We propose three methods to increase the diversity of the top results and evaluate the effectiveness of these methods.
Small XML elements are often estimated relevant by the retrieval model but they are not desirable retrieval units. This paper presents a generic model that exploits the information obtained from small elements. We identify relationships between small and relevant elements and use this linking information to reinforce the relevance of other elements before removing the small ones. Our experiments using the INEX testbed show the effectiveness of our approach.
We introduce an evaluation metric called P-measure for the task of retrieving one highly relevant document. It models user behaviour in practical tasks such as known-item search, and is more stable and sensitive than Reciprocal Rank which cannot handle graded relevance.
The performance of document clustering systems depends on employing optimal text representations, which are not only difficult to determine beforehand, but also may vary from one clustering problem to another. As a first step towards building robust document clusterers, a strategy based on feature diversity and cluster ensembles is presented in this work. Experiments conducted on a binary clustering problem show that our method is robust to near-optimal model order selection and able to detect constructive interactions between different document representations in the test bed.
We hypothesized that language modeling retrieval would improve if we reduced the need for document smoothing to provide an inverse document frequency (IDF) like effect. We created inverse collection frequency (ICF) weighted query models as a tool to partially separate the IDF-like role from document smoothing. Compared to maximum likelihood estimated (MLE) queries, the ICF weighted queries achieved a 6.4% improvement in mean average precision on description queries. The ICF weighted queries performed better with less document smoothing than that required by MLE queries. Language modeling retrieval may benefit from a means to separately incorporate an IDF-like behavior outside of document smoothing.
Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.
Thanks to the ubiquity of the Internet search engine search box, users have come to depend on search engines both to find and re-find information. However, re-finding behavior has not been significantly addressed. Here we look at re-finding queries issued to the Yahoo! search engine by 114 users over a year.
We report the statistically significant mean impacts of blind feedback, as implemented by 7 participants for the 2003 Reliable Information Access (RIA) Workshop, on 30 retrieval measures, including several primary recall measures not originally reported. We find that blind feedback was detrimental to measures focused on the first relevant item even when it boosted "early precision" measures such as mean Precision@10, implying that the conventional reporting of ad hoc precision needs enhancement.
For many years it has been commonly held that a user who adds structural "hints" to a query will improve precision in an element retrieval search. At INEX 2005 we conducted an experiment to test this assumption. We present the unexpected result that structural hints in queries do not improve precision. An analysis of the topics and the judgments suggests that this is because users are particularly bad at giving structural hints.
In this poster, we describe the study of an interface technique that provides a list of suggested additional query terms as a searcher types a search query, in effect offering interactive query expansion (IQE) options while the query is formulated. Analysis of the results shows that offering IQE during query formulation leads to better quality initial queries, and an increased uptake of query expansion. These findings have implications for how IQE should be offered in retrieval interfaces.
Label propagation exploits the structure of the unlabeled documents by propagating the label information of the training documents to the unlabeled documents. The limitation with the existing label propagation approaches is that they can only deal with a single type of objects. We propose a framework, named "relation propagation", that allows for information propagated among multiple types of objects. Empirical studies with multi-label text categorization showed that the proposed algorithm is more effective than several semi-supervised learning algorithms in that it is capable of exploring the correlation among different categories and the structure of unlabeled documents simultaneously.
In this work, we study similarity measures for text-centric XML documents based on an extended vector space model, which considers both document content and structure. Experimental results based on a benchmark showed superior performance of the proposed measure over the baseline which ignores structural knowledge of XML documents.
We discuss information retrieval methods that aim at serving a diverse stream of user queries. We propose methods that emphasize the importance of taking into consideration of query difference in learning effective retrieval functions. We formulate the problem as a multi-task learning problem using a risk minimization framework. In particular, we show how to calibrate the empirical risk to incorporate query difference in terms of introducing nuisance parameters in the statistical models, and we also propose an alternating optimization method to simultaneously learn the retrieval function and the nuisance parameters. We illustrate the effectiveness of the proposed methods using modeling data extracted from a commercial search engine.
One challenging problem for biomedical text retrieval is to find accurate synonyms or name variants for biomedical entities. In this paper, we propose a new concept-based approach to tackle this problem. In this approach, a set of concepts instead of keywords will be extracted from a query first. Then these concepts will be used for retrieval purpose. The experiment results show that the proposed approach can boost the retrieval performance and it generates very good results on 2005 TREC Genomics data sets.
Much interesting text on the web consists largely of opinionated or evaluative text, as opposed to directly informative text. The new field of 'sentiment analysis' seeks to characterize such aspects of natural language text, as opposed to just the bare facts. We suggest that 'appraisal expression extraction' should be viewed as a fundamental task for sentiment analysis. We define an 'appraisal expression' to be a piece of text expressing some evaluative stance towards a particular object. The task is to find these elements and characterize the type and orientation (positive or negative) of the evaluative stance, as well as its target and possibly its source. Potential applications of these methods include new approaches to the now-traditional tasks of sentiment classification and pinion mining, as well as possibly for adversarial textual analysis and intention detection for intelligence applications.
We present an extensible java-based platform for contextual retrieval based on the probabilistic information retrieval model. Modules for dual indexes, relevance feedback with blind or machine learning approaches and query expansion with context are integrated into the Okapi system to deal with the contextual information. This platform allows easy extension to include other types of contextual information.
The Personal Project Planner prototype works as an extension to the file manager to provide people with rich-text overlays to their information (folders, files and also email, web pages, notes). Rich-text, document-like project plans can be created which then provide a context in which to create or reference the email messages, electronic documents, web pages, etc. that are needed to complete the plan. The user can later locate an information item such as an email message with reference to the plan (e.g., as an alternative to a mostly context-free search through the inbox or sent mail). The Planner explores a possibility that an effective organization of project-related information can emerge as a natural by-product of efforts to plan and structure the project.
A new shot level video retrieval system that supports semantic visual features (e.g., car, mountain, and fire) browsing is developed to facilitate content-based retrieval. The video's binary semantic feature vector is utilized to calculate the score of similarity between two shot keyframes. The score is then used to browse the "similar" keyframes in terms of semantic visual features.
Since the website is one of the most important organizational structures of the Web, how to effectively rank websites has been essential to many Web applications, such as Web search and crawling. In order to get the ranks of websites, researchers used to describe the inter-connectivity among websites with a so-called HostGraph in which the nodes denote websites and the edges denote linkages between websites (if and only if there are hyperlinks from the pages in one website to the pages in the other, there will be an edge between these two websites), and then adopted the random walk model in the HostGraph. However, as pointed in this paper, the random walk over such a HostGraph is not reasonable because it is not in accordance with the browsing behavior of web surfers. Therefore, the derivate rank cannot represent the true probability of visiting the corresponding website. In this work, we mathematically proved that the probability of visiting a website by the random web surfer should be equal to the sum of the PageRank values of the pages inside that website. Nevertheless, since the number of web pages is much larger than that of websites, it is not feasible to base the calculation of the ranks of websites on the calculation of PageRank. To tackle this problem, we proposed a novel method named AggregateRank rooted in the theory of stochastic complement, which cannot only approximate the sum of PageRank accurately, but also have a lower computational complexity than PageRank. Both theoretical analysis and experimental evaluation show that AggregateRank is a better method for ranking websites than previous methods.
We present an approach to improving the precision of an initial document ranking wherein we utilize cluster information within a graph-based framework. The main idea is to perform reranking based on centrality within bipartite graphs of documents (on one side) and clusters (on the other side), on the premise that these are mutually reinforcing entities. Links between entities are created via consideration of language models induced from them. We find that our cluster-document graphs give rise to much better retrieval performance than previously proposed document-only graphs do. For example, authority-based reranking of documents via a HITS-style cluster-based approach outperforms a previously-proposed PageRank-inspired algorithm applied to solely-document graphs. Moreover, we also show that computing authority scores for clusters constitutes an effective method for identifying clusters containing a large percentage of relevant documents.
Traditional web link-based ranking schemes use a single score to measure a page's authority without concern of the community from which that authority is derived. As a result, a resource that is highly popular for one topic may dominate the results of another topic in which it is less authoritative. To address this problem, we suggest calculating a score vector for each page to distinguish the contribution from different topics, using a random walk model that probabilistically combines page topic distribution and link structure. We show how to incorporate the topical model within both PageRank and HITS without affecting the overall property and still render insight into topic-level transition. Experiments on multiple datasets indicate that our technique outperforms other ranking approaches that incorporate textual analysis.
Give us your opinion! Do you have any comments/additions that you would like other visitors to see?
Yousay:
Mar 20th, 2010
#1
Be the first to add a thoughtful note to this page !
Changes to this page (conference)
24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited 24 Jun 2007: Conference Proceedings was edited
Computer programs emerge as the outcome of complex human processes of cognition, communication and negotiation, which serve to establish the meaningful embedding of the computer system in its intended use context.
Page maintainer: The Editorial Team How to cite/reference this page
URL: http://www.interaction-design.org/references/conferences/proceedings_of_the_29th_annual_international_acm_sigir_conference_on_research_and_development_in_information_retrieval.html