SIGIR is the major international forum for the presentation of new research results and the demonstration of new systems and techniques in the field of information retrieval.
The following articles are from "Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval":
The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval specialists. Recent results range from theoretical (e.g., models for the graph, semi-external algorithms), to experimental (e.g., new insights regarding the rate of change of pages, new data on the distribution of degrees), to practical (e.g., improvements in crawling technology). The goal of this talk is to convey an introduction to the state of the art in this area and to sketch the current issues in collecting, representing, analyzing, and modeling this graph. Although graph analytic methods are essential tools in the Web IR arsenal, they are well known to the SIGIR community and will not be discussed here in any detail; instead, we will explore some challenges and opportunities for using IR methods and techniques in the exploration of the Web graph, in particular in dealing with legitimate and "spam" perturbations of the "natural" process of birth and death of nodes and links, and conversely, the challenges and opportunities of using graph methods in support of IR on the Web and in the enterprise.
We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which is assumed in most traditional retrieval methods. Subtopic retrieval poses challenges for evaluating performance, as well as for developing effective algorithms. We propose a framework for evaluating subtopic retrieval which generalizes the traditional precision and recall metrics by accounting for intrinsic topic difficulty as well as redundancy in documents. We propose and systematically evaluate several methods for performing subtopic retrieval using statistical language models and a maximal marginal relevance (MMR) ranking strategy. A mixture model combined with query likelihood relevance ranking is shown to modestly outperform a baseline relevance ranking on a data set used in the TREC interactive track.
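The MMR ranking strategy mentioned above can be sketched as a greedy re-ranking loop. This is a minimal illustration, not the paper's method: the relevance scores `query_sim`, the pairwise similarities `doc_sim`, and the trade-off parameter `lam` are all assumed inputs here, whereas the paper derives its scores from statistical language models.

```python
def mmr_rank(query_sim, doc_sim, lam=0.7, k=None):
    """Greedy maximal marginal relevance ranking (illustrative sketch).

    query_sim[i]  : assumed relevance score of document i to the query
    doc_sim[i][j] : assumed similarity between documents i and j
    lam           : trade-off between relevance (1.0) and novelty (0.0)
    Returns document indices in ranked order.
    """
    n = len(query_sim)
    k = n if k is None else min(k, n)
    selected = []
    remaining = list(range(n))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the similarity to the closest already-ranked document.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Documents 0 and 1 are near-duplicates; document 2 covers something else.
    rel = [0.9, 0.85, 0.5]
    sim = [[1.0, 0.95, 0.1],
           [0.95, 1.0, 0.1],
           [0.1, 0.1, 1.0]]
    print(mmr_rank(rel, sim, lam=0.5))
```

With `lam=1.0` the loop reduces to plain relevance ranking, which is the independent-relevance baseline that subtopic retrieval departs from.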
We suggest a way for locating duplicates and plagiarisms in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. Another reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization, and interestingly, that results correlate strongly with compression-based methods.
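To make the R-measure definition concrete, here is a naive sketch. It assumes that "repeated" means the longest prefix of each suffix that occurs in some other document, and that the normalizer is the largest possible sum, l(l+1)/2 for a text of length l; both are assumptions for illustration, since the abstract does not spell them out. The brute-force substring search stands in for the suffix array the authors use and is only suitable for tiny inputs.

```python
def r_measure(text, others):
    """Naive R-measure sketch: fraction of suffix material of `text`
    that is repeated somewhere in the documents in `others`.

    Assumed normalization: the maximum possible sum n * (n + 1) / 2.
    """
    n = len(text)
    if n == 0:
        return 0.0
    total = 0
    for i in range(n):
        # Length of the longest prefix of text[i:] found in any other document.
        length = 0
        while i + length < n and any(text[i:i + length + 1] in o for o in others):
            length += 1
        total += length
    return 2.0 * total / (n * (n + 1))


if __name__ == "__main__":
    print(r_measure("abc", ["abc"]))   # exact duplicate
    print(r_measure("abc", ["xyz"]))   # nothing shared
    print(r_measure("abcd", ["abxx"])) # partial overlap
```

Under these assumptions an exact duplicate scores 1 and a text sharing no characters with the collection scores 0.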
Text classifiers that give probability estimates are more readily applicable in a variety of scenarios. For example, rather than choosing one set decision threshold, they can be used in a Bayesian risk model to issue a run-time decision which minimizes a user-specified cost function dynamically chosen at prediction time. However, the quality of the probability estimates is crucial. We review a variety of standard approaches to converting scores (and poor probability estimates) from text classifiers to high quality estimates and introduce new models motivated by the intuition that the empirical score distribution for the "extremely irrelevant", "hard to discriminate", and "obviously relevant" items are often significantly different. Finally, we analyze the experimental performance of these models over the outputs of two text classifiers. The analysis demonstrates that one of these models is theoretically attractive (introducing few new parameters while increasing flexibility), computationally efficient, and empirically preferable.
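The run-time decision described above can be illustrated with a two-action Bayes risk rule. This is a minimal sketch assuming a binary accept/reject decision in which correct decisions cost nothing; the function name and the cost parameters `cost_fp` and `cost_fn` are hypothetical and not taken from the paper.

```python
def min_risk_decision(p_relevant, cost_fp, cost_fn):
    """Pick the action with the lower expected cost.

    p_relevant : calibrated estimate of P(relevant | document)
    cost_fp    : assumed cost of accepting a non-relevant document
    cost_fn    : assumed cost of rejecting a relevant document
    """
    risk_accept = (1.0 - p_relevant) * cost_fp
    risk_reject = p_relevant * cost_fn
    return "accept" if risk_accept <= risk_reject else "reject"


if __name__ == "__main__":
    # The same probability estimate yields different decisions
    # when the user supplies different costs at prediction time.
    print(min_risk_decision(0.3, cost_fp=1.0, cost_fn=10.0))
    print(min_risk_decision(0.3, cost_fp=1.0, cost_fn=1.0))
```

The rule only behaves sensibly when `p_relevant` is a well-calibrated probability, which is why the quality of the estimates matters.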
Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor-intensive procedure and hence there has been great interest in coming up with automatic ways to retrieve images based on content. Here, we propose an automatic approach to annotating and retrieving images based on a training set of images. We assume that regions in an image can be described using a small vocabulary of blobs. Blobs are generated from image features using clustering. Given a training set of images with annotations, we show that probabilistic models allow us to predict the probability of generating a word given the blobs in an image. This may be used to automatically annotate and retrieve images given a word as a query. We show that relevance models allow us to derive these probabilities in a natural way. Experiments show that the annotation performance of this cross-media relevance model is almost six times as good (in terms of mean precision) as a model based on word-blob co-occurrence and twice as good as a state-of-the-art model derived from machine translation. Our approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval.
We consider the problem of modeling annotated data -- data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). We describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. We conduct experiments on the Corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval.
The main conclusion from the metrics-based evaluation of video retrieval systems at TREC's video track is that non-interactive image retrieval from general collections using visual information only is not yet feasible. We show how a detailed analysis of retrieval results -- looking beyond mean average precision (MAP) scores on topical relevance -- gives significant insight into the main problems with the visual part of the retrieval model under study. Such an analytical approach proves an important addition to standard evaluation measures.
This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search. As this task is very similar to work in meta-search and data fusion, we adapt several hypotheses from those research areas and investigate them in this context. To investigate these hypotheses, we present a mixture-based language model and also examine many of the current meta-search algorithms. We find that compatible output from systems is important for successful combination of document representations. We also demonstrate that combining low performing document representations can improve performance, but not consistently. We find that the techniques best suited for this task are robust to the inclusion of poorly performing document representations. We also explore the role of variance of results across systems and its impact on the performance of fusion, with the surprising result that the correct documents have higher variance across document representations than highly ranking incorrect documents.
Most of the work on XML query and search has stemmed from the publishing and database communities, mostly for the needs of business applications. Recently, the Information Retrieval community began investigating the XML search issue to answer information discovery needs. Following this trend, we present here an approach where information needs can be expressed in an approximate manner as pieces of XML documents or "XML fragments" of the same nature as the documents that are being searched. We present an extension of the vector space model for searching XML collections via XML fragments and ranking results by relevance. We describe how we have extended a full-text search engine to comply with this model. The value of the proposed method is demonstrated by the relatively high precision of our system, which was among the top performers in the recent INEX workshop. Our results indicate that certain queries are more appropriate than others for the extended vector space model. Specifically, queries with relatively specific contexts but vague information needs are best situated to reap the benefit of this model. Finally, our results show that one method may not fit all types of queries and that it could be worthwhile to use different solutions for different applications.
Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the application of automated word sense disambiguation, the question remains as to whether less than 90% accurate automated word sense disambiguation can lead to improvements in retrieval effectiveness. In this study we explore the development and subsequent evaluation of a statistical word sense disambiguation system which demonstrates increased precision from a sense based vector space retrieval model over traditional TF*IDF techniques.
This paper presents an algorithm to generate possible variants for biomedical terms. The algorithm gives each variant its generation probability representing its plausibility, which is potentially useful for query and dictionary expansions. The probabilistic rules for generating variants are automatically learned from raw texts using an existing abbreviation extraction technique. Our method, therefore, requires no linguistic knowledge or labor-intensive natural language resource. We conducted an experiment using 83,142 MEDLINE abstracts for rule induction and 18,930 abstracts for testing. The results indicate that our method will significantly increase the number of retrieved documents for long biomedical terms.
A novel maximal figure-of-merit (MFoM) learning approach to text categorization is proposed. Different from the conventional techniques, the proposed MFoM method attempts to integrate any performance metric of interest (e.g. accuracy, recall, precision, or F1 measure) into the design of any classifier. The corresponding classifier parameters are learned by optimizing an overall objective function of interest. To solve this highly nonlinear optimization problem, we use a generalized probabilistic descent algorithm. The MFoM learning framework is evaluated on the Reuters-21578 task with LSI-based feature extraction and a binary tree classifier. Experimental results indicate that the MFoM classifier gives improved F1 and enhanced robustness over the conventional one. It also outperforms the popular SVM method in micro-averaging F1. Other extensions to design discriminative multiple-category MFoM classifiers for application scenarios with new performance metrics could be envisioned too.
Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular retrieval algorithm. We explore the explicit assumptions underlying the naive framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as other applications. We test this by learning from a corpus the best document model. We find the learned model better predicts the existence of text data and has improved performance on certain IR tasks.
Term-based representations of documents have found wide-spread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. In this paper we investigate the use of concept-based document representations to supplement word- or phrase-based features. The utilized concepts are automatically extracted from documents via probabilistic latent semantic analysis. We propose to use AdaBoost to optimally combine weak hypotheses based on both types of features. Experimental results on standard benchmarks confirm the validity of our approach, showing that AdaBoost achieves consistent improvements by including additional semantic features in the learned ensemble.
Real-world applications often require the classification of documents in situations with a small number of features, mislabeled documents, and rare positive examples. This paper investigates the robustness of three regularized linear classification methods (SVM, ridge regression and logistic regression) in these situations. We compare these methods in terms of their loss functions and score distributions, and establish the connection between their optimization problems and generalization error bounds. Several sets of controlled experiments on the Reuters-21578 corpus are conducted to investigate the robustness of these methods. Our results show that ridge regression seems to be the most promising candidate for rare class problems.
Term dependence is a natural consequence of language use. Its successful representation has been a long standing goal for Information Retrieval research. We present a methodology for the construction of a concept hierarchy that takes into account the three basic dimensions of term dependence. We also introduce a document evaluation function that allows the use of the concept hierarchy as a user profile for Information Filtering. Initial experimental results indicate that this is a promising approach for incorporating term dependence in the way documents are filtered.
Following the tradition of these acceptance talks, I will be giving my thoughts on where our field is going. Any discussion of the future of information retrieval (IR) research, however, needs to be placed in the context of its history and relationship to other fields. Although IR has had a very strong relationship with library and information science, its relationship to computer science (CS) and its relative standing as a sub-discipline of CS has been more dynamic. IR is quite an old field, and when a number of CS departments were forming in the 60s, it was not uncommon for a faculty member to be pursuing research related to IR. Early ACM curriculum recommendations for CS contained courses on information retrieval, and encyclopedias described IR and database systems as different aspects of the same field.
Query length in best-match information retrieval (IR) systems is well known to be positively related to effectiveness in the IR task, when measured in experimental, non-interactive environments. However, in operational, interactive IR systems, query length is quite typically very short, on the order of two to three words. We report on a study which tested the effectiveness of a particular query elicitation technique in increasing initial searcher query length, and which tested the effectiveness of queries elicited using this technique, and the relationship in general between query length and search effectiveness in interactive IR. Results show that the specific technique results in longer queries than a standard query elicitation technique, that this technique is indeed usable, that the technique results in increased user satisfaction with the search, and that query length is positively correlated with user satisfaction with the search.
Much attention has been paid to the relative effectiveness of interactive query expansion versus automatic query expansion. Although interactive query expansion has the potential to be an effective means of improving a search, in this paper we show that, on average, human searchers are less likely than systems to make good expansion decisions. To enable good expansion decisions, searchers must have adequate instructions on how to use interactive query expansion functionalities. We show that simple instructions on using interactive query expansion do not necessarily help searchers make good expansion decisions and discuss difficulties found in making query expansion decisions.
We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword. This method is demonstrated empirically by duplicating all documents containing a term t, and inserting new documents in the database that replace t with t'. By examining the number of times term t is identified for a search on term t' (precision) using differing ranges of dimensions, we find that lower ranked dimensions identify related terms and higher-ranked dimensions discriminate between the synonyms.
The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content present difficulties for traditional language modeling techniques, however. This paper presents the use of conditional random fields (CRFs) for table extraction, and compares them with hidden Markov models (HMMs). Unlike HMMs, CRFs support the use of many rich and overlapping layout and language features, and as a result, they perform significantly better. We show experimental results on plain-text government statistical reports in which tables are located with 92% F1, and their constituent lines are classified into 12 table-related categories with 94% accuracy. We also discuss future work on undirected graphical models for segmenting columns, finding cells, and classifying them as data cells or label cells.
Test collections for the filtering track in TREC have typically used either past sets of relevance judgments, or categorized collections such as Reuters Corpus Volume 1 or OHSUMED, because filtering systems need relevance judgments during the experiment for training and adaptation. For TREC 2002, we constructed an entirely new set of search topics for the Reuters Corpus for measuring filtering systems. Our method for building the topics involved multiple iterations of feedback from assessors, and fusion of results from multiple search systems using different search algorithms. We also developed a second set of "inexpensive" topics based on categories in the document collection. We found that the initial judgments made for the experiment were sufficient; subsequent pooled judging changed system rankings very little. We also found that systems performed very differently on the category topics than on the assessor-built topics.
Reflecting the rapid growth in the utilization of large test collections for information retrieval since the 1990s, extensive comparative experiments have been performed to explore the effectiveness of various retrieval models. However, most collections were intended for retrieving newspaper articles and technical abstracts. In this paper, we describe the process of producing a test collection for patent retrieval, the NTCIR-3 Patent Retrieval Collection, which includes two years of Japanese patent applications and 31 topics produced by professional patent searchers. We also report experimental results obtained by using this collection to re-examine the effectiveness of existing retrieval models in the context of patent retrieval. The relative superiority among existing retrieval models did not significantly differ depending on the document genre, that is, patents and newspaper articles. Issues related to patent retrieval are also discussed.
Collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, i.e. a database of available user preferences. In this paper, we describe a new model-based algorithm designed for this task, which is based on a generalization of probabilistic latent semantic analysis to continuous-valued response variables. More specifically, we assume that the observed user ratings can be modeled as a mixture of user communities or interest groups, where users may participate probabilistically in one or more groups. Each community is characterized by a Gaussian distribution on the normalized ratings for each item. The normalization of ratings is performed in a user-specific manner to account for variations in absolute shift and variance of ratings. Experiments on the EachMovie data set show that the proposed approach compares favorably with other collaborative filtering techniques.
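The user-specific normalization step can be illustrated with a plain per-user z-score. This is a minimal sketch under the assumption that each user's ratings are shifted by that user's mean and scaled by that user's standard deviation; the paper's exact normalization (for example, any smoothing of the variance) is not given in the abstract, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def normalize_user_ratings(ratings):
    """Map one user's ratings {item: rating} to z-scores.

    Assumption: plain mean/standard-deviation normalization. A user who
    gives every item the same rating gets all zeros.
    """
    values = list(ratings.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant raters
    return {item: (r - mu) / sigma for item, r in ratings.items()}


if __name__ == "__main__":
    # A harsh rater and a generous rater with the same relative preferences
    # end up with identical normalized ratings.
    print(normalize_user_ratings({"a": 1, "b": 2, "c": 3}))
    print(normalize_user_ratings({"a": 3, "b": 4, "c": 5}))
```

After this step, a mixture of Gaussians over the normalized ratings can be shared across users whose raw rating scales differ.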
Most existing clustering algorithms cluster highly related data objects such as Web pages and Web users separately. The interrelation among different types of data objects is either not considered, or represented by a static feature space and treated in the same ways as other attributes of the objects. In this paper, we propose a novel clustering approach for clustering multi-type interrelated data objects, ReCoM (Reinforcement Clustering of Multi-type Interrelated data objects). Under this approach, relationships among data objects are used to improve the cluster quality of interrelated data objects through an iterative reinforcement clustering process. At the same time, the link structure derived from relationships of the interrelated data objects is used to differentiate the importance of objects and the learned importance is also used in the clustering process to further improve the clustering results. Experimental results show that the proposed approach not only effectively overcomes the problem of data sparseness caused by the high dimensional relationship space but also significantly improves the clustering accuracy.
Content-based music genre classification is a fundamental component of music information retrieval systems and has been gaining importance and enjoying a growing amount of attention with the emergence of digital music on the Internet. Currently little work has been done on automatic music genre classification, and in addition, the reported classification accuracies are relatively low. This paper proposes a new feature extraction method for music genre classification, DWCHs. DWCHs stands for Daubechies Wavelet Coefficient Histograms. DWCHs capture the local and global information of music signals simultaneously by computing histograms on their Daubechies wavelet coefficients. Effectiveness of this new feature and of previously studied features are compared using various machine learning classification algorithms, including Support Vector Machines and Linear Discriminant Analysis. It is demonstrated that the use of DWCHs significantly improves the accuracy of music genre classification.
In a federated digital library system, it is too expensive to query every accessible library. Resource selection is the task of deciding which libraries a query should be routed to. Most existing resource selection algorithms compute a library ranking in a heuristic way. In contrast, the decision-theoretic framework (DTF) follows a different approach on a better theoretic foundation: It computes a selection which minimises the overall costs (e.g. retrieval quality, time, money) of the distributed retrieval. For estimating retrieval quality the recall-precision function is proposed. In this paper, we introduce two new methods: The first one computes the empirical distribution of the probabilities of relevance from a small library sample, and assumes it to be representative for the whole library. The second method assumes that the indexing weights follow a normal distribution, leading to a normal distribution for the document scores. Furthermore, we present the first evaluation of DTF by comparing this theoretical approach with the heuristic state-of-the-art system CORI; here we find that DTF outperforms CORI in most cases.
Prior research under a variety of conditions has shown the CORI algorithm to be one of the most effective resource selection algorithms, but the range of database sizes studied was not large. This paper shows that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases. A new resource selection algorithm is proposed that uses information about database sizes as well as database contents. We also show how to acquire database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Experiments demonstrate that the database size estimates are more accurate for large databases than estimates produced by a competing method; the new resource ranking algorithm is always at least as effective as the CORI algorithm; and the new algorithm results in better document rankings than the CORI algorithm.
We present SETS, an architecture for efficient search in peer-to-peer networks, building upon ideas drawn from machine learning and social network theory. The key idea is to arrange participating sites in a topic-segmented overlay topology in which most connections are short-distance, connecting pairs of sites with similar content. Topically focused sets of sites are then joined together into a single network by long-distance links. Queries are matched and routed to only the topically closest regions. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.
Previous research in novelty detection has focused on the task of finding novel material, given a set or stream of documents on a certain topic. This study investigates the more difficult two-part task defined by the TREC 2002 novelty track: given a topic and a group of documents relevant to that topic, 1) find the relevant sentences from the documents, and 2) find the novel sentences from the collection of relevant sentences. Our research shows that the former step appears to be the more difficult part of this task, and that the performance of novelty measures is very sensitive to the presence of non-relevant sentences.
This paper presents a novel domain-independent text segmentation method, which identifies the boundaries of topic changes in long text documents and/or text streams. The method consists of three components. As a preprocessing step, we eliminate the document-dependent stop words as well as the generic stop words before the sentence similarity is computed. This step assists in the discrimination of the sentence semantic information. Then the cohesion information of sentences in a document or a text stream is captured with a sentence-distance matrix with each entry corresponding to the similarity between a sentence pair. The distance matrix can be represented with a gray-scale image. Thus, a text segmentation problem is converted into an image segmentation problem. We apply the anisotropic diffusion technique to the image representation of the distance matrix to enhance the semantic cohesion of sentence topical groups as well as sharpen topical boundaries. Finally, the dynamic programming technique is adapted to find the optimal topical boundaries and provide a zoom-in and zoom-out mechanism for topic access by segmenting text into variable numbers of sentence topical groups. Our approach involves no domain-specific training, and it can be applied to texts in a variety of domains. The experimental results show that our approach is effective in text segmentation and outperforms several state-of-the-art methods.
One of the major problems in question answering (QA) is that the queries are either too brief or often do not contain most relevant terms in the target corpus. In order to overcome this problem, our earlier work integrates external knowledge extracted from the Web and WordNet to perform Event-based QA on the TREC-11 task. This paper extends our approach to perform event-based QA by uncovering the structure within the external knowledge. The knowledge structure loosely models different facets of QA events, and is used in conjunction with successive constraint relaxation algorithm to achieve effective QA. Our results obtained on TREC-11 QA corpus indicate that the new approach is more effective
We present a new method and system for performing the New Event Detection task, i.e., in one or multiple streams of news stories, all stories on a previously unseen (new) event are marked. The method is based on an incremental TF-IDF model. Our extensions include: generation of source-specific models, similarity score normalization based on document-specific averages, similarity score normalization based on source-pair specific averages, term reweighting based on inverse event frequencies, and segmentation of the documents. We also report on extensions that did not improve results. The system performs very well on TDT3 and TDT4 test data and scored second in the TDT-2002 evaluation.
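The core loop of new event detection on an incremental TF-IDF model can be sketched as follows. This is a simplified illustration only: the class name, the whitespace tokenizer, the particular IDF formula, and the threshold value are assumptions, and none of the paper's extensions (source-specific models, score normalization, term reweighting, segmentation) are included.

```python
import math
from collections import Counter


class NewEventDetector:
    """Flags a story as new if it is dissimilar to every earlier story."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold  # assumed value, would need tuning
        self.df = Counter()         # document frequencies, updated per story
        self.n_docs = 0
        self.stories = []           # raw term counts of earlier stories

    def _vector(self, counts):
        # TF-IDF weights using the statistics seen so far (assumed IDF form).
        return {t: tf * math.log((self.n_docs + 1) / (self.df[t] + 0.5))
                for t, tf in counts.items()}

    @staticmethod
    def _cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def is_new(self, text):
        counts = Counter(text.lower().split())
        # Incremental update: fold the incoming story into the model first.
        self.n_docs += 1
        self.df.update(counts.keys())
        vec = self._vector(counts)
        best = max((self._cosine(vec, self._vector(old)) for old in self.stories),
                   default=0.0)
        self.stories.append(counts)
        return best < self.threshold


if __name__ == "__main__":
    detector = NewEventDetector()
    for story in ["earthquake hits city",
                  "earthquake hits city again",
                  "election results announced"]:
        print(story, "->", detector.is_new(story))
```

Earlier stories are re-weighted with the current document frequencies on each comparison, which keeps the sketch short at the cost of efficiency.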
Structured methods for query term replacement rely on separate estimates of term frequency and document frequency to compute a weight for each query term. This paper reviews prior work on structured query techniques and introduces three new variants that leverage estimates of replacement probabilities. Statistically significant improvements in retrieval effectiveness are demonstrated for cross-language retrieval and for retrieval based on optical character recognition when replacement probabilities are used to estimate both term frequency and document frequency.
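One way to read "replacement probabilities are used to estimate both term frequency and document frequency" is as a probability-weighted sum over a query term's replacements (for example, its translations or OCR variants). The sketch below illustrates that reading only; the function name and the exact weighting are assumptions and may not match any of the three variants the paper introduces.

```python
def weighted_tf_df(replacements, doc_tf, collection_df):
    """Estimate TF and DF for one query term from its replacements.

    replacements  : {replacement_term: probability} (assumed to be given)
    doc_tf        : {term: count} for the document being scored
    collection_df : {term: number of documents containing the term}
    Returns (estimated_tf, estimated_df) for the query term.
    """
    tf = sum(p * doc_tf.get(t, 0) for t, p in replacements.items())
    df = sum(p * collection_df.get(t, 0) for t, p in replacements.items())
    return tf, df


if __name__ == "__main__":
    # Hypothetical numbers: an English query term with two Spanish replacements.
    tf, df = weighted_tf_df({"casa": 0.75, "hogar": 0.25},
                            doc_tf={"casa": 2, "hogar": 4},
                            collection_df={"casa": 100, "hogar": 20})
    print(tf, df)
```

The two estimates can then be plugged into whatever term-weighting formula the retrieval system already uses.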
For cross language information retrieval (CLIR) based on bilingual translation dictionaries, good performance depends upon lexical coverage in the dictionary. This is especially true for languages possessing few inter-language cognates, such as between Japanese and English. In this paper, we describe a method for automatically creating and validating candidate Japanese transliterated terms of English words. A phonetic English dictionary and a set of probabilistic mapping rules are used for automatically generating transliteration candidates. A monolingual Japanese corpus is then used for automatically validating the transliterated terms. We evaluate the usage of the extracted English-Japanese transliteration pairs with Japanese to English retrieval experiments over the CLEF bilingual test collections. The use of our automatically derived extension to a bilingual translation dictionary improves average precision, both before and after pseudo-relevance feedback, with gains
An empirical study investigates the relationship between the perplexity of an aspect-based language model and the corresponding information retrieval performance. On the corpora considered, the perplexity of the language model has a systematic relationship with the achievable precision-recall performance, though the relationship is not statistically significant.
In this paper, we propose a new approach for topic distillation on the World Wide Web. Topic distillation is the task of finding quality documents related to the user's query topic. Our approach is based on Bharat's topic distillation algorithm [1]. We analyze the hyperlink graph structure using a hierarchy concept tree to solve the mixed hubs problem, which also remains in Bharat's algorithm. To assign better weights to those hyperlinks in a document that point to relevant documents, we use content analysis to find the relationship between documents connected by hyperlinks and weight the hyperlinks according to that relationship. We evaluated this algorithm using 50 topics on the WT10g corpus and obtained improved results.
Information retrieval system evaluation is complicated by the need for manually assessed relevance judgments. Large manually-built directories on the web open the door to new evaluation procedures. By assuming that web pages are the known relevant items for queries that exactly match their title, we use the ODP (Open Directory Project) and Looksmart directories for system evaluation. We test our approach with a sample from a log of ten million web queries and show that such an evaluation is unbiased in terms of the directory used, stable with respect to the query set selected, and correlated with a reasonably large manual evaluation.
In this poster, we incorporate user query history, as context information, to improve the retrieval performance in interactive retrieval. Experiments using the TREC data show that incorporating such context information indeed consistently improves the retrieval performance in both average precision and precision at 20 documents.
The empirical investigation of the effectiveness of information retrieval (IR) systems requires a test collection, a set of query topics, and a set of relevance judgments made by human assessors for each query. Previous experiments show that differences in human relevance assessments do not affect the relative performance of retrieval systems. Based on this observation, we propose and evaluate a new approach to replace the human relevance judgments by an automatic method. Ranking of retrieval systems with our methodology correlates positively and significantly with that of human-based evaluations. In the experiments, we assume a Web-like imperfect environment: the indexing information for all documents is available for ranking, but some documents may not be available for retrieval. Such conditions can be due to document deletions or network problems. Our method of simulating imperfect environments can be used for Web search engine assessment and in estimating the effects of network conditions (e.g., network unreliability) on IR system performance.
Term weighting methods have been shown to give significant increases in information retrieval performance. The presence of pronominal references in documents reduces the term frequencies of associated words, with a consequent effect on term weights and information retrieval behaviour. This investigation explores the impact of broad-coverage automatic pronoun resolution on information retrieval performance. Results indicate that this approach has the potential to improve both precision at fixed cutoff levels and average precision.
Syntactic information potentially plays a much more important role in question answering than it does in information retrieval. Although many people have used syntactic evidence in Question Answering, there haven't been many detailed experiments reported in the literature. The aim of the experiment described in this paper is to study the impact of a particular approach for using syntactic information on question answering effectiveness. Our results indicate that a combination of syntactic information with heuristics for ranking potential answers can perform better than the ranking heuristics on their own.
The effectiveness of queries in information retrieval can be improved through query expansion. This technique automatically introduces additional query terms that are statistically likely to match documents on the intended topic. However, query expansion techniques rely on fixed parameters. Our investigation of the effect of varying these parameters shows that the strategy of using fixed values is questionable.
In this paper, we introduce the fractal summarization model based on the fractal theory. In fractal summarization, the important information is captured from the source text by exploring the hierarchical structure and salient features of the document. A condensed version of the document that is informatively close to the original is produced iteratively using the contractive transformation in the fractal theory. User evaluation has shown that fractal summarization outperforms traditional summarization.
We present a unified framework for simultaneously solving both the pooling problem (the construction of efficient document pools for the evaluation of retrieval systems) and metasearch (the fusion of ranked lists returned by retrieval systems in order to increase performance). The implementation is based on the Hedge algorithm for online learning, which has the advantage of convergence to bounded error rates approaching the performance of the best linear combination of the underlying systems. The choice of a loss function closely related to the average precision measure of system performance ensures that the judged document set performs well, both in constructing a metasearch list and as a pool for the accurate evaluation of retrieval systems. Our experimental results on TREC data demonstrate excellent performance in all measures -- evaluation of systems, retrieval of relevant documents, and generation of metasearch lists.
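The Hedge update itself is compact: each underlying system keeps a weight that is multiplied by beta raised to the loss it incurred in the current round. The sketch below shows only that generic update. The function name, the beta value, and the toy losses in [0, 1] are assumptions for illustration; the abstract's loss function related to average precision is not reproduced here.

```python
def hedge_weights(loss_rounds, beta=0.5):
    """Run the generic Hedge multiplicative update and return normalized weights.

    loss_rounds: list of rounds, each a list with one loss in [0, 1] per system.
    beta: update parameter in (0, 1); smaller values penalize losses harder.
    """
    if not loss_rounds:
        return []
    weights = [1.0] * len(loss_rounds[0])
    for losses in loss_rounds:
        # A system with zero loss keeps its weight; loss 1 multiplies it by beta.
        weights = [w * beta ** loss for w, loss in zip(weights, losses)]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical retrieval systems observed over three rounds.
toy_losses = [
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
final_weights = hedge_weights(toy_losses)
```

In this toy run the first system never incurs a loss, so it ends with the largest share of the weight, which is the behaviour a metasearch combination would exploit.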
Four statistical visual feature indexes are proposed: SLM (Shot Length Mean), the average length of each shot in a video; SLD (Shot Length Deviation), the standard deviation of shot lengths for a video; ONM (Object Number Mean), the average number of objects per frame of the video; and OND (Object Number Deviation), the standard deviation of the number of objects per frame across the video. Each of these indexes provides a unique perspective on video content. A novel video retrieval interface has been developed as a platform to examine our assumption that the new indexes facilitate some video retrieval tasks. Initial feedback is promising and formal experiments are planned.
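The four indexes are plain means and standard deviations, so they can be sketched in a few lines. The function names and input layout (a list of shot lengths and a list of per-frame object counts) are assumptions for this example, as is the use of the population standard deviation; the abstract does not say which variant is used.

```python
from statistics import mean, pstdev


def shot_length_indexes(shot_lengths):
    """Return (SLM, SLD): mean and standard deviation of shot lengths."""
    return mean(shot_lengths), pstdev(shot_lengths)


def object_number_indexes(objects_per_frame):
    """Return (ONM, OND): mean and standard deviation of objects per frame."""
    return mean(objects_per_frame), pstdev(objects_per_frame)


# Hypothetical video: eight shots (lengths in seconds) and two sampled frames.
slm, sld = shot_length_indexes([2, 4, 4, 4, 5, 5, 7, 9])
onm, ond = object_number_indexes([1, 3])
```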
This paper presents an approach to bilingual lexicon extraction from comparable corpora and evaluations on Cross-Language Information Retrieval. We explore a bi-directional extraction of bilingual terminology primarily from comparable corpora. A combined statistics-based and linguistics-based model to select best translation candidates to phrasal translation is proposed. Evaluations using a large test collection for Japanese-English revealed the proposed combination of bi-directional comparable corpora, bilingual dictionaries and transliteration, augmented with linguistics-based pruning to be highly effective in Cross-Language Information Retrieval.
Approaches to increase training examples to hopefully improve classification effectiveness are proposed in this work. The approaches were verified by use of two Chinese collections classified by two top-performing classifiers.
We propose a Bayesian extension to the ad-hoc Language Model. Many smoothed estimators used for the multinomial query model in ad-hoc Language Models (including Laplace and Bayes-smoothing) are approximations to the Bayesian predictive distribution. In this paper we derive the full predictive distribution in a form amenable to implementation by classical IR models, and then compare it to other currently used estimators. In our experiments the proposed model outperforms Bayes-smoothing, and its combination with linear interpolation smoothing outperforms all other estimators.
This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show annotation for user-defined concepts both in and outside the pre-deployed set is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result shows speech-only models give performance comparable to our best video-only models for detecting visual concepts such as "outdoors", "face" and "cityscape".
Given the experimental nature of information retrieval, progress critically depends on analyzing the errors made by existing retrieval approaches and understanding their limitations. Our research explores various hypothesized reasons for hard topics in TREC-8 ad hoc task, and shows that the bad performance is partially due to the existence of highly distracting sub-collections that can dominate the overall performance.
Passage retrieval is an important component common to many question answering systems. Because most evaluations of question answering systems focus on end-to-end performance, comparison of common components becomes difficult. To address this shortcoming, we present a quantitative evaluation of various passage retrieval algorithms for question answering, implemented in a framework called Pauchok. We present three important findings: Boolean querying schemes perform well in the question answering task. The performance differences between various passage retrieval algorithms vary with the choice of document retriever, which suggests significant interactions between document retrieval and passage retrieval. The best algorithms in our evaluation employ density-based measures for scoring query terms. Our results reveal future directions for passage retrieval and question answering.
In this poster, we present a model of the flow of information among bioinformatics resources in the context of a specific scientific problem. Combining task analysis with traditional, qualitative research, we determined the extent to which the bioinformatics analysis process could be automated. The model represents a semi-automated process, involving fourteen distinct data processing steps, and forms the framework for an interface to bioinformatics information.
Stemming can improve retrieval accuracy, but stemmers are language-specific. Character n-gram tokenization achieves many of the benefits of stemming in a language independent way, but its use incurs a performance penalty. We demonstrate that selection of a single n-gram as a pseudo-stem for a word can be an effective and efficient language-neutral approach for some languages.
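To make the idea concrete, the sketch below generates overlapping character n-grams for a word and then keeps a single one as its pseudo-stem. The abstract does not state how the n-gram is selected; choosing the n-gram with the lowest document frequency is an assumption made here, and all function names are hypothetical.

```python
from collections import Counter


def char_ngrams(word, n):
    """Overlapping character n-grams; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_doc_freq(documents, n):
    """Count in how many documents each n-gram occurs."""
    doc_freq = Counter()
    for doc in documents:
        grams = set()
        for word in doc:
            grams.update(char_ngrams(word, n))
        doc_freq.update(grams)
    return doc_freq


def pseudo_stem(word, n, doc_freq):
    """Assumed rule: keep the word's rarest n-gram (ties go to the leftmost)."""
    return min(char_ngrams(word, n), key=lambda gram: doc_freq[gram])


docs = [["juggling", "walking"], ["juggler", "walker"], ["talking"]]
df4 = ngram_doc_freq(docs, 4)
stem = pseudo_stem("talking", 4, df4)
```

Indexing one n-gram per word instead of all of them keeps the index close to the size of a word-based index, which is where the efficiency gain would come from.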
Industry professionals and everyday users of the Internet have long accepted that due to both the size and growth of this ubiquitous repository, new tools are needed to assist with the finding and extraction of very specific resources relevant to a user's task. Previously, this definition of relevance has been based on the extremely generic matching between resources and query terms, but recently the emphasis is shifting towards a more personalised model based on the relevance of a particular resource for one specific user. We introduce a prototype, Fetch, which adopts this concept within an information-seeking environment specifically designed to provide users with the means to better describe a problem (s)he doesn't understand.
The World Wide Web contains a number of source code archives. Programs are usually classified into various categories within the archive by hand. We report on experiments for automatic classification of source code into these categories. We examined a number of factors that affect classification accuracy. Weighting features by expected entropy loss makes a significant improvement in classification accuracy. We show a Support Vector Machine can be trained to classify source code with a high degree of accuracy. We feel these results show promise for software reuse.
Use of semantic content is one of the major issues which needs to be addressed for improving image retrieval effectiveness. We present a new approach to classify images based on the combination of image processing techniques and hybrid neural networks. Multiple keywords are assigned to an image to represent its main contents, i.e. semantic content. Images are divided into a number of regions and colour and texture features are extracted. The first classifier, a self-organising map (SOM) clusters similar images based on the extracted features. Then, regions of the representative images of these clusters were labeled and used to train the second classifier, composed of several support vector machines (SVMs). Initial experiments on the accuracy of keyword assignment for a small vocabulary are reported.
Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior, therefore the perceived shortcomings of PLSI can be resolved and elucidated within the LDA framework.
Web search query logs contain traces of users' search modifications. One strategy users employ is deleting terms, presumably to obtain greater coverage. It is useful to model and automate term deletion when arbitrary searches are conjunctively matched against a small hand constructed collection, such as a hand-built hierarchy, or collection of high-quality pages matched with key phrases. Queries with no matches can have words deleted till a match is obtained. We provide algorithms which perform substantially better than the baseline in predicting which word should be deleted from a reformulated query, for increasing query coverage in the context of web search on small high-quality collections.
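The deletion loop described above can be sketched as follows: the query is matched conjunctively against a small collection, and one word is removed at a time until something matches. The rule for picking the word to delete (the one that occurs in the fewest documents) is a hypothetical stand-in, not the algorithms or the baseline evaluated in the paper, and the function names and toy collection are invented for the example.

```python
from collections import Counter


def conjunctive_matches(query_terms, collection):
    """Return ids of documents that contain every query term."""
    wanted = set(query_terms)
    return [doc_id for doc_id, terms in collection.items() if wanted <= terms]


def delete_until_match(query_terms, collection):
    """Delete one term at a time until the conjunctive query has a match."""
    doc_freq = Counter()
    for terms in collection.values():
        doc_freq.update(terms)

    remaining = list(query_terms)
    while remaining:
        hits = conjunctive_matches(remaining, collection)
        if hits:
            return remaining, hits
        # Assumed heuristic: drop the term found in the fewest documents.
        rarest = min(remaining, key=lambda term: doc_freq[term])
        remaining.remove(rarest)
    return [], []


# Hypothetical hand-built collection keyed by document id.
small_collection = {
    "d1": {"cheap", "flights", "paris"},
    "d2": {"hotel", "paris"},
}
result = delete_until_match(
    ["cheap", "flights", "paris", "tomorrow"], small_collection
)
```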
In this poster, we describe an experiment exploring the effectiveness of a pen based text input device for use in query construction. Standard TREC queries were written, recognised, and subsequently retrieved upon. Comparisons between retrieval effectiveness based on the recognised writing and a typed text baseline were made. On average, effectiveness was 75% of the baseline. Other statistics on the quality and nature of recognition are also reported.
This short paper presents a light weight technique to merge results lists obtained from querying different databases. The motivation for such a technique is a general purpose search engine for Palm-OS based PDAs.
This paper describes an automatic content indexing system for news programs, with a special emphasis on its segmentation process. The process can successfully segment an entire news program into topic-centered news stories; the primary tool is a linguistic topic segmentation algorithm. Experiments show that the resulting speech-based segments are fairly accurate, and scene change points supplied by an external video processor can be of help in improving segmentation effectiveness.
In general terms, the evaluation of a summary depends on how close it is to the chief points in the source text. This raises the question of what the chief points in the source text are and how this information is itself used in identifying the source text. This is crucially important when we discuss automatic evaluation of summaries. So the question is what the main points of the source text are. Typically, these would be centred on a nucleus of keywords. However, the salience and frequency of these keywords, and the relationship of the text with other texts in the collection, are perhaps also important. Text categorisation using neural networks explicates these points well and also has a practical impact.
This paper introduces a rule-based, context-dependent word clustering method, with the rules derived from various domain databases and the word text orthographic properties. Besides significant dimensionality reduction, our experiments show that such rule-based word clustering improves by 8 the overall accuracy of extracting bibliographic fields from references, and by 18.32 on average the class-specific performance on the line classification of document headers.
A novel Hardware Assisted Top-Doc (HAT) component is disclosed. HAT is an optimized content indexing device based on a modified inverted index structure. HAT accommodates patterns of different lengths and supports a varied posting list versus term count feature sustaining high reusability and efficiency. The developed component can be used either as an internal slave component or as an external co-processor and is efficient in resource demands as the component controllers take only a minimal percentage of the target device space leaving the majority of the space to term and posting entries. A Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) is used to model the HAT system.
Recent work has demonstrated that the assessment of pairwise object similarity can be approached in an axiomatic manner using information theory. We extend this concept specifically to document similarity and test the effectiveness of an information-theoretic measure for pairwise document similarity. We adapt query retrieval to rate the quality of document similarity measures and demonstrate that our proposed information-theoretic measure for document similarity yields statistically significant improvements over other popular measures of similarity.
We describe an efficient, robust method for selecting and optimizing terms for a classification or filtering task. Terms are extracted from positive examples in training data based on several alternative term-selection algorithms, then combined additively after a simple term-score normalization step to produce a merged and ranked master term vector. The score threshold for the master vector is set via beta-gamma regulation over all the available training data. The process avoids parameter calibrations and protracted training. It also results in compact profiles for run-time evaluation of test (new) documents. Results on TREC-2002 filtering-task datasets demonstrate substantial improvements over TREC-median results and rival both idealized IR-based results and optimized (and expensive) SVM-based classifiers in general effectiveness.
This poster reports upon the ongoing efforts being made to establish TREC-like and other comprehensive evaluation paradigms within the Music IR (MIR) and Music Digital Library (MDL) research communities. The proposed research tasks are based upon expert opinion garnered from members of the Information Retrieval (IR), MDL and MIR communities with regard to the construction and implementation of scientifically valid evaluation frameworks.
Hierarchies provide a means of organizing, summarizing and accessing information. We describe a method for automatically generating hierarchies from small collections of text, and then apply this technique to summarizing the documents retrieved by a search engine.
This demonstration will describe how Timber, a native XML database system, has been extended with the capability to answer XML-style structured queries (e.g., XQuery) with embedded IR-style keyword-based non-boolean conditions. With the original structured query processing engine and the IR extensions built into the system, Timber is well suited for efficiently and effectively processing queries with both structural and textual content constraints.
We present a new tool for gathering textual information according to a query (texts) on arbitrary web sites specified by an information-seeking user. This tool is helpful in any knowledge-intensive area. Its technology is based on the vector space model with optimized feature definition.
We present eArchivarius, an interactive system for accessing collections of electronic mail. The system combines search, clustering visualization, and time-based visualization of email messages and the people who sent or received the messages.
Current Web search engines generally impose link analysis-based re-ranking on web-page retrieval. However, the same techniques, when applied directly to small web search such as intranet and site search, cannot achieve the same performance because their link structures are different from the global Web. In this paper, we propose an approach to constructing implicit links by mining users' access patterns, and then apply a modified PageRank algorithm to re-rank web-pages for small web search. Our experimental results indicate that the
The heterogeneous Web exacerbates IR problems, and short user queries make them worse. The contents of web documents alone are not enough to find good answer documents. Link information and URL information compensate for the insufficiencies of content information. However, a static combination of multiple sources of evidence may lower retrieval performance. Different strategies are needed to find target documents depending on the query type. User queries can be classified into three categories: the topic relevance task, the homepage finding task, and the service finding task. In this paper, a user query classification scheme is proposed. The scheme uses the difference of distribution, mutual information, the usage rate as anchor texts, and POS information for the classification. After classifying a user query, we apply different algorithms and information to obtain better results. For the topic relevance task, we emphasize the content information; for the homepage finding task, we emphasize the link information and the URL information. We obtained the best performance when our proposed classification method was used with the OKAPI scoring algorithm.
Most information retrieval technologies are designed to facilitate information discovery. However, much knowledge work involves finding and re-using previously seen information. We describe the design and evaluation of a system, called Stuff I've Seen (SIS), that facilitates information re-use. This is accomplished in two ways. First, the system provides a unified index of information that a person has seen, whether it was seen as email, web page, document, appointment, etc. Second, because the information has been seen before, rich contextual cues can be used in the search interface. The system has been used internally by more than 230 employees. We report on both qualitative and quantitative aspects of system use. Initial findings show that time and people are important retrieval cues. Users find information more easily using SIS, and use other search tools less frequently after installation.
Although interactive query reformulation has been actively studied in the laboratory, little is known about the actual behavior of web searchers who are offered terminological feedback along with their search results. We analyze log sessions for two groups of users interacting with variants of the AltaVista search engine -- a baseline group given no terminological feedback and a feedback group to whom twelve refinement terms are offered along with the search results. We examine uptake, refinement effectiveness, conditions of use, and refinement type preferences. Although our measure of overall session "success" shows no difference between outcomes for the two groups, we find evidence that a subset of those users presented with terminological feedback do make effective use of it on a continuing basis.
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor, ridge regression, linear least square fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation on the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we are able to obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported on, as concrete examples.
URL: http://www.interaction-design.org/references/conferences/proceedings_of_the_26th_annual_international_acm_sigir_conference_on_research_and_development_in_information_retrieval.html