It is easy for me to access this knowledge pool, I want it to grow so that I can grow along

Last 3 Donors


Support us

Funding progress for 2010:

Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval


 
Time and place:

2002
Series:
Conf. description:
SIGIR is the major international forum for the presentation of new research results and the demonstration of new systems and techniques in the field of information retrieval.
Help us!
Do you know when the next conference is? If yes, please add it to the calendar!
Publisher:
EDIT

References from this conference (2002)

The following articles are from "Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval":

 what's this?

Articles

p. 1

Rijsbergen, Keith van (2002): Landmarks in information retrieval: the message out of the bottle. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 1. Available online

For many years I have wanted to give a talk like this: look back on our subject, identify the high (and perhaps low) points, consider what worked, what did not work, and speculate a little about the future. Now that I at last have the opportunity to give such a talk the realisation has dawned just how difficult it is to do justice to the topic. The only way out of this difficulty for me is to emphasise that this is a personal account, based on my involvement with the field since 1968, and that errors of omission and commission are not deliberate but simply due to lack of knowledge and time on my part.

Copyrights may apply

p. 105-112

Amini, Massih-Reza and Gallinari, Patrick (2002): The use of unlabeled data to improve supervised learning for text summarization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 105-112. Available online

With the huge amount of information available electronically, there is an increasing demand for automatic text summarization systems. The use of machine learning techniques for this task allows one to adapt summaries to the user needs and to the corpus characteristics. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches attempt to generate summaries by extracting sentence segments and adopt the supervised learning paradigm which requires to label documents at the text span level. This is a costly process, which puts strong limitations on the applicability of these methods. We investigate here the use of semi-supervised algorithms for summarization. These techniques make use of few labeled data together with a larger amount of unlabeled data. We propose new semi-supervised algorithms for training classification models for text summarization. We analyze their performances on two data sets -- the Reuters news-wire corpus and the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC. We perform comparisons with a baseline -- non learning -- system, and a reference trainable summarizer system.

Copyrights may apply

p. 11-18

Park, Seung-Taek, Pennock, David M., Giles, Clyde Lee and Krovetz, Robert (2002): Analysis of lexical signatures for finding lost or related documents. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 11-18. Available online

A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.

Copyrights may apply

p. 113-120

Zha, Hongyuan (2002): Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 113-120. Available online

A novel method for simultaneous keyphrase extraction and generic text summarization is proposed by modeling text documents as weighted undirected and weighted bipartite graphs. Spectral graph clustering algorithms are useed for partitioning sentences of the documents into topical groups with sentence link priors being exploited to enhance clustering quality. Within each topical group, saliency scores for keyphrases and sentences are generated based on a mutual reinforcement principle. The keyphrases and sentences are then ranked according to their saliency scores and selected for inclusion in the top keyphrase list and summaries of the document. The idea of building a hierarchy of summaries for documents capturing different levels of granularity is also briefly discussed. Our method is illustrated using several examples from news articles, news broadcast transcripts and web documents.

Copyrights may apply

p. 121-128

Hardy, Hilda, Shimizu, Nobuyuki, Strzalkowski, Tomek, Ting, Liu, Zhang, Xinyang and Wise, G. Bowden (2002): Cross-document summarization by concept classification. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 121-128. Available online

In this paper we describe a Cross Document Summarizer XDoX designed specifically to summarize large document sets (50-500 documents and more). Such sets of documents are typically obtained from routing or filtering systems run against a continuous stream of data, such as a newswire. XDoX works by identifying the most salient themes within the set (at the granularity level that is regulated by the user) and composing an extraction summary, which reflects these main themes. In the current version, XDoX is not optimized to produce a summary based on a few unrelated documents; indeed, such summaries are best obtained simply by concatenating summaries of individual documents. We show examples of summaries obtained in our tests as well as from our participation in the first Document Understanding Conference (DUC).

Copyrights may apply

p. 129-136

Slonim, Noam, Friedman, Nir and Tishby, Naftali (2002): Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 129-136. Available online

We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information with time and space complexity typically linear in the data size. information, as required by the original IB principle. Moreover, the time and space complexity are significantly improved. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium size corpora, the sIB is found to be consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading cluster's recall to gain higher precision, and show how this approach can extract clusters which match the existing topics of the corpus almost perfectly.

Copyrights may apply

p. 137-144

Kawatani, Takahiko (2002): Topic difference factor extraction between two document sets and its application to text categorization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 137-144. Available online

To improve performance in text categorization, it is important to extract distinctive features for each class. This paper proposes topic difference factor analysis (TDFA) as a method to extract projection axes that reflect topic differences between two document sets. Suppose all sentence vectors that compose each document are projected onto projection axes. TDFA obtains the axes that maximize the ratio between the document sets as to the sum of squared projections by solving a generalized eigenvalue problem. The axes are called topic difference factors (TDF's). By applying TDFA to the document set that belongs to a given class and a set of documents that is misclassified as belonging to that class by an existent classifier, we can obtain features that take large values in the given class but small ones in other classes, as well as features that take large values in other classes but small ones in the given class. A classifier was constructed applying the above features to complement the kNN classifier. As the results, the micro averaged F1 measure for

Copyrights may apply

p. 145-150

Lee, Yong-Bae and Myaeng, Sung Hyon (2002): Text genre classification with genre-revealing and subject-revealing features. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 145-150. Available online

Subject or prepositional content has been the focus of most classification research. Genre or style, on the other hand, is a different and important property of text, and automatic text genre classification is becoming important for classification and retrieval purposes as well as for some natural language processing research. In this paper, we present a method for automatic genre classification that is based on statistically selected features obtained from both subject-classified and genre classified training data. The experimental results show that the proposed method outperforms a direct application of a statistical learner often used for subject classification. We also observe that the deviation formula and discrimination formula using document frequency ratios also work as expected. We conjecture that this dual feature set approach can be generalized to improve the performance of subject classification as well.

Copyrights may apply

p. 151-158

Crammer, Koby and Singer, Yoram (2002): A new family of online algorithms for category ranking. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 151-158. Available online

We describe a new family of topic-ranking algorithms for multi-labeled documents. The motivation for the algorithms stems from recent advances in online learning algorithms. The algorithms we present are simple to implement and are time and memory efficient. We evaluate the algorithms on the Reuters-21578 corpus and the new corpus released by Reuters in 2000. On both corpora the algorithms we present outperform adaptations to topic-ranking of Rocchio's algorithm and the Perceptron algorithm. We also outline the formal analysis of the algorithm in the mistake bound model. To our knowledge, this work is the first to report performance results with the entire new Reuters corpus.

Copyrights may apply

p. 167-174

Federico, Marcello and Bertoldi, Nicola (2002): Statistical cross-language information retrieval using n-best query translations. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 167-174. Available online

This paper presents a novel statistical model for cross-language information retrieval. Given a written query in the source language, documents in the target language are ranked by integrating probabilities computed by two statistical models: a query-translation model, which generates most probable term-by-term translations of the query, and a query-document model, which evaluates the likelihood of each document and translation. Integration of the two scores is performed over the set of N most probable translations of the query. Experimental results with values N=1, 5, 10 are presented on the Italian-English bilingual track data used in the CLEF 2000 and 2001 evaluation campaigns.

Copyrights may apply

p. 175-182

Lavrenko, Victor, Choquette, Martin and Croft, W. Bruce (2002): Cross-lingual relevance models. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 175-182. Available online

We propose a formal model of Cross-Language Information Retrieval that does not rely on either query translation or document translation. Our approach leverages recent advances in language modeling to directly estimate an accurate topic model in the target language, starting with a query in the source language. The model integrates popular techniques of disambiguation and query expansion in a unified formal framework. We describe how the topic model can be estimated with either a parallel corpus or a dictionary. We test the framework by constructing Chinese topic models from English queries and using them in the CLIR task of TREC9. The model achieves performance around 95% of the strong mono-lingual baseline in terms of average precision. In initial precision, our

Copyrights may apply

p. 183-190

Gao, Jianfeng, Zhou, Ming, Nie, Jian-Yun, He, Hongzhao and Chen, Weijun (2002): Resolving query translation ambiguity using a decaying co-occurrence model and syntactic dependence relations. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 183-190. Available online

Bilingual dictionaries have been commonly used for query translation in cross-language information retrieval (CLIR). However, we are faced with the problem of translation selection. Several recent studies suggested the utilization of term co-occurrences in this selection. This paper presents two extensions to improve them. First, we extend the basic co-occurrence model by adding a decaying factor that decreases the mutual information when the distance between the terms increases. Second, we incorporate a triple translation model, in which syntactic dependence relations (represented as triples) are integrated. Our evaluation on translation accuracy shows that translating triples as units is more precise than a word-by-word translation. Our CLIR experiments show that the addition of the decaying factor leads to substantial improvements of the basic co-occurrence model; and the triple translation model brings further improvements.

Copyrights may apply

p. 19-26

Si, Luo and Callan, Jamie (2002): Using sampled data and regression to merge search engine results. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 19-26. Available online

This paper addresses the problem of merging results obtained from different databases and search engines in a distributed information retrieval environment. The prior research on this problem either assumed the exchange of statistics necessary for normalizing scores (cooperative solutions) or is heuristic. Both approaches have disadvantages. We show that the problem in uncooperative environments is simpler when viewed as a component of a distributed IR system that uses query-based sampling to create resource descriptions. Documents sampled for creating resource descriptions can also be used to create a sample centralized index, and this index is a source of training data for adaptive results merging algorithms. A variety of experiments demonstrate that this new approach is more effective than a well-known alternative, and that it allows query-by-query tuning of the results merging function.

Copyrights may apply

p. 191-198

Liu, Xin, Gong, Yihong, Xu, Wei and Zhu, Shenghuo (2002): Document clustering with cluster refinement and model selection capabilities. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 191-198. Available online

In this paper, we propose a document clustering method that strives to achieve: (1) a high accuracy of document clustering, and (2) the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability). To accurately cluster the given document corpus, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is iteratively applied until the convergence of document clusters. On the other hand, the model selection capability is achieved by introducing randomness in the cluster initialization stage, and then discovering a value C for the number of clusters N by which running the document clustering process for a fixed number of times yields sufficiently similar results. Performance evaluations exhibit clear superiority of the proposed method with its improved document clustering and model selection accuracies. The evaluations also demonstrate how each feature as well as the cluster refinement process contribute to the document clustering accuracy.

Copyrights may apply

p. 199-206

Pantel, Patrick and Lin, Dekang (2002): Document clustering with committees. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 199-206. Available online

Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.

Copyrights may apply

p. 207-214

Bennett, Paul N., Dumais, Susan and Horvitz, Eric (2002): Probabilistic combination of text classifiers using reliability indicators: models and results. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 207-214. Available online

The intuition that different text classifiers behave in qualitatively different ways has long motivated attempts to build a better metaclassifier via some combination of classifiers. We introduce a probabilistic method for combining classifiers that considers the context-sensitive reliabilities of contributing classifiers. The method harnesses reliability indicators -- variables that provide a valuable signal about the performance of classifiers in different situations. We provide background, present procedures for building metaclassifiers that take into consideration both reliability indicators and classifier outputs, and review a set of comparative studies undertaken to evaluate the methodology.

Copyrights may apply

p. 215-221

Bahle, Dirk, Williams, Hugh E. and Zobel, Justin (2002): Efficient phrase querying with an auxiliary index. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 215-221. Available online

Search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this paper we consider how phrase queries can be efficiently supported with low disk overheads. Previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. We propose a combination of nextword indexes with inverted files as a solution to this problem. Our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. Further time savings are available with only slight increases in disk requirements.

Copyrights may apply

p. 230-237

Possas, Bruno, Ziviani, Nivio, Meira, Wagner and Ribeiro-Neto, Berthier A. (2002): Set-based model: a new approach for information retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 230-237. Available online

The objective of this paper is to present a new technique for computing term weights for index terms, which leads to a new ranking mechanism, referred to as set-based model. The components in our model are no longer terms, but termsets. The novelty is that we compute term weights using a data mining technique called association rules, which is time efficient and yet yields nice improvements in retrieval effectiveness. The set-based model function for computing the similarity between a document and a query considers the termset frequency in the document and its scarcity in the document collection. Experimental results show that our model improves the average precision of the answer set for all three collections evaluated. For the TReC-3 collection, our set-based model led to a gain, relative to the standard vector space model, of 37% in average precision curves and of 57% in average precision for the top 10 documents. Like the vector space model, the set-based model has time complexity that is linear in the number of documents in the collection.

Copyrights may apply

p. 238-245

Canny, John (2002): Collaborative filtering with privacy via factor analysis. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 238-245. Available online

Collaborative filtering (CF) is valuable in e-commerce, and for direct recommendations for music, movies, news etc. But today's systems have several disadvantages, including privacy risks. As we move toward ubiquitous computing, there is a great potential for individuals to share all kinds of information about places and things to do, see and buy, but the privacy risks are severe. In this paper we describe a new method for collaborative filtering which protects the privacy of individual data. The method is based on a probabilistic factor analysis model. Privacy protection is provided by a peer-to-peer protocol which is described elsewhere, but outlined in this paper. The factor analysis approach handles missing data without requiring default values for them. We give several experiments that suggest that this is most accurate method for CF to date. The new algorithm has other advantages in speed and storage over previous algorithms. Finally, we suggest applications of the approach to other kinds of statistical analyses of survey or questionnaire data.

Copyrights may apply

p. 246-252

Coster, Rickard and Svensson, Martin (2002): Inverted file search algorithms for collaborative filtering. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 246-252. Available online

This paper explores the possibility of using a disk based inverted file structure for collaborative filtering. Our hypothesis is that this allows for faster calculation of predictions and also that early termination heuristics may be used to further speed up the filtering process and perhaps even improve the quality of the predictions. In an experiment on the EachMovie dataset this was tested. Our results indicate that searching the inverted file structure is many times faster than general in-memory vector search, even for very large profiles. The Continue termination heuristics produces the best ranked predictions in our experiments, and Quit is the top performer in terms of speed.

Copyrights may apply

p. 253-260

Schein, Andrew I., Popescul, Alexandrin, Ungar, Lyle H. and Pennock, David M. (2002): Methods and metrics for cold-start recommendations. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 253-260. Available online

We have developed a method for recommending items that combines content and collaborative data under a single probabilistic framework. We benchmark our algorithm against a naive Bayes classifier on the cold-start problem, where we wish to recommend items that no one in the community has yet rated. We systematically explore three testing methodologies using a publicly available data set, and explain how these methods apply to specific real-world applications. We advocate heuristic recommenders when benchmarking to give competent baseline performance. We introduce a new performance metric, the CROC curve, and demonstrate empirically that the various components of our testing strategy combine to obtain deeper understanding of the performance characteristics of recommender systems. Though the emphasis of our testing is on cold-start recommending, our methods for recommending and evaluation are general.

Copyrights may apply

p. 261-268

Darwish, Kareem and Oard, Douglas W. (2002): Term selection for searching printed Arabic. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 261-268. Available online

Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.

Copyrights may apply

p. 269-274

Xu, Jinxi, Fraser, Alexander and Weischedel, Ralph (2002): Empirical studies in strategies for Arabic retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 269-274. Available online

This work evaluates a few search strategies for Arabic monolingual and cross-lingual retrieval, using the TREC Arabic corpus as the test-bed. The release by NIST in 2001 of an Arabic corpus of nearly 400k documents with both monolingual and cross-lingual queries and relevance judgments has been a new enabler for empirical studies. Experimental results show that spelling normalization and stemming can significantly improve Arabic monolingual retrieval. Character tri-grams from stems improved retrieval modestly on the test corpus, but the improvement is not statistically significant. To further improve retrieval, we propose a novel thesaurus-based technique. Different from existing approaches to thesaurus-based retrieval, ours formulates word synonyms as probabilistic term translations that can be automatically derived from a parallel corpus. Retrieval results show that the thesaurus can significantly improve Arabic monolingual retrieval. For cross-lingual retrieval (CLIR), we found that spelling normalization and stemming have little impact.

Copyrights may apply

p. 27-34

Kraaij, Wessel, Westerveld, Thijs and Hiemstra, Djoerd (2002): The Importance of Prior Probabilities for Entry Page Search. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 27-34. Available online

An important class of searches on the world-wide-web has the goal to find an entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links and URL form. Especially the URL form proved to be a good predictor. Using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability.

Copyrights may apply

p. 275-282

Larkey, Leah S., Ballesteros, Lisa and Connell, Margaret E. (2002): Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 275-282. Available online

Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.

Copyrights may apply

p. 283-290

Carmel, David, Farchi, Eitan, Petruschka, Yael and Soffer, Aya (2002): Automatic query refinement using lexical affinities with maximal information gain. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 283-290. Available online

This work describes an automatic query refinement technique, which focuses on improving precision of the top ranked documents. The terms used for refinement are lexical affinities (LAs), pairs of closely related words which contain exactly one of the original query terms. Adding these terms to the query is equivalent to re-ranking search results, thus, precision is improved while recall is preserved. We describe a novel method that selects the most "informative" LAs for refinement, namely, those LAs that best separate relevant documents from irrelevant documents in the set of results. The information gain of candidate LAs is determined using unsupervised estimation that is based on the scoring function of the search engine. This method is thus fully automatic and its quality depends on the quality of the scoring function. Experiments we conducted with TREC data clearly show a significant improvement in the precision of the top ranked documents.

Copyrights may apply

p. 291-298

Dumais, Susan, Banko, Michele, Brill, Eric, Lin, Jimmy and Ng, Andrew (2002): Web question answering: is more always better?. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 291-298. Available online

This paper describes a question answering system that is designed to capitalize on the tremendous amount of data that is now available online. Most question answering systems use a wide variety of linguistic resources. We focus instead on the redundancy available in large corpora as an important resource. We use this redundancy to simplify the query rewrites that we need to use, and to support answer mining from returned snippets. Our system performs quite well given the simplicity of the techniques being utilized. Experimental results show that question answering accuracy can be greatly improved by analyzing more and more matching passages. Simple passage ranking and n-gram extraction techniques work well in our system making it efficient to use with many backend retrieval engines.

Copyrights may apply

p. 299-306

Cronen-Townsend, Steve, Zhou, Yun and Croft, W. Bruce (2002): Predicting query performance. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 299-306. Available online

We develop a method for predicting query performance by computing the relative entropy between a query language model and the corresponding collection language model. The resulting clarity score measures the coherence of the language usage in documents whose models are likely to generate the query. We suggest that clarity scores measure the ambiguity of a query with respect to a collection of documents and show that they correlate positively with average precision in a variety of TREC test sets. Thus, the clarity score may be used to identify ineffective queries, on average, without relevance information. We develop an algorithm for automatically setting the clarity score threshold between predicted poorly-performing queries and acceptable queries and validate it using TREC data. In particular, we compare the automatic thresholds to optimum thresholds and also check how frequently results as good are achieved in sampling experiments that randomly assign queries to the two classes.

Copyrights may apply

p. 3-10

Anh, Vo Ngoc and Moffat, Alistair (2002): Impact transformation: effective and efficient web retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 3-10. Available online

We extend the applicability of impact transformation, which is a technique for adjusting the term weights assigned to documents so as to boost the effectiveness of retrieval when short queries are applied to large document collections. In conjunction with techniques called quantization and thresholding, impact transformation allows improved query execution rates compared to traditional vector-space similarity computations, as the number of arithmetic operations can be reduced. The transformation also facilitates a new dynamic query pruning heuristic. We give results based upon the trec web data that show the combination of these various techniques to yield highly competitive retrieval, in terms of both effectiveness and efficiency, for both short and long queries.

Copyrights may apply

p. 307-314

Allan, James and Raghavan, Hema (2002): Using part-of-speech patterns to reduce query ambiguity. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 307-314. Available online

Query ambiguity is a generally recognized problem, particularly in Web environments where queries are commonly only one or two words in length. In this study, we explore one technique that finds commonly occurring patterns of parts of speech near a one-word query and allows them to be transformed into clarification questions. We use a technique derived from statistical language modeling to show that the clarification queries will reduce ambiguity much of the time, and often quite substantially.

Copyrights may apply

p. 315

Koskenniemi, Kimmo (2002): Is natural language an inconvenience or an opportunity for IR?. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 315. Available online

Natural language (NL) has evolved to facilitate human communication. It enables the speaker to make the listener's mind wander among her experiences and mental associations roughly according to the intentions of the speaker. The speaker and the listener usually share experiences and expectations, and they use mostly the same units and rules of a shared NL. Written language functions similarly, but in a less interactive way, with fewer possibilities for feedback. Both the symbols of NL (i.e. words or morphemes), and their arrangements are meaningful. Not with universal and precise meanings, but similar enough among different speakers and accurate enough for the communication mostly to succeed. NLs are mostly very large systems. Hundreds of thousands of words and infinitely many possible utterances. Even inflection alone might produce huge numbers of forms, e.g. more than ten thousand distinct forms out of every Finnish verb entry. NL processing (for IR or any other purpose) must cope with phenomena like (1) inflection and compounding, (2) synonymy, (3) polysemy, (4) ambiguity, (5) anaphora and (6) head-modifier relations among words and phrases. Language technology can neutralize much of the effect of these 'inconveniences' inherent with NL, but what kinds of advantages could NL have? * Redundant use of synonymous expressions can effectively identify new concepts. * Multilingual parallel documents may help in identifying their exact content. * NLs typically carry connotations, i.e. what is implied but not explicitly said (e.g. attitudes, politeness). * Vague associations are easy to express in NL, but not always in formal systems (e.g. "a few years ago there was an article about the rival of Yeltsin -- I don't remember his name but -- he then went over to some region in Siberia -- but what did the guy promise?") * Jokes and humor belong to NLs, not to formal systems. Are there any alternatives for NL? Not really, because any artificial and more precise formalisms fail to adapt to new concepts and they do not easily allow restructuring of previous ideas. One challenge for language technology is to find better solutions for the above 'inconveniences' in order to provide various IR, document classification, indexing and summarizing methods with more accurate and adequate input data. With more accurate input some of the more demanding tasks of IR can perhaps be solved.

Copyrights may apply

p. 316-323

Voorhees, Ellen M. and Buckley, Chris (2002): The effect of topic set size on retrieval experiment error. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 316-323. Available online

Retrieval mechanisms are frequently compared by computing the respective average scores for some effectiveness metric across a common set of information needs or topics, with researchers concluding one method is superior based on those averages. Since comparative retrieval system behavior is known to be highly variable across topics, good experimental design requires that a "sufficient" number of topics be used in the test. This paper uses TREC results to empirically derive error rates based on the number of topics used in a test and the observed difference in the average scores. The error rates quantify the likelihood that a different set of topics of the same size would lead to a different conclusion. We directly compute error rates for topic sets up to size 25, and extrapolate those rates for larger topic set sizes. The error rates found are larger than anticipated, indicating researchers need to take care when concluding one method is better than another, especially if few topics are used.

Copyrights may apply

p. 324-330

Sormunen, Eero (2002): Liberal relevance criteria of TREC: counting on negligible documents?. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 324-330. Available online

Most test collections (like TREC and CLEF) for experimental research in information retrieval apply binary relevance assessments. This paper introduces a four-point relevance scale and reports the findings of a project in which TREC-7 and TREC-8 document pools on 38 topics were reassessed. The goal of the reassessment was to build a subcollection of TREC for experiments on highly relevant documents and to learn about the assessment process as well as the characteristics of a multigraded relevance corpus. Relevance criteria were defined so that a distinction was made between documents rich in topical information (relevant and highly relevant documents) and poor in topical information (marginally relevant documents). It turned out that about 50% of documents assessed as relevant were regarded as marginal. The characteristics of the relevance corpus and lessons learned from the reassessment project are discussed. The need to develop more elaborated relevance assessment schemes is emphasized.

Copyrights may apply

p. 331-338

Shalev-Shwartz, Shai, Dubnov, Shlomo, Friedman, Nir and Singer, Yoram (2002): Robust temporal and spectral modeling for query By melody. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 331-338. Available online

Query by melody is the problem of retrieving musical performances from melodies. Retrieval of real performances is complicated due to the large number of variations in performing a melody and the presence of colored accompaniment noise. We describe a simple yet effective probabilistic model for this task. We describe a generative model that is rich enough to capture the spectral and temporal variations of musical performances and allows for tractable melody retrieval. While most of previous studies on music retrieval from melodies were performed with either symbolic (e.g. MIDI) data or with monophonic (single instrument) performances, we performed experiments in retrieving live and studio recordings of operas that contain a leading vocalist and rich instrumental accompaniment. Our results show that the probabilistic approach we propose is effective and can be scaled to massive datasets.

Copyrights may apply

p. 339-346

Graves, Andrew and Lalmas, Mounia (2002): Video retrieval using an MPEG-7 based inference network. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 339-346. Available online

This work proposes a model for video retrieval based upon the inference network model. The document network is constructed using video metadata encoded using MPEG-7 and captures information pertaining to the structural aspects (video breakdown into shots and scenes), conceptual aspects (video, scene and shot content) and contextual aspects (context information about the position of conceptual content within the document). The retrieval process a) exploits the distribution of evidence among the shots to perform ranking of different levels of granularity, b) addresses the idea that evidence may be inherited during evaluation, and c) exploits the contextual information to perform constrained queries.

Copyrights may apply

p. 349-350

Peng, Fuchun, Huang, Xiangji, Schuurmans, Dale, Cercone, Nick and Robertson, Stephen E. (2002): Using self-supervised word segmentation in Chinese information retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 349-350. Available online

We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performance to both the dictionary based and the character based approaches. However, our method is language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual information retrieval systems that are flexible and adaptive.

Copyrights may apply

p. 35-41

Hiemstra, Djoerd (2002): Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 35-41. Available online

This paper follows a formal approach to information retrieval based on statistical language models. By introducing some simple reformulations of the basic language modeling approach we introduce the notion of importance of a query term. The importance of a query term is an unknown parameter that explicitly models which of the query terms are generated from the relevant documents (the important terms), and which are not (the unimportant terms). The new language modeling approach is shown to explain a number of practical facts of today's information retrieval systems that are not very well explained by the current state of information retrieval theory, including stop words, mandatory terms, coordination level ranking and retrieval using phrases.

Copyrights may apply

p. 351-352

Wolin, Ben (2002): Automatic classification in product catalogs. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 351-352. Available online

In this paper, we present the AutoCat system for product classification. AutoCat uses a vector space model, modified to consider product attributes unavailable in traditional document classification. We present key features of our user interface, developed to assist users with evaluating and editing the output of the classification algorithm. Finally, we present observations about the use of this technology in the field.

Copyrights may apply

p. 353-354

Ding, Chris, He, Xiaofeng, Husbands, Parry, Zha, Hongyuan and Simon, Horst D. (2002): PageRank, HITS and a unified framework for link analysis. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 353-354. Available online

Two popular link-based webpage ranking algorithms are (i) PageRank[1] and (ii) HITS (Hypertext Induced Topic Selection)[3]. HITS makes the crucial distinction of hubs and authorities and computes them in a mutually reinforcing way. PageRank considers the hyperlink weight normalization and the equilibrium distribution of random surfers as the citation score. We generalize and combine these key concepts into a unified framework, in which we prove that rankings produced by PageRank and HITS are both highly correlated with the ranking by in-degree and out-degree.

Copyrights may apply

p. 355-356

Murdock, Vanessa and Croft, W. Bruce (2002): Task orientation in question answering. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 355-356. Available online

p. 357-358

Damerau, Fred J., Zhang, Tong, Weiss, Sholom M. and Indurkhya, Nitin (2002): Experiments in high-dimensional text categorization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 357-358. Available online

We present results for automated text categorization of the Reuters-810000 collection of news stories. Our experiments use the entire one-year collection of 810,000 stories and the entire subject index. We divide the data into monthly groups and provide an initial benchmark of text categorization performance on the complete collection. Experimental results show that efficient sparse-feature implementations of linear methods and decision trees, using a global unstemmed dictionary, can readily handle applications of this size. Predictive performance is approximately as strong as the best results for the much smaller older Reuters collections. Detailed results are provided over time periods. It is shown that a smaller time horizon does not diminish predictive quality, implying reduced demands for retraining when sample size is large.

Copyrights may apply

p. 359-360

Yuan, Xiao-Jun, Belkin, Nicholas J. and Kim, Ja-Young (2002): The relationship between ASK and relevance criteria. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 359-360. Available online

p. 361-362

Bingham, Ella, Kuusisto, Jukka and Lagus, Krista (2002): ICA and SOM in text document analysis. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 361-362. Available online

In this study we show experimental results on using Independent Component Analysis (ICA) and the Self-Organizing Map (SOM) in document analysis. Our documents are segments of spoken dialogues carried out over the telephone in a customer service, transcribed into text. The task is to analyze the topics of the discussions, and to group the discussions into meaningful subsets. The quality of the grouping is studied by comparing to a manual topical classification of the documents.

Copyrights may apply

p. 363-364

Boyapati, Vijay (2002): Improving hierarchical text classification using unlabeled data. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 363-364. Available online

p. 365-366

Dziadosz, Susan and Chandrasekar, Raman (2002): Do thumbnail previews help users make better relevance decisions about web search results?. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 365-366. Available online

We describe an empirical evaluation of the utility of thumbnail previews in web search results. Results pages were constructed to show text-only summaries, thumbnail previews only, or the combination of text summaries and thumbnail previews. We found that in the combination case, users were able to make more accurate decisions about the potential relevance of results than in either of the other versions, with hardly any increase in speed of processing the page as a whole.

Copyrights may apply

p. 367-368

Ciravegna, Fabio, Dingli, Alexiei, Wilks, Yorick and Petrelli, Daniela (2002): Amilcare: adaptive information extraction for document annotation. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 367-368. Available online

p. 369-370

Clarke, C. L. A., Cormack, Gordon V., Laszlo, M., Lynam, T. R. and Terra, E. L. (2002): The impact of corpus size on question answering performance. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 369-370. Available online

Using our question answering system, questions from the TREC 2001 evaluation were executed over a series of Web data collections, with the sizes of the collections increasing from 25 gigabytes up to nearly a terabyte.

Copyrights may apply

p. 371-372

Conrad, Jack G., Yang, Changwen and Claussen, Joanne S. (2002): Effective collection metasearch in a hierarchical environment: global vs. localized retrieval performance. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 371-372. Available online

We compare standard global IR searching with user-centric localized techniques to address the database selection problem. We conduct a series of experiments to compare the retrieval effectiveness of three separate search modes applied to a hierarchically structured data environment of textual database representations. The data environment is represented as a tree-like directory containing over 15,000 unique databases and over 100,000 total leaf nodes. Our search modes consist of varying degrees of browse and search, from a global search at the root node to a refined search at a sub-node using dynamically-calculated inverse document frequencies (idfs) to score candidate databases for probable relevance. Our findings indicate that a browse and search approach that relies upon localized searching from sub-nodes is capable of producing the most effective results.

Copyrights may apply

p. 373-374

Crestani, Fabio, Fuente, Pablo de la and Vegas, Jesus (2002): Experimenting with graphical user interfaces for structured document retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 373-374. Available online

p. 375-376

Eguchi, Koji, Oyama, Keizo, Ishida, Emi, Kuriyama, Kazuko and Kando, Noriko (2002): The web retrieval task and its evaluation in the third NTCIR workshop. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 375-376. Available online

This paper gives an overview of the evaluation method used for the Web Retrieval Task in the Third NTCIR Workshop, which is currently in progress. In the Web Retrieval Task, we try to assess the retrieval effectiveness of each Web search engine system using a common data set, and attempt to build a re-usable test collection suitable for evaluating Web search engine systems. With these objectives, we have built 100-gigabyte and 10-gigabyte document sets, mainly gathered from the '.jp' domain. Relevance judgment is performed on the retrieved documents, which are written in Japanese or English.

Copyrights may apply

p. 377-378

Franz, Martin and McCarley, J. Scott (2002): How Many Bits are Needed to Store Term Frequencies?. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 377-378. Available online

Search algorithms in most current text retrieval systems use index data structures extracted from the original text documents. In this paper we focus on reducing the size of the indices by reducing the amount of space dedicated to store term frequencies. In experiments using TREC Ad Hoc [2, 3] corpora and query sets, we show that it is possible to store the term frequency in only two bits without decreasing retrieval performance.

Copyrights may apply

p. 379-380

Gery, Mathias (2002): Non-linear reading for a structured web indexation. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 379-380. Available online

The growth of the Web has posed new challenges for Information Retrieval (IR). Most of the current systems are based on traditional models, which have been developed for atomic and independents documents and are not adapted to the Web. A promising research orientation consists of studying the impact of the Web structure on indexing. The HyperDocument model presented in this article is based on essential aspects of information comprehension: content, composition and linear/non-linear reading.

Copyrights may apply

p. 381-382

Chowdhury, Abdur, McCabe, M. Catherine, Grossman, David A. and Frieder, Ophir (2002): Document normalization revisited. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 381-382. Available online

Cosine Pivoted Document Length Normalization has reached a point of stability where many researchers indiscriminately apply a specific value of 0.2 regardless of the collection. Our efforts, however, demonstrate that applying this specific value without tuning for the document collection degrades average

Copyrights may apply

p. 385-386

Hoashi, Keiichiro, Zeitler, Erik and Inoue, Naomi (2002): Implementation of relevance feedback for content-based music retrieval based on user preferences. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 385-386. Available online

p. 387-388

Jones, Christopher B., Purves, R., Ruas, A., Sanderson, M., Sester, M., Kreveld, M. van and Weibel, R. (2002): Spatial information retrieval and geographical ontologies an overview of the SPIRIT project. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 387-388. Available online

p. 389-390

Jones, Gareth J. F. and Gabb, Steven M. (2002): A visualisation tool for topic tracking analysis and development. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 389-390. Available online

Topic Detection and Tracking (TDT) research explores the development of algorithms to detect novel events and track their development over time for online reports. Development of these methods requires careful evaluation and analysis. Traditional reductive methods of evaluation only represent some of the available information of algorithm behaviour. We describe a visualisation tool for topic tracking which makes it easy to analysis and compare the temporal behaviour of tracking algorithms.

Copyrights may apply

p. 391-392

Kim, Sang-Bum, Rim, Hae-Chang and Lim, Heui-Seok (2002): A new method of parameter estimation for multinomial naive bayes text classifiers. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 391-392. Available online

Multinomial naive Bayes classifiers have been widely used for the probabilistic text classification. However, their parameter estimation method sometimes generates inappropriate probabilities. In this paper, we propose a topic document model approach for naive Bayes text classification, where their parameters are estimated with an expectation from the training documents. Experiments are conducted on Reuters 21578 and 20 Newsgroup collection, and our proposed approach obtained a significant improvement in performance over the conventional approach.

Copyrights may apply

p. 393-394

KOU, Huaizhong and Gardarin, Georges (2002): Study of category score algorithms for k-NN classifier. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 393-394. Available online

We analyzes category score algorithms for k-NN classifier found in the literature, including majority voting algorithm (MVA), simple sum algorithm (SSA). MVA and SSA are two mainly used algorithms to estimate score for candidate categories in k-NN classifier systems. Based on the hypothesis that utilization of internal relation between documents and categories could improve system performance, two new weighting score models: concept-based weighting (CBW) score model and term independence-based weighting (IBW) score model are proposed. Our experimental results confirm our hypothesis and show that in the term of precision average IBW and CBW are better than the other score models, while SSA is higher than MVA. According to macro-average F1 CBW performs best. Rocchio-based algorithm (RBA) always performs worst.

Copyrights may apply

p. 395-396

Kwok, K. L. (2002): Higher precision for two-word queries. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 395-396. Available online

Queries have specific properties, and may need individualized methods and parameters to optimize retrieval. Length is one property. We look at how two-word queries may attain higher precision by re-ranking using word co-occurrence evidence in retrieved documents. Co-occurrence within document context is not sufficient, but window context including sentence context evidence can provide precision improvements at low recall region of 4 to 10% using initial retrieval results, and positively affects pseudo-relevance feedback.

Copyrights may apply

p. 397-398

Larsen, Birger and Ingwersen, Peter (2002): The boomerang effect: retrieving scientific documents via the network of references and citations. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 397-398. Available online

p. 399-400

Larson, Ray R. (2002): A logistic regression approach to distributed IR. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 399-400. Available online

This poster session examines a probabilistic approach to distributed information retrieval using a Logistic Regression algorithm for estimation of collection relevance. The algorithm is compared to other methods for distributed search using test collections developed for distributed search evaluation.

Copyrights may apply

p. 401-402

Liddy, Elizabeth D., Allen, Eileen, Harwell, Sarah, Corieri, Susan, Yilmazel, Ozgur, Ozgencil, N. Ercan, Diekema, Anne, McCracken, Nancy, Silverstein, Joanne and Sutton, Stuart (2002): Automatic metadata generation & evaluation. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 401-402. Available online

The poster reports on a project in which we are investigating methods for breaking the human metadata-generation bottleneck that plagues Digital Libraries. The research question is whether metadata elements and values can be automatically generated from the content of educational resources, and correctly assigned to mathematics and science educational materials. Natural Language Processing and Machine Learning techniques were implemented to automatically assign values of the GEMgenerate metadata element set tofor learning resources provided by the Gateway for Education (GEM), a service that offers web access to a wide range of educational materials. In a user study, education professionals evaluated the metadata assigned to learning resources by either automatic tagging or manual assignment. Results show minimal difference in the eyes of the evaluators between automatically generated metadata and manually assigned metadata.

Copyrights may apply

p. 403-404

Manmatha, R., Feng, Ao and Allan, James (2002): A critical examination of TDT's cost function. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 403-404. Available online

Topic Detection and Tracking (TDT) tasks are evaluated using a cost function. The standard TDT cost function assumes a constant probability of relevance P(rel) across all topics. In practice, P(rel) varies widely across topics. We argue using both theoretical and experimental evidence that the cost function should be modified to account for the varying P(rel).

Copyrights may apply

p. 405-406

Mayfield, James and McNamee, Paul (2002): Converting on-line bilingual dictionaries from human-readable to machine-readable form. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 405-406. Available online

We describe a language called ABET that allows rapid conversion of on-line human-readable bilingual dictionaries to machine-readable form.

Copyrights may apply

p. 407-408

Nomoto, Tadashi and Matsumoto, Yuji (2002): Modeling (in)variability of human judgments for text summarization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 407-408. Available online

The paper proposes and empirically motivates an integration of supervised learning with unsupervised learning to deal with human biases in summarization. In particular, we explore the use of probabilistic decision tree within the clustering framework to account for the variation as well as regularity in human created summaries.

Copyrights may apply

p. 409-410

Rauber, Andreas, Pampalk, Elias and Merkl, Dieter (2002): Content-based music indexing and organization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 409-410. Available online

While electronic music archives are gaining popularity, access to and navigation within these archives is usually limited to text-based queries or manually predefined genre category browsing. We present a system that automatically organizes a music collection according to the perceived sound similarity resembling genres or styles of music. Audio signals are processed according to psychoacoustic models to obtain a time-invariant representation of its characteristics. Subsequent clustering provides an intuitive interface where similar pieces of music are grouped together on a map display.

Copyrights may apply

p. 411-412

Sakai, Tetsuya and Robertson, Stephen E. (2002): Relative and absolute term selection criteria: a comparative study for English and Japanese IR. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 411-412. Available online

p. 413-414

Shou, Xiao Mang and Sanderson, Mark (2002): Experiments on data fusion using headline information. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 413-414. Available online

This poster describes initial work exploring a relatively unexamined area of data fusion: fusing the results of retrieval systems whose collections have no overlap between them. Many of the effective meta-search/data fusion strategies gain much of their success from exploiting document overlap across the source systems being merged. When the intersection of the collections is the empty set, the strategies generally degrade to a simpler form. In order to address such situations, two strategies were examined: re-ranking of merged results using a locally run search on the text fragments returned by the source search engines; and re-ranking based on cross document similarity, again using text fragments presented in the retrieved list. Results, from experiments, which go beyond previous work, indicate that both strategies improve fusion effectiveness.

Copyrights may apply

p. 415-416

Lavelli, Alberto, Magnini, Bernardo and Sebastiani, Fabrizio (2002): Building thematic lexical resources by term categorization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 415-416. Available online

We discuss the automatic generation of thematic lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and themes (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each ci in a set C = {c1,...,cm} of themes, a sequence Li0? Li1? ... ? Lin of lexicons, bootstrapping from an initial lexicon Li0 and a set of text corpora Θ = {θ0,...,θn-1} given as input. The method is inspired by text categorization, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, term categorization deals (dually) with terms represented as vectors in a space of documents, and labels terms (instead of documents) with themes. As a learning device we adopt boosting, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of "data cleaning", thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps.

Copyrights may apply

p. 417-418

Evans, David A., Shanahan, James G. and Sheftel, Victor (2002): Topic structure modeling. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 417-418. Available online

In this paper, we present a method based on document probes to quantify and diagnose topic structure, distinguishing topics as monolithic, structured, or diffuse. The method also yields a structure analysis that can be used directly to optimize filter (classifier) creation. Preliminary results illustrate the predictive value of the approach on TREC/Reuters-96 topics.

Copyrights may apply

p. 419-420

Jin, Rong, Si, Luo, Hauptmann, Alexander G. and Callan, Jamie (2002): Language model for IR using collection information. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 419-420. Available online

Information retrieval using meta data can be traced back to the early age of IR where documents are represented by the controlled vocabulary. In this paper, we explore the usage of meta-data information under the framework of language model. We present a new language model that is able to take advantage of the category information for documents to improve the retrieval accuracy. We compare the new language model with the traditional language model over the TREC4 dataset where the collection information for documents is obtained using the k-means clustering method. The new language model outperforms the traditional language model, which verifies our statement.

Copyrights may apply

p. 42-48

Jin, Rong, Hauptmann, Alexander G. and Zhai, Cheng Xiang (2002): Title language model for information retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 42-48. Available online

In this paper, we propose a new language model, namely, a title language model, for information retrieval. Different from the traditional language model used for retrieval, we define the conditional probability P(Q|D) as the probability of using query Q as the title for document D. We adopted the statistical translation model learned from the title and document pairs in the collection to compute the probability P(Q|D). To avoid the sparse data problem, we propose two new smoothing methods. In the experiments with four different TREC document collections, the title language model for information retrieval with the new smoothing method outperforms both the traditional language model and the vector space model for IR significantly.

Copyrights may apply

p. 421-422

Chowdhury, Abdur and Soboroff, Ian (2002): Automatic evaluation of world wide web search services. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 421-422. Available online

Users of the World-Wide Web are not only confronted by an immense overabundance of information, but also by a plethora of tools for searching for the web pages that suit their information needs. Web search engines differ widely in interface, features, coverage of the web, ranking methods, delivery of advertising, and more. In this paper, we present a method for comparing search engines automatically based on how they rank known item search results. Because the engines perform their search on overlapping (but different) subsets of the web collected at different points in time, evaluation of search engines poses significant challenges to the traditional information retrieval methodology. Our method uses known item searching; comparing the relative ranks of the items in the search engines' rankings. Our approach automatically constructs known item queries using query log analysis and automatically constructs the result via analysis of editor comments from the ODP (Open Directory Project). Additionally, we present our comparison on five (Lycos, Netscape, Fast, Google, HotBot) well-known search services and find that some services perform known item searches better than others, but the majority are statistically equivalent.

Copyrights may apply

p. 423-424

Soboroff, Ian (2002): Does WT10g look like the web?. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 423-424. Available online

We measure the WT10g test collection, used in the TREC-9 and TREC 2001 Web Tracks, with common measures used in the web topology community, in order to see if WT10g "looks like" the web. This is not an idle question; characteristics of the web, such as power law relationships, diameter, and connected components have all been observed within the scope of general web crawls, constructed by blindly following links. In contrast, WT10g was carved out from a larger crawl specifically to be a web search test collection within the reach of university researchers. Does such a collection retain the properties of the larger web? In the case of WT10g, yes.

Copyrights may apply

p. 425-426

Srikanth, Munirathnam and Srihari, Rohini (2002): Biterm language models for document retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 425-426. Available online

p. 427-428

Takeda, Yoshiyuki and Umemura, Kyoji (2002): Selecting indexing strings using adaptation. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 427-428. Available online

It is not easy to tokenize agglutinative languages like Japanese and Chinese into words. Many IR systems start with a dictionary-based morphology program like ChaSen [4]. Unfortunately, dictionaries cannot cover all possible words; unknown words such as proper nouns are important for IR. This paper proposes a statistical dictionary-free method for selecting index strings based on recent work on adaptive language modeling.

Copyrights may apply

p. 429-430

Tseng, Yuen-Hsien (2002): Error correction in a Chinese OCR test collection. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 429-430. Available online

This article proposes a technique for correcting Chinese OCR errors to support retrieval of scanned documents. The technique uses a completely automatic technique (no manually constructed lexicons or confusion resources) to identify both keywords and confusable terms. Improved retrieval effectiveness on a single term query experiment is demonstrated.

Copyrights may apply

p. 431-432

Turpin, Andrew and Hersh, William (2002): User interface effects in past batch versus user experiments. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 431-432. Available online

p. 433-434

Verma, Rakesh M. and Behl, Sanjiv (2002): K-tree/forest: efficient indexes for boolean queries. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 433-434. Available online

In Information Retrieval it is well-known that the complexity of processing boolean queries depends on the size of the intermediate results, which could be huge (and are typically on disk) even though the size of the final result may be quite small. In the case of inverted files the most time consuming operation is the merging or intersection of the list of occurrences [1]. We propose, the Keyword tree (K-tree) and forest, efficient structures to handle boolean queries in keyword-based information retrieval. Extensive simulations show that K-tree is orders-of-magnitude faster (i.e., far fewer I/O's) for boolean queries than the usual approach of merging the lists of occurrences and incurs only a small overhead for single keyword queries. The K-tree can be efficiently parallelized as well. The construction cost of K-tree is comparable to the cost of building inverted files.

Copyrights may apply

p. 435-436

Wang, Bin, Cheng, Xueqi and Bai, Shuo (2002): Example-based phrase translation in Chinese-English CLIR. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 435-436. Available online

This paper proposes an example-based phrase translation method in a Chinese to English cross-language information retrieval (CLIR) system. The method can generate much more accurate query translations than dictionary-based and common MT-based methods, and then improves the retrieval performance of our CLIR system.

Copyrights may apply

p. 437-438

Westerveld, Thijs (2002): Probabilistic multimedia retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 437-438. Available online

We present a framework in which probabilistic models for textual and visual information retrieval can be integrated seamlessly. The framework facilitates searching for imagery using textual descriptions and visual examples simultaneously. The underlying Language Models for text and Gaussian Mixture Models for images have proven successful in various retrieval tasks.

Copyrights may apply

p. 439-440

Yang, Wenfeng (2002): Chinese keyword extraction based on max-duplicated strings of the documents. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 439-440. Available online

The corpus analysis methods in Chinese keyword extraction look on the corpus as a single sample of language stochastic process. But the distributions of keywords in the whole corpus and in each document are very different from each other. The extraction based on global statistical information only can get significant keywords in the whole corpus. Max-duplicated strings contain the local significant keywords in each document. In this paper, we designed an efficient algorithm to extract the max-duplicated strings by building PAT-tree for the document, so that the keywords can be picked out from the max-duplicated strings by their SIG values in the corpus.

Copyrights may apply

p. 441-442

Feng, Yazhong, Zhuang, Yueting and Pan, Yunhe (2002): A hierarchical approach: query large music database by acoustic input. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 441-442. Available online

p. 443-444

Zha, Hongyuan and Ji, Xiang (2002): Correlating multilingual documents via bipartite graph modeling. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 443-444. Available online

There is enormous amount of multilingual documents from various sources and possibly from different countries describing a single event or a set of related events. It is desirable to construct text mining methods that can compare and highlight similarities and differences of those multilingual documents. We discuss our ongoing research that seeks to model a pair of multilingual documents as a weighted bipartite graph with the edge weights computed by means of machine translation. We use spectral method to identify dense subgraphs of the weighted bipartite graph which can be considered as corresponding to sentences that correlate well in textual contents. We illustrate our approach using English and German texts.

Copyrights may apply

p. 446

White, Ryen W., Jose, Joemon M. and Ruthven, Ian (2002): A system using implicit feedback and top ranking sentences to help users find relevant web documents. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 446. Available online

We present a web search interface designed to encourage users to interact more fully with the results of a web search. Wrapping around a major commercial search engine, the system combines three main features; real-time query-biased web document summarisation, the presentation of sentences highly relevant to the searcher's query, and evidence captured from searcher interaction with the retrieval results.

Copyrights may apply

p. 447

Hurst, Wolfgang (2002): Indexing, searching, and retrieving of recorded live presentations with the AOF (authoring on the fly) search engine. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 447. Available online

The tremendous amount of data resulting from the regular usage of tools for automatic presentation recording demand for elaborate search functionality. A detailed analysis of the according multimedia documents is required to allow search at a very detailed level. Unfortunately, the produced data differs significantly from traditional documents. In this demo, we discuss the problems appearing in the presentation retrieval scenario and introduce aofSE, a search engine to study and illustrate these problems as well as to develop and present according solutions and new approaches for this task.

Copyrights may apply

p. 448

Keskustalo, Heikki, Hedlund, Turid and Airio, Eija (2002): UTACLIR: general query translation framework for several language pairs. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 448. Available online

p. 449

Fuhr, Norbert, Govert, Norbert and Grossjohann, Kai (2002): HyREX: hyper-media retrieval engine for XML. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 449. Available online

p. 450

Sormunen, Eero, Hokkanen, Sakari, Kangaslampi, Petteri, Pyy, Petri and Sepponen, Bemmu (2002): Query performance analyser: a web-based tool for IR research and instruction. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 450. Available online

The Interactive Query Performance Analyser (QPA) for information retrieval systems is a Web-based tool for analysing and comparing the performance of individual queries. On top of a standard test collection, it gives an instant visualisation of the performance achieved in a given search topic by any user-generated query. In addition to experimental IR research, QPA can be used in user training to demonstrate the characteristics of and compare differences between IR systems and searching strategies. The first prototype (versions 3.0 and 3.5) of the Query Performance Analyser was developed at the Department of Information Studies, University of Tampere, to serve as a tool for rapid query performance analysis, comparison and visualisation [4,5]. Later, it has been applied to interactive optimisation of queries [2,3]. The analyser has served also in learning environments for IR [1]. The demonstration is based on the newest version of the Query Performance Analyser (v. 5.1). It is interfaced to a traditional Boolean IR system (TRIP) and a probabilistic IR system (Inquery) providing access to the TREC collection and two Finnish test collections. Version 5.1 supports multigraded relevance scales, new types of performance visualisations, and query conversions based on mono- and multi-lingual dictionaries. The motivation in developing the analyser is to emphasise the necessity of analysing the behaviour of individual queries. Information retrieval experiments usually measure the average effectiveness of IR methods developed. The analysis of individual queries is neglected although test results may contain individual test topics where general findings do not hold. For the real user of an IR system, the study of variation in results is even more important than averages.

Copyrights may apply

p. 451

Ciravegna, Fabio, Dingli, Alexiei, Wilks, Yorick and Petrelli, Daniela (2002): Adaptive information extraction for document annotation in amilcare. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 451. Available online

Amilcare is a tool for Adaptive Information Extraction (IE) designed for supporting active annotation of documents for the Semantic Web (SW). It can be used either for unsupervised document annotation or as a support for human annotation. Amilcare is portable to new applications/domains without any knowledge of IE, as it just requires users to annotate a small training corpus with the information to be extracted. It is based on (LP)2, a supervised learning strategy for IE able to cope with different texts types, from newspaper-like texts, to rigidly formatted Web pages and even a mixture of them[1][5]. Adaptation starts with the definition of a tag set for annotation, possibly organized as an ontology. Then users have to manually annotate a small training corpus. Amilcare provides a default mouse-based interface called Melita, where annotations are inserted by first selecting a tag from the ontology and then identifying the text area to annotate with the mouse. Differently from similar annotation tools [4, 5], Melita actively supports training corpus annotation. While users annotate texts, Amilcare runs in the background learning how to reproduce the inserted annotation. Induced rules are silently applied to new texts and their results are compared with the user annotation. When its rules reach a (user-defined) level of accuracy, Melita presents new texts with a preliminary annotation derived by the rule application. In this case users have just to correct mistakes and add missing annotations. User corrections are inputted back to the learner for retraining. This technique focuses the slow and expensive user activity on uncovered cases, avoiding requiring annotating cases where a satisfying effectiveness is already reached. Moreover validating extracted information is a much simpler task than tagging bare texts (and also less error prone), speeding up the process considerably. At the end of the corpus annotation process, the system is trained and the application can be delivered. MnM [6] and Ontomat annotizer [7] are two annotation tools adopting Amilcare's learner. In this demo we simulate the annotation of a small corpus and we show how and when Amilcare is able to support users in the annotation process, focusing on the way the user can control the tool's proactivity and intrusivity. We will also quantify such support with data derived from a number of experiments on corpora. We will focus on training corpus size and correctness of suggestions when the corpus is increased.

Copyrights may apply

p. 453

Spoerri, Anselm (2002): Souvenir: flexible note-taking tool to pinpoint and share media highlights. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 453. Available online

Digital media audio/video can be difficult to search and share in a personal way. Souvenir is a software system that offers users a flexible and comprehensive way to use their handwritten or text notes to retrieve and share specific media moments. Users can take notes on a variety of devices, such as the paper-based CrossPad, the Palm Pilot and standard keyboard devices. Souvenir segments handwritten notes into an effective media index without the need for handwriting recognition. Users can use their notes to create hyperlinks to random-access media stored in a digital library. Souvenir also has web publishing and email capabilities to enable anyone to access or email media moments directly from a web page. Souvenir annotations capture information that can not be easily inferred by automatic media indexing tools.

Copyrights may apply

p. 454

Joho, Hideo, Sanderson, Mark and Beaulieu, Micheline (2002): Hierarchical approach to term suggestion device. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 454. Available online

Our demonstration shows the hierarchy system working on a locally run search engine. Hierarchies are dynamically generated from the retrieved documents, and visualised on the menus. When a user selects a term from the hierarchy, the documents linked to the term are listed, and the term is then added to the initial query to rerun a search. Through the demonstration we illustrate how hierarchical presentation of expansion terms is achieved, and how our approach supports users to articulate their information needs using the hierarchy.

Copyrights may apply

p. 455

Gey, Fredric C., Chen, Aitao, Buckland, Michael and Larson, Ray (2002): Translingual vocabulary mappings for multilingual information access. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 455. Available online

p. 456

Honkela, Jukka and Tuulos, Ville H. (2002): GS textplorer: adaptive framework for information retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 456. Available online

p. 457

Davulcu, Hasan, Mukherjee, Saikat, Seth, Arvind and Ramakrishnan, I. V. (2002): CuTeX: a system for extracting data from text tables. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 457. Available online

A wealth of information relevant for e-commerce often appears in text form. This includes specification and performance data sheets of products, financial statements, product offerings etc. Typically these types of product and financial data are published in tabular form. The only separators between items in the table are white spaces and line separators. We will refer to such tables as text tables. Due to the lack of structure in such tables, the information present is not readily queriable using traditional database query languages like SQL. One way to make it amenable to standard database querying techniques is to extract the data items in the tables and create a database out of the extracted data. But extraction from text tables poses difficulties due to the irregularity of the data in the column.

Copyrights may apply

p. 458

Choudhari, Prashant, Davulcu, Hasan, Joglekar, Abhishek, More, Akshay, Mukherjee, Saikat, Patil, Supriya and Ramakrishnan, I. V. (2002): YellowPager: a tool for ontology-based mining of service directories from web sources. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. p. 458. Available online

The web has established itself as the dominant medium for doing electronic commerce. Realizing that its global reach provides significant market and business opportunities, service providers, both large and small are advertising their services on the web. A number of them operate their own web sites promoting their services at length while others are merely listed in a referral site. Aggregating all of the providers into a queriable service directory makes it easy for customers to locate the one most suited for his/her needs. YellowPager is a tool for creating service directories by mining web sources. Service directories created by YellowPager have several merits compared to those generated by existing practices, which typically require participation by service providers (e.g. Verizon's SuperYellowPages.com). Firstly, the information content will be rich. Secondly since the process is automated and repeatable the content can always be kept current. Finally the same process can be readily adapted to different domains. YellowPager builds service directories by mining the web through a combination of keyword-based search engines, web agents, text classifiers and novel extraction algorithms. The extraction is driven by a services ontology consisting of a taxonomy of service concepts and their associated attributes (such as names and addresses) and type descriptions for the attributes. In addition the ontology also associates an extractor function with each attribute. Applying the function to a web page will identify all the occurrences of the attribute in that page. YellowPager's mining algorithm consists of a training step followed by classification and extraction steps. In the training step a classifier is trained to identify web pages relevant to the service of interest. The classification step proceeds by doing a search for the particular service of interest using a keyword based web search engine and retrieves all the matching web pages. From these pages the relevant ones are identified using the classifier. The final step is extraction of attribute values, associated with the service, from these pages. Each web page is parsed into a DOM tree and the extractor functions are applied. All of the attributes corresponding to a service provider are then correctly aggregated. This can pose difficulties especially in the presence of multiple service providers in a page. Using a novel concept of scoring and conflict resolution to prevent erroneous associations of attributes with service provider entities in the page, the algorithm aggregates all the attribute occurrences correctly. The extractor function may not be complete in the sense that it cannot always identify all the attributes in a page. By exploiting the regularity of the sequence in which attributes occur in referral pages, the mining algorithm automatically learns generalized patterns to locate attributes that the extractor function misses. The distinguishing aspects of YellowPager's extraction algorithm are: (i) it is unsupervised, and (ii) the attribute values in the pages are extracted independent of any page-specific relationships that may exist among the markup tags. YellowPager has been used by a large pet food producer to build a directory of veterinarian service providers in the United States. The resulting database was found to be much larger and richer than that found in Vetquest, Vetworld, and the Super Yellow pages. YellowPager is implemented in JAVA and is interfaced to Rainbow, a library utility in C that is used for classification. The tool will demonstrate the creation of a service directory for any service domain by mining web sources.

Copyrights may apply

p. 49-56

Zhai, Chengxiang and Lafferty, John (2002): Two-stage language models for information retrieval. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 49-56. Available online

The optimal settings of retrieval parameters often depend on both the document collection and the query, and are usually found through empirical tuning. In this paper, we propose a family of two-stage language models for information retrieval that explicitly captures the different influences of the query and document collection on the optimal settings of retrieval parameters. As a special case, we present a two-stage smoothing method that allows us to estimate the smoothing parameters completely automatically. In the first stage, the document language model is smoothed using a Dirichlet prior with the collection language model as the reference model. In the second stage, the smoothed document language model is further interpolated with a query background language model. We propose a leave-one-out method for estimating the Dirichlet parameter of the first stage, and the use of document mixture models for estimating the interpolation parameter of the second stage. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to -- or better than -- the best results achieved using a single smoothing method and exhaustive parameter search on the test data.

Copyrights may apply

p. 57-64

White, Ryen W., Ruthven, Ian and Jose, Joemon M. (2002): Finding relevant documents using top ranking sentences: an evaluation of two alternative schemes. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 57-64. Available online

In this paper we present an evaluation of techniques that are designed to encourage web searchers to interact more with the results of a web search. Two specific techniques are examined: the presentation of sentences that highly match the searcher's query and the use of implicit evidence. Implicit evidence is evidence captured from the searcher's interaction with the retrieval results and is used to automatically update the display. Our evaluation concentrates on the effectiveness and subject perception of these techniques. The results show, with statistical significance, that the techniques are effective and efficient for information seeking.

Copyrights may apply

p. 65-72

Chen, Mao, LaPaugh, Andrea S. and Singh, Jaswinder Pal (2002): Predicting category accesses for a user in a structured information space. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 65-72. Available online

In a categorized information space, predicting users' information needs at the category level can facilitate personalization, caching and other topic-oriented services. This paper presents a two-phase model to predict the category of a user's next access based on previous accesses. Phase 1 generates a snapshot of a user's preferences among categories based on a temporal and frequency analysis of the user's access history. Phase 2 uses the computed preferences to make predictions at different category granularities. Several alternatives for each phase are evaluated, using the rating behaviors of on-line raters as the form of access considered. The results show that a method based on re-access pattern and frequency analysis of a user's whole history has the best prediction quality, even over a path-based method (Markov model) that uses the combined history of all users.

Copyrights may apply

p. 73-80

Smith, David A. (2002): Detecting and Browsing Events in Unstructured text. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 73-80. Available online

Previews and overviews of large, heterogeneous information resources help users comprehend the scope of collections and focus on particular subsets of interest. For narrative documents, questions of "what happened? where? and when?" are natural points of entry. Building on our earlier work at the Perseus Project with detecting terms, place names, and dates, we have exploited co-occurrences of dates and place names to detect and describe likely events in document collections. We compare statistical measures for determining the relative significance of various events. We have built interfaces that help users preview likely regions of interest for a given range of space and time by plotting the distribution and relevance of various collocations. Users can also control the amount of collocation information in each view. Once particular collocations are selected, the system can identify key phrases associated with each possible event to organize browsing of the documents themselves.

Copyrights may apply

p. 81-88

Zhang, Yi, Callan, Jamie and Minka, Thomas (2002): Novelty and redundancy detection in adaptive filtering. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 81-88. Available online

This paper addresses the problem of extending an adaptive information filtering system to make decisions about the novelty and redundancy of relevant documents. It argues that relevance and redundance should each be modelled explicitly and separately. A set of five redundancy measures are proposed and evaluated in experiments with and without redundancy thresholds. The experimental results demonstrate that the cosine similarity metric and a redundancy measure based on a mixture of language models are both effective for identifying redundant documents.

Copyrights may apply

p. 89-96

Leuski, Anton and Allan, James (2002): Improving realism of topic tracking evaluation. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 89-96. Available online

Topic tracking and information filtering are models of interactive tasks, but their evaluations are generally done in a way that does not reflect likely usage. The models either force frequent judgments or disallow any at all, assume the user is always available to make a judgment, and do not allow for user fatigue. In this study we extend the evaluation framework for topic tracking to incorporate those more realistic issues. We demonstrate that tracking can be done in a realistic interactive setting with minimal impact on tracking cost and with substantial reduction in required interaction.

Copyrights may apply

p. 97-104

Chai, Kian Ming Adam, Chieu, Hai Leong and Ng, Hwee Tou (2002): Bayesian online classifiers for text classification and filtering. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002. pp. 97-104. Available online

This paper explores the use of Bayesian online classifiers to classify text documents. Empirical results indicate that these classifiers are comparable with the best text classification systems. Furthermore, the online approach offers the advantage of continuous learning in the batch-adaptive text filtering task.

Copyrights may apply




What do YOU think?

Give us your opinion! Do you have any comments/additions that you would like other visitors to see?

 
comment You say: Mar 17th, 2010
#1
Be the first to add a thoughtful note to this page ! 

  will be spam-protected
 

 
How many?
=
e.g. "6"
 

Changes to this page (conference)

24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
24 Jun 2007: Conference Proceedings was edited
Mar 17

More and more we're being asked to live with technology that is technically reliable, because it was created to fit our knowledge of the physical world, but that is so complex or so counterintuitive that it's actually unusable by most human beings.

-- Kim Vicente, The Human Factor, p. 17.

  • Share this quote on... Bookmark and Share
  • Get more quotes

Eva Hornecker on Tangible Interaction

Eva Hornecker explains the evolving concept of Tangible Interaction.

Read Eva's insightful entry here..

Help us help you!

  • Spread the word: Bookmark and Share
  • Donate
  • Other ways to help
 

Page information

Page maintainer: The Editorial Team
How to cite/reference this page
URL: http://www.interaction-design.org/references/conferences/proceedings_of_the_25th_annual_international_acm_sigir_conference_on_research_and_development_in_information_retrieval.html