Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval


 
Time and place:

2007
Conf. description:
SIGIR is the major international forum for the presentation of new research results and the demonstration of new systems and techniques in the field of information retrieval.

References from this conference (2007)

The following articles are from "Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval":

Articles

p. 1

Huis, Edwin van (2007): Strategy follows technology. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. p. 1. Available online

In strategic management there has been a debate over many years. As early as 1962, Alfred Chandler stated: Structure follows Strategy. In the nineteen eighties, Michael Porter modified Chandler's dictum about structure following strategy by introducing a second level of structure: organizational structure follows strategy, which in turn follows structure. So the question became: what is leading what? Technology has in this debate been seen either as a part of the structure of the organisation itself, or as part of the development of the environment in which the organisation tries to survive by adapting. The notion that technological advancement can also change the paradigm of organisational strategy development is new. This has mainly to do with the impact of technological changes on the workflow and procedures of organisations. Never before have these changes been as profound as they are today. Technological change affects us on different levels of our strategic development. I will give three examples of changes that are occurring or have occurred in "Sound and Vision". The first is the introduction of RFID transmitters in admission rings for the Sound and Vision experience. The second is the setup of a back office media asset management, storage and distribution structure for the Public Broadcasters. The third is the development of the archive towards becoming a Media-Application Service Provider.

Copyrights may apply

p. 103-110

Rattenbury, Tye, Good, Nathaniel and Naaman, Mor (2007): Towards automatic extraction of event and place semantics from Flickr tags. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 103-110. Available online

We describe an approach for extracting semantics of tags, unstructured text-labels assigned to resources on the Web, based on each tag's usage patterns. In particular, we focus on the problem of extracting place and event semantics for tags that are assigned to photos on Flickr, a popular photo sharing website that supports time and location (latitude/longitude) metadata. We analyze two methods inspired by well-known burst-analysis techniques and one novel method: Scale-structure Identification. We evaluate the methods on a subset of Flickr data, and show that our Scale-structure Identification method outperforms the existing techniques. The approach and methods described in this work can be used in other domains such as geo-annotated web pages, where text terms can be extracted and associated with usage patterns.

p. 111-118

Fan, Jianping, Gao, Yuli and Luo, Hangzai (2007): Hierarchical classification for automatic image annotation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 111-118. Available online

In this paper, a hierarchical classification framework is proposed for bridging the semantic gap effectively and achieving multi-level image annotation automatically. First, the semantic gap between the low-level computable visual features and users' real information needs is partitioned into four smaller gaps, and multiple approaches are proposed to bridge these smaller gaps more effectively. To learn more reliable contextual relationships between the atomic image concepts and the co-appearances of salient objects, a multi-modal boosting algorithm is proposed. To enable hierarchical image classification and avoid inter-level error transmission, a hierarchical boosting algorithm is proposed by incorporating concept ontology and multi-task learning to achieve hierarchical image classifier training with automatic error recovery. To bridge the gap between the computable image concepts and the users' real information needs, a novel hyperbolic visualization framework is seamlessly incorporated to enable intuitive query specification and evaluation by acquainting the users with a good global view of large-scale image collections. Our experiments on large-scale image databases have also obtained very positive results.

p. 119-126

He, Xiaofei, Min, Wanli, Cai, Deng and Zhou, Kun (2007): Laplacian optimal design for image retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 119-126. Available online

Relevance feedback is a powerful technique to enhance Content-Based Image Retrieval (CBIR) performance. It solicits the user's relevance judgments on the retrieved images returned by the CBIR systems. The user's labeling is then used to learn a classifier to distinguish between relevant and irrelevant images. However, the top returned images may not be the most informative ones. The challenge is thus to determine which unlabeled images would be the most informative (i.e., improve the classifier the most) if they were labeled and used as training samples. In this paper, we propose a novel active learning algorithm, called Laplacian Optimal Design (LOD), for relevance feedback image retrieval. Our algorithm is based on a regression model which minimizes the least square error on the measured (or, labeled) images and simultaneously preserves the local geometrical structure of the image space. Specifically, we assume that if two images are sufficiently close to each other, then their measurements (or, labels) are close as well. By constructing a nearest neighbor graph, the geometrical structure of the image space can be described by the graph Laplacian. We discuss how results from the field of optimal experimental design may be used to guide our selection of a subset of images, which gives us the most information. Experimental results on the Corel database suggest that the proposed approach achieves higher precision in relevance feedback image retrieval.
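
Illustrative sketch (not the authors' LOD algorithm): the smoothness assumption in the abstract above, that nearby images should receive similar labels, is commonly expressed through the graph Laplacian L = D - W of a nearest neighbor graph. The code below builds an unweighted k-nearest-neighbor graph over hypothetical feature vectors and evaluates the penalty f^T L f, which is small when neighboring points share labels. The function names, the binary edge weights, and the choice of k are assumptions made for illustration.

```python
import math


def knn_graph_laplacian(points, k=2):
    """Return (W, L) for a symmetric k-nearest-neighbor graph over `points`.

    W[i][j] is 1.0 if j is among the k nearest neighbors of i (or vice versa),
    and L = D - W, where D is the diagonal matrix of node degrees.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other points by Euclidean distance to point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            W[i][j] = W[j][i] = 1.0  # symmetrize the neighbor relation
    L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return W, L


def smoothness(L, f):
    """Compute f^T L f, which equals 1/2 * sum_ij W_ij * (f_i - f_j)^2."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Two well-separated pairs of hypothetical 2-D image features.
    pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
    _, L = knn_graph_laplacian(pts, k=1)
    print(smoothness(L, [1, 1, 0, 0]))  # labels agree within each pair
    print(smoothness(L, [1, 0, 0, 0]))  # labels disagree inside one pair
```

A labeling that is constant within each neighborhood incurs no penalty, while one that splits a pair of neighbors is penalized; a Laplacian-regularized regression adds this penalty to the least squares error.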

p. 127-134

Turpin, Andrew, Tsegay, Yohannes, Hawking, David and Williams, Hugh E. (2007): Fast generation of result snippets in web search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 127-134. Available online

The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for different size RAM caches. Finally we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal effect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed size cache.
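
Illustrative sketch (not the paper's method): a query biased snippet is typically assembled from the sentences of a document that best match the query. The code below is a deliberately simple version that scores each sentence by the number of distinct query terms it contains and returns the best ones in document order. The scoring rule, the sentence splitter, and the function name are assumptions; the compression and caching techniques the paper studies are not modeled here.

```python
import re


def query_biased_snippet(document, query, max_sentences=2):
    """Return up to `max_sentences` sentences that share terms with `query`."""
    terms = {t.lower() for t in query.split()}
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document)
                 if s.strip()]

    def score(sentence):
        words = set(re.findall(r"\w+", sentence.lower()))
        return len(terms & words)

    # Rank by score (descending); ties keep the earlier sentence first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))
    # Keep only matching sentences and restore document order.
    chosen = sorted(i for i in ranked[:max_sentences]
                    if score(sentences[i]) > 0)
    return " ... ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    doc = ("Caching helps search engines. "
           "Snippets summarize documents for users. "
           "Compression reduces disk reads. "
           "Query biased snippets highlight query terms.")
    print(query_biased_snippet(doc, "query snippets"))
```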

p. 135-142

Clarke, Charles L. A., Agichtein, Eugene, Dumais, Susan and White, Ryen W. (2007): The influence of caption features on clickthrough patterns in web search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 135-142. Available online

Web search engines present lists of captions, comprising title, snippet, and URL, to help users decide which search results to visit. Understanding the influence of features of these captions on Web search behavior may help validate algorithms and guidelines for their improved generation. In this paper we develop a methodology to use clickthrough logs from a commercial search engine to study user behavior when interacting with search result captions. The findings of our study suggest that relatively simple caption features such as the presence of all query terms, the readability of the snippet, and the length of the URL shown in the caption, can significantly influence users' Web search behavior.

p. 143-150

Wan, Xiaojun and Yang, Jianwu (2007): CollabSum: exploiting multiple document clustering for collaborative single document summarizations. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 143-150. Available online

Almost all existing methods conduct the summarization tasks for single documents separately without interactions for each document under the assumption that the documents are considered independent of each other. This paper proposes a novel framework called CollabSum for collaborative single document summarizations by making use of mutual influences of multiple documents within a cluster context. In this study, CollabSum is implemented by first employing the clustering algorithm to obtain appropriate document clusters and then exploiting the graph-ranking based algorithm for collaborative document summarizations within each cluster. Both the within-document and cross-document relationships between sentences are incorporated in the algorithm. Experiments on the DUC2001 and DUC2002 datasets demonstrate the encouraging performance of the proposed approach. Different clustering algorithms have been investigated and we find that the summarization performance relies positively on the quality of the document clusters.

p. 15-22

Bai, Jing, Nie, Jian-Yun, Cao, Guihong and Bouchard, Hugues (2007): Using query contexts in information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 15-22. Available online

A user query is an element that specifies an information need, but it is not the only one. Studies in the literature have found many contextual factors that strongly influence the interpretation of a query. Recent studies have tried to consider the user's interests by creating a user profile. However, a single profile for a user may not be sufficient for the variety of queries that user issues. In this study, we propose to use query-specific contexts instead of user-centric ones, including context around query and context within query. The former specifies the environment of a query such as the domain of interest, while the latter refers to context words within the query, which is particularly useful for the selection of relevant term relations. In this paper, both types of context are integrated in an IR model based on language modeling. Our experiments on several TREC collections show that each of the context factors brings significant improvements in retrieval effectiveness.

p. 151-158

Teevan, Jaime, Adar, Eytan, Jones, Rosie and Potts, Michael A. S. (2007): Information re-retrieval: repeat queries in Yahoo's logs. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 151-158. Available online

People often repeat Web searches, both to find new information on topics they have previously explored and to re-find information they have seen in the past. The query associated with a repeat search may differ from the initial query but can nonetheless lead to clicks on the same results. This paper explores repeat search behavior through the analysis of a one-year Web query log of 114 anonymous users and a separate controlled survey of an additional 119 volunteers. Our study demonstrates that as many as 40% of all queries are re-finding queries. Re-finding appears to be an important behavior for search engines to explicitly support, and we explore how this can be done. We demonstrate that changes to search engine results can hinder re-finding, and provide a way to automatically detect repeat searches and predict repeat clicks.

p. 159-166

White, Ryen W., Bilenko, Mikhail and Cucerzan, Silviu (2007): Studying the use of popular destinations to enhance web search interaction. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 159-166. Available online

We present a novel Web search interaction feature which, for a given query, provides links to websites frequently visited by other users with similar information needs. These popular destinations complement traditional search results, allowing direct navigation to authoritative resources for the query topic. Destinations are identified using the history of search and browsing behavior of many users over an extended time period, whose collective behavior provides a basis for computing source authority. We describe a user study which compared the suggestion of destinations with the previously proposed suggestion of related queries, as well as with traditional, unaided Web search. Results show that search enhanced by destination suggestions outperforms other systems for exploratory tasks, with best performance obtained from mining past user behavior at query-level granularity.

p. 167-174

Schockaert, Steven and Cock, Martine De (2007): Neighborhood restrictions in geographic IR. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 167-174. Available online

Geographic information retrieval (GIR) systems allow users to specify a geographic context, in addition to a more traditional query, enabling the system to pinpoint interesting search results whose relevancy is location-dependent. In particular, local search services have become a widely used mechanism to find businesses, such as hotels, restaurants, and shops, which satisfy a geographical restriction. Unfortunately, many useful types of geographic restrictions are currently not supported in these systems, including restrictions that specify the neighborhood in which the business should be located. As the boundaries of city neighborhoods are not readily available, automated techniques to construct representations of the spatial extent of neighborhoods are required to support this kind of restriction. In this paper, we propose such a technique, using fuzzy footprints to cope with the inherent vagueness of most neighborhood boundaries, and we provide experimental results that demonstrate the potential of our technique in a local search setting.

p. 175-182

Strohman, Trevor and Croft, W. Bruce (2007): Efficient document retrieval in main memory. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 175-182. Available online

Disk access performance is a major bottleneck in traditional information retrieval systems. Compared to system memory, disk bandwidth is poor, and seek times are worse. We circumvent this problem by considering query evaluation strategies in main memory. We show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. We evaluate our techniques using Galago, a new retrieval system designed for efficient query processing. Our system achieves a 69% improvement in query throughput over previous methods.

p. 183-190

Baeza-Yates, Ricardo, Gionis, Aristides, Junqueira, Flavio, Murdock, Vanessa, Plachouras, Vassilis and Silvestri, Fabrizio (2007): The impact of caching on search engines. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 183-190. Available online

In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
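
Illustrative sketch (not the paper's algorithm): the static vs. dynamic distinction in the abstract above can be shown with two toy query result caches. The static cache is filled once with the most frequent queries of a training log and never changes; the dynamic cache uses least-recently-used (LRU) eviction, which is an assumed policy chosen for simplicity. The function names and the tiny logs are hypothetical.

```python
from collections import Counter, OrderedDict


def static_hit_rate(train, test, capacity):
    """Hit rate of a cache pre-filled with the most frequent training queries."""
    cached = {q for q, _ in Counter(train).most_common(capacity)}
    return sum(q in cached for q in test) / len(test)


def lru_hit_rate(stream, capacity):
    """Hit rate of a dynamic cache with least-recently-used eviction."""
    cache = OrderedDict()
    hits = 0
    for q in stream:
        if q in cache:
            hits += 1
            cache.move_to_end(q)  # mark as most recently used
        else:
            cache[q] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(stream)


if __name__ == "__main__":
    train = ["a", "a", "a", "b", "b", "c"]
    test = ["a", "b", "d", "a"]
    print(static_hit_rate(train, test, capacity=2))
    print(lru_hit_rate(["a", "b", "a", "c", "b", "a"], capacity=2))
```

A static cache only pays off when the query distribution is stable between the training and test periods, which is the condition the abstract says holds approximately in practice.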

p. 191-198

Ntoulas, Alexandros and Cho, Junghoo (2007): Pruning policies for two-tiered inverted index with correctness guarantee. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 191-198. Available online

The Web search engines maintain large-scale inverted indexes which are queried thousands of times per second by users eager for information. In order to cope with the vast amounts of query loads, search engines prune their index to keep documents that are likely to be returned as top results, and use this pruned index to compute the first batches of results. While this approach can improve performance by reducing the size of the index, if we compute the top results only from the pruned index we may notice a significant degradation in the result quality: if a document should be in the top results but was not included in the pruned index, it will be placed behind the results computed from the pruned index. Given the fierce competition in the online search market, this phenomenon is clearly undesirable. In this paper, we study how we can avoid any degradation of result quality due to the pruning-based performance optimization, while still realizing most of its benefit. Our contribution is a number of modifications in the pruning techniques for creating the pruned index and a new result computation algorithm that guarantees that the top-matching pages are always placed at the top search results, even though we are computing the first batch from the pruned index most of the time. We also show how to determine the optimal size of a pruned index and we experimentally evaluate our algorithms on a collection of 130 million Web pages.

p. 199-206

Sun, Bingjun, Mitra, Prasenjit, Giles, Clyde Lee, Yen, John and Zha, Hongyuan (2007): Topic segmentation with shared topic detection and alignment of multiple documents. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 199-206. Available online

Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although similar multiple documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI) that is a combination of MI and term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents. It can find the optimal boundaries in a document, and align segments among documents at the same time. It also can handle single-document segmentation as a special case of the multi-document segmentation and alignment. Our methods can identify and strengthen cue terms that can be used for segmentation and partially remove stop words by using term weights based on entropy learned from multiple documents. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and using WMI is even better than using MI for the multi-document segmentation.

p. 2

Jones, Karen Sparck (2007): 2007 Athena Lecturer Award introduction. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. p. 2. Available online

Karen was aware of the severity of her illness. Within days of being informed of the Athena Award outcome and of the award of the BCS Lovelace Medal (she received both notifications on the same day in late February), Karen set about creating a video record of her acceptance presentation. Karen passed away on 4 April 2007, and was unable to realize her goal of attending SIGIR 2007 and accepting the Athena Lecturer Award in person. Please join with us as we remember the rich life and many contributions of Karen Spärck Jones -- scholar, academic, and friend.

Further Information

* http://www.cl.cam.ac.uk/~ksj21/
* http://en.wikipedia.org/wiki/Karen_Sparck_Jones
* http://campus.acm.org/public/pressroom/press_releases/3_2007/athena2007.cfm
* http://www.cl.cam.ac.uk/misc/obituaries/sparck-jones/

p. 207-214

He, Qi, Chang, Kuiyu and Lim, Ee-Peng (2007): Analyzing feature trajectories for event detection. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 207-214. Available online

We consider the problem of analyzing word trajectories in both time and frequency domains, with the specific goal of identifying important and less-reported, periodic and aperiodic words. A set of words with identical trends can be grouped together to reconstruct an event in a completely unsupervised manner. The document frequency of each word across time is treated like a time series, where each element is the document frequency -- inverse document frequency (DFIDF) score at one time point. In this paper, we 1) first applied spectral analysis to categorize features for different event characteristics: important and less-reported, periodic and aperiodic; 2) modeled aperiodic features with Gaussian density and periodic features with Gaussian mixture densities, and subsequently detected each feature's burst by the truncated Gaussian approach; 3) proposed an unsupervised greedy event detection algorithm to detect both aperiodic and periodic events. All of the above methods can be applied to time series data in general. We extensively evaluated our methods on the 1-year Reuters News Corpus [3] and showed that they were able to uncover meaningful aperiodic and periodic events.

p. 215-222

Zhang, Kuo, Zi, Juan and Wu, Li Gang (2007): New event detection based on indexing-tree and named entity. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 215-222. Available online

New Event Detection (NED) aims to detect, from one or more streams of news stories, which stories report on a new event (i.e. one not reported previously). With the overwhelming volume of news available today, there is an increasing need for a NED system which is able to detect new events more efficiently and accurately. In this paper we propose a new NED model to speed up the NED task by using a news indexing-tree dynamically. Moreover, based on the observation that terms of different types have different effects on the NED task, two term reweighting approaches are proposed to improve NED accuracy. In the first approach, we propose to adjust term weights dynamically based on previous story clusters and in the second approach, we propose to employ statistics on training data to learn the named entity reweighting model for each class of stories. Experimental results on two Linguistic Data Consortium (LDC) datasets TDT2 and TDT3 show that the proposed model can improve both efficiency and accuracy of the NED task significantly, compared to the baseline system and other existing systems.

p. 223-230

Huffman, Scott, Lehman, April, Stolboushkin, Alexei, Wong-Toi, Howard, Yang, Fan and Roehrig, Hein (2007): Multiple-signal duplicate detection for search evaluation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 223-230. Available online

We consider the problem of duplicate document detection for search evaluation. Given a query and a small number of web results for that query, we show how to detect duplicate web documents with precision 0.91 and recall 0.77. In contrast, Charikar's algorithm, designed for duplicate detection in an indexing pipeline, achieves precision 0.91 but with a recall of 0.58. Our improvement in recall while maintaining high precision comes from combining three ideas. First, because we are only concerned with duplicate detection among results for the same query, the number of pairwise comparisons is small. Therefore we can afford to compute multiple pairwise signals for each pair of documents. A model learned with standard machine-learning techniques improves recall to 0.68 with precision 0.90. Second, most duplicate detection has focused on text analysis of the HTML contents of a document. In some web pages the HTML is not a good indicator of the final contents of the page. We use extended fetching techniques to fill in frames and execute JavaScript. Including signals based on our richer fetches further improves the recall to 0.75 and the precision to 0.91. Finally, we also explore using signals based on the query. Comparing contextual snippets based on the richer fetches improves the recall to 0.77. We show that the overall accuracy of this final model approaches that of human judges.
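
Illustrative sketch (not from the paper): the precision and recall figures quoted in the abstract above are computed over document pairs. Precision is the fraction of predicted duplicate pairs that are true duplicates, and recall is the fraction of true duplicate pairs that were predicted. The function name and the example pairs below are hypothetical.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for unordered duplicate-document pairs."""
    # frozenset makes ("a", "b") and ("b", "a") the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = [("a", "b"), ("c", "d")]
    actual = [("b", "a"), ("e", "f"), ("c", "d"), ("g", "h")]
    print(precision_recall(predicted, actual))
```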

p. 23-30

Elsweiler, David and Ruthven, Ian (2007): Towards task-based personal information management evaluations. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 23-30. Available online

Personal Information Management (PIM) is a rapidly growing area of research concerned with how people store, manage and re-find information. A feature of PIM research is that many systems have been designed to help users manage and re-find information, but very few have been evaluated. This has been noted by several scholars and explained by the difficulties involved in performing PIM evaluations. The difficulties include that people re-find information from within unique personal collections; that researchers know little about the tasks that cause people to re-find information; and that there are numerous privacy issues concerning personal information. In this paper we aim to facilitate PIM evaluations by addressing each of these difficulties. In the first part, we present a diary study of information re-finding tasks. The study examines the kind of tasks that require users to re-find information and produces a taxonomy of re-finding tasks for email messages and web pages. In the second part, we propose a task-based evaluation methodology based on our findings and examine the feasibility of the approach using two different methods of task creation.

p. 231-238

Broder, Andrei Z., Fontoura, Marcus, Gabrilovich, Evgeniy, Joshi, Amruta, Josifovski, Vanja and Zhang, Tong (2007): Robust classification of rare queries using web knowledge. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 231-238. Available online

We propose a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine. We use a blind feedback technique: given a query, we determine its topic by classifying the web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregation account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.
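
Illustrative sketch (not the paper's system): the blind feedback idea in the abstract above is to classify the pages retrieved for a query and let those page classes determine the query's class. The code below uses a plain majority vote over the top results, which is an assumed aggregation rule; `search` and `classify_page` are hypothetical callables standing in for a search engine and a page classifier.

```python
from collections import Counter


def classify_query(query, search, classify_page, top_k=10):
    """Assign `query` the most common class among its top search results.

    `search(query)` returns a ranked list of pages and `classify_page(page)`
    returns a class label; both are supplied by the caller.
    """
    results = search(query)[:top_k]
    votes = Counter(classify_page(page) for page in results)
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    # Toy stand-ins for a search engine and a page classifier.
    fake_index = {"rare query": ["page1", "page2", "page3"]}
    fake_labels = {"page1": "travel", "page2": "travel", "page3": "sports"}
    label = classify_query("rare query",
                           search=lambda q: fake_index.get(q, []),
                           classify_page=fake_labels.get)
    print(label)
```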

p. 239-246

Craswell, Nick and Szummer, Martin (2007): Random walks on the click graph. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 239-246. Available online

Search engines can record which documents were clicked for which query, and use these query-document pairs as "soft" relevance judgments. However, compared to the true judgments, click logs give noisy and sparse relevance information. We apply a Markov random walk model to a large click log, producing a probabilistic ranking of documents for a given query. A key advantage of the model is its ability to retrieve relevant documents that have not yet been clicked for that query and rank those effectively. We conduct experiments on click logs from image search, comparing our ("backward") random walk model to a different ("forward") random walk, varying parameters such as walk length and self-transition probability. The most effective combination is a long backward walk with high self-transition probability.
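
Illustrative sketch (not the authors' model): the code below runs a simple forward random walk on a bipartite query-document click graph, starting from a query node, with a self-transition probability at every node. It shows how a document that was never clicked for a query can still receive probability mass through queries that share clicked documents. The abstract's preferred variant is a backward walk, which is not implemented here; the function names, the click counts, and the default parameter values are assumptions.

```python
from collections import defaultdict


def build_transitions(clicks, self_prob=0.9):
    """Build transition probabilities from {(query, doc): click_count}."""
    weights = defaultdict(dict)
    for (q, d), count in clicks.items():
        weights[("q", q)][("d", d)] = count
        weights[("d", d)][("q", q)] = count
    transitions = {}
    for node, neighbors in weights.items():
        total = sum(neighbors.values())
        # Move to a neighbor in proportion to click counts, or stay put.
        probs = {n: (1 - self_prob) * c / total for n, c in neighbors.items()}
        probs[node] = self_prob
        transitions[node] = probs
    return transitions


def walk(transitions, start, steps):
    """Return the node distribution after `steps` transitions from `start`."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in dist.items():
            for neighbor, t in transitions[node].items():
                nxt[neighbor] += p * t
        dist = dict(nxt)
    return dist


def rank_documents(clicks, query, steps=3, self_prob=0.9):
    """Rank documents by walk probability, highest first."""
    dist = walk(build_transitions(clicks, self_prob), ("q", query), steps)
    docs = [(node[1], p) for node, p in dist.items() if node[0] == "d"]
    return sorted(docs, key=lambda item: -item[1])


if __name__ == "__main__":
    clicks = {("cat", "d1"): 3, ("kitten", "d1"): 1, ("kitten", "d2"): 2}
    # d2 was never clicked for "cat" but is reachable via "kitten".
    print(rank_documents(clicks, "cat"))
```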

p. 247-254

Yuan, Xiaojun and Belkin, Nicholas J. (2007): Supporting multiple information-seeking strategies in a single system framework. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 247-254. Available online

This paper reports on an experiment comparing the retrieval effectiveness of an interactive information retrieval (IIR) system which adapts to support different information seeking strategies, with that of a standard baseline IIR system. The experiment, with 32 subjects each searching on 8 different topics, indicates that using the integrated IIR system resulted in significantly better performance, including user satisfaction with search results, significantly more effective interaction, and significantly better usability than using the baseline system.

p. 255-262

White, Ryen W. and Morris, Dan (2007): Investigating the querying and browsing behavior of advanced search engine users. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 255-262. Available online

One way to help all users of commercial Web search engines be more successful in their searches is to better understand what those users with greater search expertise are doing, and use this knowledge to benefit everyone. In this paper we study the interaction logs of advanced search engine users (and those not so advanced) to better understand how these user groups search. The results show that there are marked differences in the queries, result clicks, post-query browsing, and search success of users we classify as advanced (based on their use of query operators), relative to those classified as non-advanced. Our findings have implications for how advanced users should be supported during their searches, and how their interactions could be used to help searchers of all experience levels find more relevant information and learn improved searching strategies.

Copyrights may apply

p. 263-270

Tan, Bin, Velivelli, Atulya, Fang, Hui and Zhai, Chengxiang (2007): Term feedback for information retrieval with language models. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 263-270. Available online

In this paper we study term-based feedback for information retrieval in the language modeling approach. With term feedback a user directly judges the relevance of individual terms without interaction with feedback documents, taking full control of the query expansion process. We propose a cluster-based method for selecting terms to present to the user for judgment, as well as effective algorithms for constructing refined query language models from user term feedback. Our algorithms are shown to bring significant improvement in retrieval accuracy over a non-feedback baseline, and achieve comparable performance to relevance feedback. They are helpful even when there are no relevant documents in the top.

Copyrights may apply

p. 271-278

Yue, Yisong, Finley, Thomas, Radlinski, Filip and Joachims, Thorsten (2007): A support vector method for optimizing average precision. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 271-278. Available online

Machine learning is commonly used to improve ranked retrieval systems. Due to computational difficulties, few learning techniques have been developed to directly optimize for mean average precision (MAP), despite its widespread use in evaluating such systems. Existing approaches optimizing MAP either do not find a globally optimal solution, or are computationally expensive. In contrast, we present a general SVM learning algorithm that efficiently finds a globally optimal solution to a straightforward relaxation of MAP. We evaluate our approach using the TREC 9 and TREC 10 Web Track corpora (WT10g), comparing against SVMs optimized for accuracy and ROCArea. In most cases we show our method to produce statistically significant improvements in MAP scores.
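
To make the target metric concrete, here is a minimal sketch of average precision and MAP computed from binary relevance labels listed in rank order. It does not implement the SVM method of the paper; the function names are illustrative, and the code assumes every relevant document appears somewhere in the ranked list.

```python
def average_precision(ranked_relevance):
    """Average precision for one query, given 0/1 labels in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of per-query average precision over a list of rankings."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

Because precision at each relevant document depends on the positions of all documents ranked above it, the metric does not decompose over individual documents, which is one way to see why direct optimization is hard.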

Copyrights may apply

p. 279-286

Qin, Tao, Zhang, Xu-Dong, Wang, De-Sheng, Liu, Tie-Yan, Lai, Wei and Li, Hang (2007): Ranking with multiple hyperplanes. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 279-286. Available online

The central problem for many applications in Information Retrieval is ranking, and learning to rank is considered a promising approach for addressing the issue. Ranking SVM, for example, is a state-of-the-art method for learning to rank and has been empirically demonstrated to be effective. In this paper, we study the issue of learning to rank, particularly the approach of using SVM techniques to perform the task. We point out that although Ranking SVM is advantageous, it still has shortcomings. Ranking SVM employs a single hyperplane in the feature space as the model for ranking, which is too simple to tackle complex ranking problems. Furthermore, the training of Ranking SVM is computationally costly. In this paper, we look at an alternative approach to Ranking SVM, which we call "Multiple Hyperplane Ranker" (MHR), and make comparisons between the two approaches. MHR takes a divide-and-conquer strategy. It employs multiple hyperplanes to rank instances and finally aggregates the ranking results given by the hyperplanes. MHR contains Ranking SVM as a special case, and MHR can overcome the shortcomings from which Ranking SVM suffers. Experimental results on two information retrieval datasets show that MHR can outperform Ranking SVM in ranking.

Copyrights may apply

p. 287-294

Zheng, Zhaohui, Chen, Keke, Sun, Gordon and Zha, Hongyuan (2007): A regression framework for learning ranking functions using relative relevance judgments. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 287-294. Available online

Effective ranking functions are an essential part of commercial search engines. We focus on developing a regression framework for learning ranking functions for improving the relevance of search engines serving diverse streams of user queries. We explore supervised learning methodology from machine learning, and we distinguish two types of relevance judgments used as the training data: 1) absolute relevance judgments arising from explicit labeling of search results; and 2) relative relevance judgments extracted from user clickthroughs of search results or converted from the absolute relevance judgments. We propose a novel optimization framework emphasizing the use of relative relevance judgments. The main contribution is the development of an algorithm based on regression that can be applied to objective functions involving preference data, i.e., data indicating that a document is more relevant than another with respect to a query. Experiments are carried out using data sets obtained from a commercial search engine. Our results show significant improvements of our proposed methods over some existing methods.

Copyrights may apply

p. 295-302

Tao, Tao and Zhai, Chengxiang (2007): An exploration of proximity measures in information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 295-302. Available online

In most existing retrieval models, documents are scored primarily based on various kinds of term statistics such as within-document frequencies, inverse document frequencies, and document lengths. Intuitively, the proximity of matched query terms in a document can also be exploited to promote scores of documents in which the matched query terms are close to each other. Such a proximity heuristic, however, has been largely under-explored in the literature; it is unclear how we can model proximity and incorporate a proximity measure into an existing retrieval model. In this paper, we systematically explore the query term proximity heuristic. Specifically, we propose and study the effectiveness of five different proximity measures, each modeling proximity from a different perspective. We then design two heuristic constraints and use them to guide us in incorporating the proposed proximity measures into an existing retrieval model. Experiments on five standard TREC test collections show that one of the proposed proximity measures is indeed highly correlated with document relevance, and by incorporating it into the KL-divergence language model and the Okapi BM25 model, we can significantly improve retrieval performance.
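
A minimal sketch of one possible proximity measure follows: the smallest distance, in token positions, between occurrences of two different matched query terms. The abstract does not name its five measures, so this particular measure, the fallback value for documents matching fewer than two query terms, and the bonus transform with its `alpha` parameter are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def min_pair_distance(doc_tokens, query_terms):
    """Smallest positional gap between two different matched query terms."""
    positions = {}
    for pos, token in enumerate(doc_tokens):
        if token in query_terms:
            positions.setdefault(token, []).append(pos)
    # Assumed convention: with fewer than two matched terms, use document length.
    best = len(doc_tokens)
    for a, b in combinations(positions, 2):
        for i in positions[a]:
            for j in positions[b]:
                best = min(best, abs(i - j))
    return best


def proximity_bonus(distance, alpha=1.0):
    """Hypothetical decreasing transform: smaller distances give larger bonuses."""
    return math.log(alpha + math.exp(-distance))
```

A bonus of this kind could be added to a base retrieval score so that documents whose matched terms sit close together are promoted, while the contribution flattens out as the distance grows.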

Copyrights may apply

p. 3-6

Jones, Karen Sparck (2007): Natural language and the information layer. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 3-6. Available online

This talk is in response to two Awards: the Association for Computing Machinery's Athena Award, given by the ACM's Committee on Women, on the nomination of the ACM Special Interest Group on Information Retrieval; and the British Computer Society's Lovelace Medal. It is a very great honour to have been given these awards. I would like to say how much I appreciate this recognition. Thank you, ACM and BCS. I would particularly like to say, and I hope the ACM will not take this amiss, how I appreciate being the first woman to be awarded the BCS Lovelace Medal. The awards carry the opportunity to give a lecture with them. I deeply regret not being able to do this live and in a way to suit each specifically. But I hope the single video based on this talk will go a little way as a substitute for two proper lectures. My talk has three parts: on the first phase of natural language processing research and its lessons; on subsequent developments up to the present and their lessons; and on where we are now and what I think are the wider implications for the future.

Copyrights may apply

p. 303-310

Collins-Thompson, Kevyn and Callan, Jamie (2007): Estimation and use of uncertainty in pseudo-relevance feedback. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 303-310. Available online

Existing pseudo-relevance feedback methods typically perform averaging over the top-retrieved documents, but ignore an important statistical dimension: the risk or variance associated with either the individual document models, or their combination. Treating the baseline feedback method as a black box, and the output feedback model as a random variable, we estimate a posterior distribution for the feedback model by resampling a given query's top-retrieved documents, using the posterior mean or mode as the enhanced feedback model. We then perform model combination over several enhanced models, each based on a slightly modified query sampled from the original query. We find that resampling documents helps increase individual feedback model precision by removing noise terms, while sampling from the query improves robustness (worst-case performance) by emphasizing terms related to multiple query aspects. The result is a meta-feedback algorithm that is both more robust and more precise than the original strong baseline method.

Copyrights may apply

p. 31-38

Yang, Yiming, Lad, Abhimanyu, Lao, Ni, Harpale, Abhay, Kisiel, Bryan and Rogati, Monica (2007): Utility-based information distillation over temporally sequenced documents. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 31-38. Available online

This paper examines a new approach to information distillation over temporally ordered documents, and proposes a novel evaluation scheme for such a framework. It combines the strengths of and extends beyond conventional adaptive filtering, novelty detection and non-redundant passage ranking with respect to long-lasting information needs ("tasks" with multiple queries). Our approach supports fine-grained user feedback via highlighting of arbitrary spans of text, and leverages such information for utility optimization in adaptive settings. For our experiments, we defined hypothetical tasks based on news events in the TDT4 corpus, with multiple queries per task. Answer keys (nuggets) were generated for each query and a semi-automatic procedure was used for acquiring rules that allow automatically matching nuggets against system responses. We also propose an extension of the NDCG metric for assessing the utility of ranked passages as a combination of relevance and novelty. Our results show encouraging utility enhancements using the new approach, compared to the baseline systems without incremental learning or the novelty detection components.

Copyrights may apply

p. 311-318

Metzler, Donald and Croft, W. Bruce (2007): Latent concept expansion using Markov random fields. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 311-318. Available online

Query expansion, in the form of pseudo-relevance feedback or relevance feedback, is a common technique used to improve retrieval effectiveness. Most previous approaches have ignored important issues, such as the role of features and the importance of modeling term dependencies. In this paper, we propose a robust query expansion technique based on the Markov random field model for information retrieval. The technique, called latent concept expansion, provides a mechanism for modeling term dependencies during expansion. Furthermore, the use of arbitrary features within the model provides a powerful framework for going beyond simple term occurrence features that are implicitly used by most other expansion techniques. We evaluate our technique against relevance models, a state-of-the-art language modeling query expansion technique. Our model demonstrates consistent and significant improvements in retrieval effectiveness across several TREC data sets. We also describe how our technique can be used to generate meaningful multi-term concepts for tasks such as query suggestion/reformulation.

Copyrights may apply

p. 319-326

Mei, Qiaozhu, Fang, Hui and Zhai, Chengxiang (2007): A study of Poisson query generation model for information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 319-326. Available online

Many variants of language models have been proposed for information retrieval. Most existing models are based on the multinomial distribution and score documents by the query likelihood computed from a probabilistic query generation model. In this paper, we propose and study a new family of query generation models based on the Poisson distribution. We show that in their simplest forms, the new family of models and the existing multinomial models are equivalent; with different smoothing methods, however, the two families of models behave differently. We show that the Poisson model has several advantages, including naturally accommodating per-term smoothing and modeling accurate background more efficiently. We present several variants of the new model corresponding to different smoothing methods, and evaluate them on four representative TREC test collections. The results show that while their basic models perform comparably, the Poisson model can outperform the multinomial model with per-term smoothing. The performance can be further improved with two-stage smoothing.

Copyrights may apply

p. 327-334

Lin, Jimmy and Zhang, Pengyi (2007): Deconstructing nuggets: the stability and reliability of complex question answering evaluation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 327-334. Available online

A methodology based on "information nuggets" has recently emerged as the de facto standard by which answers to complex questions are evaluated. After several implementations in the TREC question answering tracks, the community has gained a better understanding of its many characteristics. This paper focuses on one particular aspect of the evaluation: the human assignment of nuggets to answer strings, which serves as the basis of the F-score computation. As a byproduct of the TREC 2006 ciQA task, identical answer strings were independently evaluated twice, which allowed us to assess the consistency of human judgments. Based on these results, we explored simulations of assessor behavior that provide a method to quantify scoring variations. Understanding these variations in turn lets researchers be more confident in their comparisons of systems.

Copyrights may apply

p. 335-342

Kor, Kian-Wei and Chua, Tat-Seng (2007): Interesting nuggets and their impact on definitional question answering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 335-342. Available online

Current approaches to identifying definitional sentences in the context of Question Answering mainly involve the use of linguistic or syntactic patterns to identify informative nuggets. This is insufficient as they do not address the novelty factor that a definitional nugget must also possess. This paper proposes to address the deficiency by building a "Human Interest Model" from external knowledge. It is hoped that such a model will allow the computation of human interest in the sentence with respect to the topic. We compare and contrast our model with current definitional question answering models to show that interestingness is an important factor in definitional question answering.

Copyrights may apply

p. 343-350

Ko, Jeongwoo, Nyberg, Eric and Si, Luo (2007): A probabilistic graphical model for joint answer ranking in question answering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 343-350. Available online

Graphical models have been applied to various information retrieval and natural language processing tasks in the recent literature. In this paper, we apply a probabilistic graphical model for answer ranking in question answering. This model estimates the joint probability of correctness of all answer candidates, from which the probability of correctness of an individual candidate can be inferred. The joint prediction model can estimate both the correctness of individual answers as well as their correlations, which enables a list of accurate and comprehensive answers. This model was compared with a logistic regression model which directly estimates the probability of correctness of each individual answer candidate. An extensive set of empirical results based on TREC questions demonstrates the effectiveness of the joint model for answer ranking. Furthermore, we combine the joint model with the logistic regression model to improve the efficiency and accuracy of answer ranking.

Copyrights may apply

p. 351-358

Bilotti, Matthew W., Ogilvie, Paul, Callan, Jamie and Nyberg, Eric (2007): Structured retrieval for question answering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 351-358. Available online

Bag-of-words retrieval is popular among Question Answering (QA) system developers, but it does not support constraint checking and ranking on the linguistic and semantic information of interest to the QA system. We present an approach to retrieval for QA, applying structured retrieval techniques to the types of text annotations that QA systems use. We demonstrate that the structured approach can retrieve more relevant results, more highly ranked, compared with bag-of-words, on a sentence retrieval task. We also characterize the extent to which structured retrieval effectiveness depends on the quality of the annotations.

Copyrights may apply

p. 359-366

Bompada, Tanuja, Chang, Chi-Chao, Chen, John, Kumar, Ravi and Shenoy, Rajesh (2007): On the robustness of relevance measures with incomplete judgments. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 359-366. Available online

We investigate the robustness of three widely used IR relevance measures for large data collections with incomplete judgments. The relevance measures we consider are the bpref measure introduced by Buckley and Voorhees [7], the inferred average precision (infAP) introduced by Aslam and Yilmaz [4], and the normalized discounted cumulative gain (NDCG) measure introduced by Järvelin and Kekäläinen [8]. Our main results show that NDCG consistently performs better than both bpref and infAP. The experiments are performed on standard TREC datasets, under different levels of incompleteness of judgments, and using two different evaluation methods, namely, the Kendall correlation, which measures the agreement in order between system rankings, and pairwise statistical significance testing; the latter may be of independent interest.
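
The Kendall correlation between two system rankings can be sketched as below. This is the plain tau-a variant, where tied pairs count as neither concordant nor discordant; whether the paper uses this variant is not stated in the abstract, and the function name and input layout (system name mapped to effectiveness score) are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score assignments over the same systems."""
    systems = sorted(scores_a)
    concordant = 0
    discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both evaluations order the pair the same way.
        agreement = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0
```

In a robustness study of this kind, `scores_a` might hold system scores under complete judgments and `scores_b` the scores under a reduced judgment set, so that a value near 1 indicates the measure preserves the system ordering.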

Copyrights may apply

p. 367-374

Bodoff, David and Li, Pu (2007): Test theory for assessing IR test collections. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 367-374. Available online

How good is an IR test collection? A series of papers in recent years has addressed the question by empirically enumerating the consistency of performance comparisons using alternate subsets of the collection. In this paper we propose using Test Theory, which is based on analysis of variance and is specifically designed to assess test collections. Using the method, we not only can measure test reliability after the fact, but we can estimate the test collection's reliability before it is even built or used. We can also determine an optimal allocation of resources before the fact, e.g. whether to invest in more judges or queries. The method, which is in widespread use in the field of educational testing, complements data-driven approaches to assessing test collections. Whereas the data-driven method focuses on test results, test theory focuses on test designs. It offers unique practical results, as well as insights about the variety and implications of alternative test designs.

Copyrights may apply

p. 375-382

Moffat, Alistair, Webber, William and Zobel, Justin (2007): Strategic system comparisons via targeted relevance judgments. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 375-382. Available online

Relevance judgments are used to compare text retrieval systems. Given a collection of documents and queries, and a set of systems being compared, a standard approach to forming judgments is to manually examine all documents that are highly ranked by any of the systems. However, not all of these relevance judgments provide the same benefit to the final result, particularly if the aim is to identify which systems are best, rather than to fully order them. In this paper we propose new experimental methodologies that can significantly reduce the volume of judgments required in system comparisons. Using rank-biased precision, a recently proposed effectiveness measure, we show that judging around 200 documents for each of 50 queries in a TREC-scale system evaluation containing over 100 runs is sufficient to identify the best systems.
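
For readers unfamiliar with rank-biased precision, the sketch below computes it from relevance values in rank order, together with the residual that bounds how much the score could still rise from documents below the evaluated depth. The persistence value 0.8 and the function names are illustrative choices, not settings reported in this abstract.

```python
def rank_biased_precision(relevance, persistence=0.8):
    """RBP = (1 - p) * sum over ranks i of r_i * p**(i - 1)."""
    return (1.0 - persistence) * sum(
        rel * persistence ** i for i, rel in enumerate(relevance)
    )


def rbp_residual(depth, persistence=0.8):
    """Upper bound on the RBP mass contributed by ranks beyond `depth`."""
    return persistence ** depth
```

The residual shrinks geometrically with depth, which is what makes it possible to reason about how many further judgments could still change a comparison between two systems.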

Copyrights may apply

p. 383-390

Tsai, Ming-Feng, Liu, Tie-Yan, Qin, Tao, Chen, Hsinchun and Ma, Wei-Ying (2007): FRank: a ranking method with fidelity loss. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 383-390. Available online

The ranking problem is becoming important in many fields, especially in information retrieval (IR). Many machine learning techniques have been proposed for the ranking problem, such as RankSVM, RankBoost, and RankNet. Among them, RankNet, which is based on a probabilistic ranking framework, is leading to promising results and has been applied to a commercial Web search engine. In this paper we conduct further study on the probabilistic ranking framework and provide a novel loss function named fidelity loss for measuring loss of ranking. The fidelity loss not only inherits effective properties of the probabilistic ranking framework in RankNet, but also possesses new properties that are helpful for ranking. These include the ability of the fidelity loss to reach zero for each document pair, and a finite upper bound, which is necessary for conducting query-level normalization. We also propose an algorithm named FRank, based on a generalized additive model, for minimizing the fidelity loss and learning an effective ranking function. We evaluated the proposed algorithm on two datasets: a TREC dataset and a real Web search dataset. The experimental results show that the proposed FRank algorithm outperforms other learning-based ranking methods on both the conventional IR problem and Web search.
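
The two properties mentioned above (a loss that can reach zero for a pair, and a finite upper bound) can be checked on a small sketch of a fidelity-style pairwise loss. The formula used here, one minus the fidelity between the target and modelled pair probabilities with a logistic model of the score difference, is stated as an assumption about the loss rather than quoted from the abstract; the function names are illustrative.

```python
import math


def pair_probability(score_i, score_j):
    """Modelled probability that document i should rank above document j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def fidelity_loss(target_prob, score_i, score_j):
    """Assumed form: 1 - (sqrt(P* P) + sqrt((1 - P*)(1 - P)))."""
    p = pair_probability(score_i, score_j)
    return 1.0 - (
        math.sqrt(target_prob * p) + math.sqrt((1.0 - target_prob) * (1.0 - p))
    )
```

When the modelled probability equals the target the loss is zero, and even for a badly mis-ordered pair it never exceeds 1, in contrast to a cross-entropy loss, which grows without bound.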

Copyrights may apply

p. 39-46

Ma, Hao, King, Irwin and Lyu, Michael R. (2007): Effective missing data prediction for collaborative filtering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 39-46. Available online

Memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. Usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. This paper focuses on two crucial factors in memory-based collaborative filtering: (1) similarity computation between users or items and (2) missing data prediction algorithms. First, we enhance the Pearson Correlation Coefficient (PCC) algorithm by adding one parameter that overcomes the potential decrease in accuracy when computing the similarity of users or items. Second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. In this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm determines whether or not to predict the missing data. We also address how to predict the missing data by employing a combination of user and item information. Finally, empirical studies on the MovieLens dataset show that our newly proposed method outperforms other state-of-the-art collaborative filtering algorithms and is more robust against data sparsity.
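
One way a single extra parameter can temper the Pearson Correlation Coefficient is significance weighting: the raw similarity is scaled down when two users share only a few rated items. The sketch below shows that idea; whether it matches the paper's exact enhancement is an assumption, as are the parameter name `gamma`, its default value, and the use of each user's overall mean rating.

```python
import math


def weighted_pcc(ratings_a, ratings_u, gamma=30):
    """PCC over co-rated items, scaled by min(overlap, gamma) / gamma."""
    common = set(ratings_a) & set(ratings_u)
    if not common:
        return 0.0
    # Assumed convention: centre on each user's mean over all their ratings.
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    num = sum((ratings_a[i] - mean_a) * (ratings_u[i] - mean_u) for i in common)
    den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)) * math.sqrt(
        sum((ratings_u[i] - mean_u) ** 2 for i in common)
    )
    if den == 0:
        return 0.0
    return (min(len(common), gamma) / gamma) * (num / den)
```

Two users who agree perfectly on three films would have a raw PCC of 1, but with `gamma=30` their weighted similarity is only 0.1, reflecting how little evidence supports the agreement.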

Copyrights may apply

p. 391-398

Xu, Jun and Li, Hang (2007): AdaRank: a boosting algorithm for information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 391-398. Available online

In this paper we address the issue of learning to rank for document retrieval. In the task, a model is automatically created with some training data and then is utilized for ranking of documents. The goodness of a model is usually evaluated with performance measures such as MAP (Mean Average Precision) and NDCG (Normalized Discounted Cumulative Gain). Ideally a learning algorithm would train a ranking model that could directly optimize the performance measures with respect to the training data. Existing methods, however, are only able to train ranking models by minimizing loss functions loosely related to the performance measures. For example, Ranking SVM and RankBoost train ranking models by minimizing classification errors on instance pairs. To deal with the problem, we propose a novel learning algorithm within the framework of boosting, which can minimize a loss function directly defined on the performance measures. Our algorithm, referred to as AdaRank, repeatedly constructs 'weak rankers' on the basis of reweighted training data and finally linearly combines the weak rankers for making ranking predictions. We prove that the training process of AdaRank is exactly that of enhancing the performance measure used. Experimental results on four benchmark datasets show that AdaRank significantly outperforms the baseline methods of BM25, Ranking SVM, and RankBoost.
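
The boosting loop described in this abstract can be sketched roughly as follows: weak rankers are single features, each round picks the feature with the best weighted performance, and queries on which the combined model performs poorly get more weight in the next round. The specific weighting and update formulas, the use of average precision as the performance measure, the small constant guarding against division by zero, and all names are assumptions made for this sketch.

```python
import math


def query_ap(weights, query):
    """Average precision of the ranking a linear model induces on one query."""
    ranked = sorted(query, key=lambda d: -sum(w * x for w, x in zip(weights, d[0])))
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def adarank(queries, num_features, rounds=5):
    """queries: list of queries, each a list of (feature_vector, 0/1 label)."""
    m = len(queries)
    dist = [1.0 / m] * m  # weight of each training query
    weights = [0.0] * num_features
    # Per-query performance of each single-feature weak ranker.
    weak_perf = []
    for k in range(num_features):
        unit = [1.0 if i == k else 0.0 for i in range(num_features)]
        weak_perf.append([query_ap(unit, q) for q in queries])
    for _ in range(rounds):
        # Choose the weak ranker with the best weighted performance.
        best = max(
            range(num_features),
            key=lambda k: sum(p * e for p, e in zip(dist, weak_perf[k])),
        )
        perf = weak_perf[best]
        num = sum(p * (1.0 + e) for p, e in zip(dist, perf))
        den = sum(p * (1.0 - e) for p, e in zip(dist, perf))
        alpha = 0.5 * math.log(num / max(den, 1e-12))  # guard is an assumption
        weights[best] += alpha
        # Re-weight queries: poorly ranked queries gain weight.
        combined = [query_ap(weights, q) for q in queries]
        z = sum(math.exp(-c) for c in combined)
        dist = [math.exp(-c) / z for c in combined]
    return weights
```

The returned list holds one coefficient per feature, so ranking a new query amounts to sorting its documents by the weighted sum of their features.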

Copyrights may apply

p. 399-406

Almeida, Humberto Mossri de, Goncalves, Marcos Andre, Cristo, Marco and Calado, Pavel (2007): A combined component approach for finding collection-adapted ranking functions based on genetic programming. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 399-406. Available online

In this paper, we propose a new method to discover collection-adapted ranking functions based on Genetic Programming (GP). Our Combined Component Approach (CCA) is based on the combination of several term-weighting components (i.e., term frequency, collection frequency, normalization) extracted from well-known ranking functions. In contrast to related work, the GP terminals in our CCA are not based on simple statistical information of a document collection, but on meaningful, effective, and proven components. Experimental results show that our approach was able to outperform standard TF-IDF, BM25 and another GP-based approach in two different collections. CCA obtained improvements in mean average precision up to 40.87% for the TREC-8 collection, and 24.85% for the WBR99 collection (a large Brazilian Web collection), over the baseline functions. The CCA evolution process also was able to reduce the over-training, commonly found in machine learning methods, especially genetic programming, and to converge faster than the other GP-based approach used for comparison.

Copyrights may apply

p. 407-414

Geng, Xiubo, Liu, Tie-Yan, Qin, Tao and Li, Hang (2007): Feature selection for ranking. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 407-414. Available online

Ranking is a very important topic in information retrieval. While algorithms for learning ranking models have been intensively studied, this is not the case for feature selection, despite its importance. The reality is that many feature selection methods used in classification are directly applied to ranking. We argue that because of the striking differences between ranking and classification, it is better to develop different feature selection methods for ranking. To this end, we propose a new feature selection method in this paper. Specifically, for each feature we use its value to rank the training instances, and define the ranking accuracy in terms of a performance measure or a loss function as the importance of the feature. We also define the correlation between the ranking results of two features as the similarity between them. Based on the definitions, we formulate the feature selection issue as an optimization problem whose goal is to find the features with maximum total importance scores and minimum total similarity scores. We also demonstrate how to solve the optimization problem in an efficient way. We have tested the effectiveness of our feature selection method on two information retrieval datasets and with two ranking models. Experimental results show that our method can outperform traditional feature selection methods for the ranking task.

Copyrights may apply

p. 415-422

Sculley, D. and Wachman, Gabriel M. (2007): Relaxed online SVMs for spam filtering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 415-422. Available online

Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Content-based filtering is one reliable method of combating this threat in its various forms, but some academic researchers and industrial practitioners disagree on how best to filter spam. The former have advocated the use of Support Vector Machines (SVMs) for content-based filtering, as this machine learning methodology gives state-of-the-art performance for text classification. However, similar performance gains have yet to be demonstrated for online spam filtering. Additionally, practitioners cite the high cost of SVMs as reason to prefer faster (if less statistically robust) Bayesian methods. In this paper, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Our results are experimentally verified on email spam, blog spam, and splog detection tasks.

p. 423-430

Castillo, Carlos, Donato, Debora, Gionis, Aristides, Murdock, Vanessa and Silvestri, Fabrizio (2007): Know your neighbors: web spam detection using the web topology. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 423-430. Available online

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.

p. 431-438

Yang, Haixuan, King, Irwin and Lyu, Michael R. (2007): DiffusionRank: a possible penicillin for web spamming. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 431-438. Available online

While the PageRank algorithm has proven to be very effective for ranking Web pages, the rank scores of Web pages can be manipulated. To handle the manipulation problem and to shed new light on the Web structure, we propose a ranking algorithm called DiffusionRank. DiffusionRank is motivated by the heat diffusion phenomenon, which can be connected to Web ranking because the flow of activity on the Web can be imagined as heat flow, the link from one page to another can be treated as the pipe of an air-conditioner, and heat flow can embody the structure of the underlying Web graph. Theoretically we show that DiffusionRank can serve as a generalization of PageRank when the heat diffusion coefficient γ tends to infinity. In such a case (1/γ = 0), DiffusionRank (PageRank) has a low ability to resist manipulation. When γ = 0, DiffusionRank obtains the highest ability to resist manipulation, but in such a case the Web structure is completely ignored. Consequently, γ is an interesting factor that can control the balance between the ability to preserve the original Web structure and the ability to reduce the effect of manipulation. It is found empirically that, when γ = 1, DiffusionRank has a penicillin-like effect on link manipulation. Moreover, DiffusionRank can be employed to find group-to-group relations on the Web, to divide the Web graph into several parts, and to find link communities. Experimental results show that the DiffusionRank algorithm achieves the above-mentioned advantages as expected.

p. 439-446

Turnbull, Douglas, Barrington, Luke, Torres, David and Lanckriet, Gert (2007): Towards musical query-by-semantic-description using the CAL500 data set. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 439-446. Available online

Query-by-semantic-description (QBSD) is a natural paradigm for retrieving content from large databases of music. A major impediment to the development of good QBSD systems for music information retrieval has been the lack of a cleanly-labeled, publicly-available, heterogeneous data set of songs and associated annotations. We have collected the Computer Audition Lab 500-song (CAL500) data set by having humans listen to and annotate songs using a survey designed to capture 'semantic associations' between music and words. We adapt the supervised multi-class labeling (SML) model, which has shown good performance on the task of image retrieval, and use the CAL500 data to learn a model for music retrieval. The model parameters are estimated using the weighted mixture hierarchies expectation-maximization algorithm, which has been specifically designed to handle real-valued semantic associations between words and songs, rather than binary class labels. The output of the SML model, a vector of class-conditional probabilities, can be interpreted as a semantic multinomial distribution over a vocabulary. By also representing a semantic query as a query multinomial distribution, we can quickly rank-order the songs in a database based on the Kullback-Leibler divergence between the query multinomial and each song's semantic multinomial. Qualitative and quantitative results demonstrate that our SML model can both annotate a novel song with meaningful words and retrieve relevant songs given a multi-word, text-based query.
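
The ranking step at the end of this abstract can be sketched in a few lines. The sketch assumes that the divergence is taken from the query multinomial to each song's multinomial (the abstract does not state the direction), and the three-word vocabulary, the song names, and the probabilities are made up for illustration.

```python
import math


def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) for two discrete distributions over the same vocabulary."""
    # Terms with q_i == 0 contribute nothing; eps guards against p_i == 0.
    return sum(qi * math.log(qi / max(pi, eps)) for qi, pi in zip(q, p) if qi > 0)


def rank_songs(query, songs):
    """Return song names ordered from smallest to largest divergence."""
    return sorted(songs, key=lambda name: kl_divergence(query, songs[name]))


# Hypothetical 3-word vocabulary, e.g. ("mellow", "acoustic", "aggressive").
query = [0.5, 0.5, 0.0]
songs = {
    "a": [0.45, 0.45, 0.10],
    "b": [0.10, 0.10, 0.80],
    "c": [0.70, 0.20, 0.10],
}
order = rank_songs(query, songs)
```

Song "a" comes first because its semantic multinomial is closest to the query distribution.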

p. 447-454

Knees, Peter, Pohle, Tim, Schedl, Markus and Widmer, Gerhard (2007): A music search engine built upon audio-based and web-based similarity measures. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 447-454. Available online

An approach is presented to automatically build a search engine for large-scale music collections that can be queried through natural language. While existing approaches depend on explicit manual annotations and meta-data assigned to the individual audio pieces, we automatically derive descriptions by making use of methods from Web Retrieval and Music Information Retrieval. Based on the ID3 tags of a collection of mp3 files, we retrieve relevant Web pages via Google queries and use the contents of these pages to characterize the music pieces and represent them by term vectors. By incorporating complementary information about acoustic similarity we are able to both reduce the dimensionality of the vector space and improve the performance of retrieval, i.e. the quality of the results. Furthermore, the usage of audio similarity allows us to also characterize audio pieces when there is no associated information found on the Web.

p. 455-462

Azzopardi, Leif, Rijke, Maarten de and Balog, Krisztian (2007): Building simulated queries for known-item topics: an analysis using six european languages. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 455-462. Available online

There has been increased interest in the use of simulated queries for evaluation and estimation purposes in Information Retrieval. However, there are still many unaddressed issues regarding their usage and impact on evaluation because their quality, in terms of retrieval performance, is unlike real queries. In this paper, we focus on methods for building simulated known-item topics and explore their quality against real known-item topics. Using existing generation models as our starting point, we explore factors which may influence the generation of the known-item topic. Informed by this detailed analysis (on six European languages) we propose a model with improved document and term selection properties, showing that simulated known-item topics can be generated that are comparable to real known-item topics. This is a significant step towards validating the potential usefulness of simulated queries: for evaluation purposes, and because building models of querying behavior provides a deeper insight into the querying process so that better retrieval mechanisms can be developed to support the user.

p. 463-470

Gao, Wei, Niu, Cheng, Nie, Jian-Yun, Zhou, Ming, Hu, Jian, Wong, Kam-Fai and Hon, Hsiao-Wuen (2007): Cross-lingual query suggestion using query logs of different languages. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 463-470. Available online

Query suggestion aims to suggest relevant queries for a given query, which help users better specify their information needs. Previously, the suggested terms have mostly been in the same language as the input query. In this paper, we extend it to cross-lingual query suggestion (CLQS): for a query in one language, we suggest similar or relevant queries in other languages. This is very important in scenarios such as cross-language information retrieval (CLIR) and cross-lingual keyword bidding for search engine advertisement. Instead of relying on existing query translation technologies for CLQS, we present an effective means to map the input query of one language to queries of the other language in the query log. Important monolingual and cross-lingual information, such as word translation relations and word co-occurrence statistics, is used to estimate the cross-lingual query similarity with a discriminative model. Benchmarks show that the resulting CLQS system significantly outperforms a baseline system based on dictionary-based query translation. In addition, the resulting CLQS is tested with French-to-English CLIR tasks on TREC collections. The results demonstrate higher effectiveness than the traditional query translation methods.

p. 47-54

Zhang, Yi and Koren, Jonathan (2007): Efficient Bayesian hierarchical user modeling for recommendation system. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 47-54. Available online

A content-based personalized recommendation system learns user specific profiles from user feedback so that it can deliver information tailored to each individual user's interest. A system serving millions of users can learn a better user profile for a new user, or a user with little feedback, by borrowing information from other users through the use of a Bayesian hierarchical model. Learning the model parameters to optimize the joint data likelihood from millions of users is very computationally expensive. The commonly used EM algorithm converges very slowly due to the sparseness of the data in IR applications. This paper proposes a new fast learning technique to learn a large number of individual user profiles. The efficacy and efficiency of the proposed algorithm are justified by theory and demonstrated on actual user data from Netflix and MovieLens.

p. 471-478

Najork, Marc A., Zaragoza, Hugo and Taylor, Michael J. (2007): Hits on the web: how does it compare? In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 471-478. Available online

This paper describes a large-scale evaluation of the effectiveness of HITS in comparison with other link-based ranking algorithms, when used in combination with a state-of-the-art text retrieval algorithm exploiting anchor text. We quantified their effectiveness using three common performance measures: the mean reciprocal rank, the mean average precision, and the normalized discounted cumulative gain measurements. The evaluation is based on two large data sets: a breadth-first search crawl of 463 million web pages containing 17.6 billion hyperlinks and referencing 2.9 billion distinct URLs; and a set of 28,043 queries sampled from a query log, each query having on average 2,383 results, about 17 of which were labeled by judges. We found that HITS outperforms PageRank, but is about as effective as web-page in-degree. The same holds true when any of the link-based features are combined with the text retrieval algorithm. Finally, we studied the relationship between query specificity and the effectiveness of selected features, and found that link-based features perform better for general queries, whereas BM25F performs better for specific queries.
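
For readers unfamiliar with the three measures named in this abstract, here is a minimal per-query sketch; the "mean" variants average these values over all queries. The exact formulations used in the paper are not given in the abstract, so the choices below are assumptions: average precision is normalised by the number of relevant documents found in the list, and the DCG gain is `2**g - 1` with a base-2 logarithmic discount.

```python
import math


def reciprocal_rank(rels):
    """1 / rank of the first relevant result (0 if none is relevant)."""
    for i, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / i
    return 0.0


def average_precision(rels):
    """Mean of precision@i over the positions i that hold a relevant result."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def ndcg(gains):
    """DCG of the list divided by the DCG of its ideal reordering."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gs, start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Binary relevance of one ranked result list (1 = relevant).
rels = [0, 1, 0, 1]
rr = reciprocal_rank(rels)
ap = average_precision(rels)
```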

p. 479-486

Mizzaro, Stefano and Robertson, Stephen (2007): Hits hits TREC: exploring IR evaluation results with network analysis. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 479-486. Available online

We propose a novel method of analysing data gathered from TREC or similar information retrieval evaluation experiments. We define two normalized versions of average precision, that we use to construct a weighted bipartite graph of TREC systems and topics. We analyze the meaning of well known -- and somewhat generalized -- indicators from social network analysis on the Systems-Topics graph. We apply this method to an analysis of TREC 8 data; among the results, we find that authority measures systems performance, that hubness of topics reveals that some topics are better than others at distinguishing more or less effective systems, that with current measures a system that wants to be effective in TREC needs to be effective on easy topics, and that by using different effectiveness measures this is no longer the case.

p. 487-494

Zhu, Shenghuo, Yu, Kai, Chi, Yun and Gong, Yihong (2007): Combining content and link for classification using matrix factorization. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 487-494. Available online

The world wide web contains rich textual contents that are interconnected via complex hyperlinks. This huge database violates the assumption held by most conventional statistical methods that each web page is an independent and identically distributed sample. It is thus difficult to apply traditional mining or learning methods for solving web mining problems, e.g., web page classification, by exploiting both the content and the link structure. Research in this direction has recently received considerable attention but is still at an early stage. Though a few methods exploit both the link structure and the content information, some of them combine only the authority information with the content information, and the others first decompose the link structure into hub and authority features, then apply them as additional document features. This paper aims to design an algorithm, practically attractive for its great simplicity, that exploits both the content and linkage information by carrying out a joint factorization on both the linkage adjacency matrix and the document-term matrix, and derives a new representation for web pages in a low-dimensional factor space, without explicitly separating them as content, hub or authority factors. Further analysis can be performed based on the compact representation of web pages. In the experiments, the proposed method is compared with state-of-the-art methods and demonstrates excellent accuracy in hypertext classification on the WebKB and Cora benchmarks.

p. 495-502

Shokouhi, Milad and Zobel, Justin (2007): Federated text retrieval from uncooperative overlapped collections. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 495-502. Available online

In federated text retrieval systems, the query is sent to multiple collections at the same time. The results returned by collections are gathered and ranked by a central broker that presents them to the user. It is usually assumed that the collections have little overlap. However, in practice collections may share many common documents as either exact or near duplicates, potentially leading to high numbers of duplicates in the final results. Considering the natural bandwidth restrictions and efficiency issues of federated search, sending queries to redundant collections leads to unnecessary costs. We propose a novel method for estimating the rate of overlap among collections based on sampling. Then, using the estimated overlap statistics, we propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results. We show experimentally that, although our estimates of overlap are inexact, our suggested techniques can significantly improve the search effectiveness when collections overlap.

p. 503-510

Thomas, Paul and Hawking, David (2007): Evaluating sampling methods for uncooperative collections. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 503-510. Available online

Many server selection methods suitable for distributed information retrieval applications rely, in the absence of cooperation, on the availability of unbiased samples of documents from the constituent collections. We describe a number of sampling methods which depend only on the normal query-response mechanism of the applicable search facilities. We evaluate these methods on a number of collections typical of a personal metasearch application. Results demonstrate that biases exist for all methods, particularly toward longer documents, and that in some cases these biases can be reduced but not eliminated by choice of parameters. We also introduce a new sampling technique, "multiple queries", which produces samples of similar quality to the best current techniques but with significantly reduced cost.

p. 511-518

Shokouhi, Milad, Baillie, Mark and Azzopardi, Leif (2007): Updating collection representations for federated search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 511-518. Available online

To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, no solution has yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time; however, adopting a suitable update policy can minimise this problem.

p. 519-526

Berberich, Klaus, Bedathur, Srikanta, Neumann, Thomas and Weikum, Gerhard (2007): A time machine for text search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 519-526. Available online

Text search over temporally versioned document collections such as web archives has received little attention as a research problem. As a consequence, there is no scalable and principled solution to search such a collection as of a specified time. In this work, we address this shortcoming and propose an efficient solution for time-travel text search by extending the inverted file index to make it ready for temporal search. We introduce approximate temporal coalescing as a tunable method to reduce the index size without significantly affecting the quality of results. In order to further improve the performance of time-travel queries, we introduce two principled techniques to trade off index size for its performance. These techniques can be formulated as optimization problems that can be solved to near-optimality. Finally, our approach is evaluated in a comprehensive series of experiments on two large-scale real-world datasets. Results unequivocally show that our methods make it possible to build an efficient "time machine" scalable to large versioned text collections.
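
The basic idea of extending an inverted file for "as of" queries can be illustrated by attaching a validity interval to every posting. This is a deliberately naive sketch, not the authors' index: the class name `TimeTravelIndex`, the half-open interval convention `[begin, end)`, and the linear scan at query time are all assumptions, and the paper's coalescing and size/performance trade-off techniques are not shown.

```python
from collections import defaultdict


class TimeTravelIndex:
    """Inverted index whose postings carry a validity interval [begin, end)."""

    def __init__(self):
        # term -> list of (doc_id, begin, end)
        self.postings = defaultdict(list)

    def add_version(self, doc_id, terms, begin, end):
        """Register one version of a document, valid from begin up to (not including) end."""
        for term in set(terms):
            self.postings[term].append((doc_id, begin, end))

    def search(self, term, as_of):
        """Return the documents whose version at time as_of contains term."""
        return sorted(
            {doc for doc, begin, end in self.postings.get(term, []) if begin <= as_of < end}
        )


idx = TimeTravelIndex()
idx.add_version("page", ["old", "news"], begin=0, end=10)
idx.add_version("page", ["fresh", "news"], begin=10, end=20)
hits_old_at_5 = idx.search("old", as_of=5)
hits_old_at_15 = idx.search("old", as_of=15)
```

Note that the term "news" is stored twice here even though it appears in both versions; merging such adjacent postings is the kind of redundancy that coalescing is meant to reduce.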

p. 527-534

Stein, Benno (2007): Principles of hash-based text retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 527-534. Available online

Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered similar if they are mapped to the same hash key. In terms of runtime performance this principle is unequaled -- while being unaffected by dimensionality concerns at the same time. Similarity hashing is applied with great success for near similarity search in large document collections, and it is considered a key technology for near-duplicate detection and plagiarism analysis. This paper reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyze the performance of hash-based search methods, and we explain the rationale of their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study between different hash-based search methods.
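
The "same hash key means similar" principle can be sketched with one well-known family of similarity hashes, sign random projections. This particular hash function is an illustrative choice and is not claimed to be one of the methods studied in the paper; `make_hash`, `build_index`, the seed, and the toy vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hash(dim, bits, seed=0):
    """Build a hash function mapping a vector to a tuple of `bits` sign bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def h(vec):
        # One bit per random hyperplane: which side of the plane the vector lies on.
        return tuple(int(sum(a * x for a, x in zip(p, vec)) >= 0.0) for p in planes)

    return h


def build_index(vectors, h):
    """Group document names by hash key; a lookup is a single dictionary access."""
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        buckets[h(vec)].append(name)
    return buckets


docs = {
    "d1": [1.0, 2.0, 0.5],
    "d2": [2.0, 4.0, 1.0],     # same direction as d1
    "d3": [-1.0, -2.0, -0.5],  # opposite direction
}
h = make_hash(dim=3, bits=8)
buckets = build_index(docs, h)
similar_to_d1 = sorted(buckets[h(docs["d1"])])
```

Vectors pointing in the same direction always share a key under this hash, while the opposite vector lands in a different bucket.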

p. 535-542

Ferragina, Paolo and Venturini, Rossano (2007): Compressed permuterm index. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 535-542. Available online

Recently [Manning et al., 2007] resorted to the Permuterm index of Garfield (1976) as a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wild-card symbol (called the Tolerant Retrieval problem). Unfortunately the Permuterm index is space inefficient because it quadruples the dictionary size. In this paper we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in optimal query time, i.e. time proportional to the length of the searched pattern, and space close to the k-th order empirical entropy of the indexed dictionary. Our index can also be used to solve more sophisticated queries which involve several wild-card symbols, or which require prefix-matching multiple fields in a database of records. The result is based on an elegant variant of the Burrows-Wheeler Transform defined on a dictionary of strings of variable length, which allows known compressed indexes [Makinen-Navarro, 2007] to be easily adapted to solve the Tolerant Retrieval problem. Experiments show that our index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip, bzip or ppmdi. This improves on known approaches based on front-coding by more than 50% in absolute space occupancy, while still guaranteeing comparable query time.
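
To see why the plain Permuterm index is space hungry, here is a sketch of the classic uncompressed version, not the compressed index proposed in the paper. Every term is terminated with `$` and all of its rotations are stored; a query `X*Y` is rotated to `Y$X` and answered by a prefix search. The function names and the sorted-list representation are assumptions made for illustration.

```python
import bisect


def build_permuterm(terms):
    """Return a sorted list of (rotation, term) for every rotation of term + '$'."""
    entries = []
    for term in terms:
        t = term + "$"
        for i in range(len(t)):
            entries.append((t[i:] + t[:i], term))
    entries.sort()
    return entries


def wildcard_lookup(index, pattern):
    """Answer a pattern containing exactly one '*' by prefix search on rotations."""
    prefix, suffix = pattern.split("*")
    key = suffix + "$" + prefix
    # (key,) sorts before every (key..., term) entry, so this finds the first candidate.
    lo = bisect.bisect_left(index, (key,))
    matches = set()
    for rotation, term in index[lo:]:
        if not rotation.startswith(key):
            break
        matches.add(term)
    return sorted(matches)


index = build_permuterm(["hello", "help", "hollow", "yellow"])
inner = wildcard_lookup(index, "h*o")
leading = wildcard_lookup(index, "*llow")
trailing = wildcard_lookup(index, "hel*")
```

A term of length n contributes n + 1 rotations, which is the redundancy a compressed representation has to remove.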

p. 543-550

Zhou, Yun and Croft, W. Bruce (2007): Query performance prediction in web search environments. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 543-550. Available online

Current prediction techniques, which are generally designed for content-based queries and are typically evaluated on relatively homogeneous test collections of small sizes, face serious challenges in web search environments where collections are significantly more heterogeneous and different types of retrieval tasks exist. In this paper, we present three techniques to address these challenges. We focus on performance prediction for two types of queries in web search environments: content-based and Named-Page finding. Our evaluation is mainly performed on the GOV2 collection. In addition to evaluating our models for the two types of queries separately, we consider a more challenging and realistic situation in which the two types of queries are mixed together without prior information on query types. To assist prediction under the mixed-query situation, a novel query classifier is adopted. Results show that our prediction of web query performance is substantially more accurate than the current state-of-the-art prediction techniques. Consequently, our paper provides a practical approach to performance prediction in real-world web settings.

p. 55-62

Carterette, Ben (2007): Robust test collections for retrieval evaluation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 55-62. Available online

Low-cost methods for acquiring relevance judgments can be a boon to researchers who need to evaluate new retrieval tasks or topics but do not have the resources to make thousands of judgments. While these judgments are very useful for a one-time evaluation, it is not clear that they can be trusted when re-used to evaluate new systems. In this work, we formally define what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments. We then present a method for augmenting a set of relevance judgments with relevance estimates that require no additional assessor effort. Using this method practically guarantees reusability: with as few as five judgments per topic taken from only two systems, we can reliably evaluate a larger set of ten systems. Even the smallest sets of judgments can be useful for evaluation of new systems.

p. 551-558

Balog, Krisztian, Bogers, Toine, Azzopardi, Leif, Rijke, Maarten de and Bosch, Antal van den (2007): Broad expertise retrieval in sparse data environments. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 551-558. Available online

Expertise retrieval has been largely unexplored on data other than the W3C collection. At the same time, many intranets of universities and other knowledge-intensive organisations offer examples of relatively small but clean multilingual expertise data, covering broad ranges of expertise areas. We first present two main expertise retrieval tasks, along with a set of baseline approaches based on generative language modeling, aimed at finding expertise relations between topics and people. For our experimental evaluation, we introduce (and release) a new test set based on a crawl of a university site. Using this test set, we conduct two series of experiments. The first is aimed at determining the effectiveness of baseline expertise retrieval methods applied to the new test set. The second is aimed at assessing refined models that exploit characteristic features of the new test set, such as the organizational structure of the university, and the hierarchical structure of the topics in the test set. Expertise retrieval models are shown to be robust with respect to environments smaller than the W3C collection, and current techniques appear to be generalizable to other settings.

p. 559-566

Broder, Andrei, Fontoura, Marcus, Josifovski, Vanja and Riedel, Lance (2007): A semantic approach to contextual advertising. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 559-566. Available online

Contextual advertising or Context Match (CM) refers to the placement of commercial textual advertisements within the content of a generic web page, while Sponsored Search (SS) advertising consists in placing ads on result pages from a web search engine, with ads driven by the originating query. In CM there is usually an intermediary commercial ad-network entity in charge of optimizing the ad selection with the twin goal of increasing revenue (shared between the publisher and the ad-network) and improving the user experience. With these goals in mind it is preferable to have ads relevant to the page content, rather than generic ads. The SS market developed quicker than the CM market, and most textual ads are still characterized by "bid phrases" representing those queries where the advertisers would like to have their ad displayed. Hence, the first technologies for CM have relied on previous solutions for SS, by simply extracting one or more phrases from the given page content, and displaying ads corresponding to searches on these phrases, in a purely syntactic approach. However, due to the vagaries of phrase extraction, and the lack of context, this approach leads to many irrelevant ads. To overcome this problem, we propose a system for contextual ad matching based on a combination of semantic and syntactic features.

p. 567-574

Huffman, Scott B. and Hochster, Michael (2007): How well does result relevance predict session satisfaction? In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 567-574. Available online

Per-query relevance measures provide standardized, repeatable measurements of search result quality, but they ignore much of what users actually experience in a full search session. This paper examines how well we can approximate a user's ultimate session-level satisfaction using a simple relevance metric. We find that this relationship is surprisingly strong. By incorporating additional properties of the query itself, we construct a model which predicts user satisfaction even more accurately than relevance alone.

p. 575-582

Custis, Tonya and Al-Kofahi, Khalid (2007): A new approach for evaluating query expansion: query-document term mismatch. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 575-582. Available online

The effectiveness of information retrieval (IR) systems is influenced by the degree of term overlap between user queries and relevant documents. Query-document term mismatch, whether partial or total, is a fact that must be dealt with by IR systems. Query Expansion (QE) is one method for dealing with term mismatch. IR systems implementing query expansion are typically evaluated by executing each query twice, with and without query expansion, and then comparing the two result sets. While this measures an overall change in performance, it does not directly measure the effectiveness of IR systems in overcoming the inherent issue of term mismatch between the query and relevant documents, nor does it provide any insight into how such systems would behave in the presence of query-document term mismatch. In this paper, we propose a new approach for evaluating query expansion techniques. The proposed approach is attractive because it provides an estimate of system performance under varying degrees of query-document term mismatch, it makes use of readily available test collections, and it does not require any additional relevance judgments or any form of manual processing.

p. 583-590

Diaz, Fernando (2007): Performance prediction using spatial autocorrelation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 583-590. Available online

Evaluation of information retrieval systems is one of the core tasks in information retrieval. Problems include the inability to exhaustively label all documents for a topic, generalizability from a small number of topics, and incorporating the variability of retrieval systems. Previous work addresses the evaluation of systems, the ranking of queries by difficulty, and the ranking of individual retrievals by performance. Approaches exist for the case of few and even no relevance judgments. Our focus is on zero-judgment performance prediction of individual retrievals. One common shortcoming of previous techniques is the assumption of uncorrelated document scores and judgments. If documents are embedded in a high-dimensional space (as they often are), we can apply techniques from spatial data analysis to detect correlations between document scores. We find that low correlation between scores of topically close documents often implies poor retrieval performance. When compared to a state-of-the-art baseline, we demonstrate that the spatial analysis of retrieval scores provides significantly better prediction performance. These new predictors can also be combined with classic predictors to improve performance further. We also describe the first large-scale experiment to evaluate zero-judgment performance prediction for a massive number of retrieval systems over a variety of collections in several languages.

Copyrights may apply

p. 591-598

Farah, Mohamed and Vanderpooten, Daniel (2007): An outranking approach for rank aggregation in information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 591-598. Available online

Research in Information Retrieval usually shows performance improvement when many sources of evidence are combined to produce a ranking of documents (e.g., texts, pictures, sounds, etc.). In this paper, we focus on the rank aggregation problem, also called the data fusion problem, where rankings of documents, retrieved from the same collection and provided by multiple methods, are combined in order to produce a new ranking. In this context, we propose a rank aggregation method within a multiple criteria framework using aggregation mechanisms based on decision rules identifying positive and negative reasons for judging whether a document should get a better rank than another. We show that the proposed method deals well with the distinctive features of Information Retrieval. Experimental results are reported showing that the suggested method performs better than the well-known CombSUM and CombMNZ operators.
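
For readers unfamiliar with the two baseline operators named in this abstract, the following is a minimal sketch of CombSUM and CombMNZ as they are commonly described, not of the outranking method the paper proposes. It assumes each run is a dictionary mapping document IDs to raw scores, that scores are min-max normalized per run, and that CombMNZ counts every run that retrieved a document; the function names are illustrative only.

```python
from collections import defaultdict


def min_max_normalize(run):
    """Scale one run's scores to [0, 1]; a constant run maps every score to 1.0."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs):
    """CombSUM: sum each document's normalized scores over all runs."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    return dict(fused)


def comb_mnz(runs):
    """CombMNZ: CombSUM score times the number of runs that retrieved the document."""
    sums = comb_sum(runs)
    hits = defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: sums[doc] * hits[doc] for doc in sums}


def ranking(scores):
    """Order documents by descending fused score, breaking ties by document ID."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

Both operators work on scores, so they depend on the normalization chosen; a rank-based aggregation method such as the one in the abstract avoids that dependency.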

Copyrights may apply

p. 599-606

Troy, Adam D. and Zhang, Guo-Qiang (2007): Enhancing relevance scoring with chronological term rank. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 599-606. Available online

We introduce a new relevance scoring technique that enhances existing relevance scoring schemes with term position information. This technique uses chronological term rank (CTR), which captures the positions of terms as they occur in the sequence of words in a document. CTR is both conceptually and computationally simple when compared to other approaches that use document structure information, such as term proximity, term order and document features. CTR works well when paired with Okapi BM25. We evaluate the performance of various combinations of CTR with Okapi BM25 in order to identify the most effective formula. We then compare the performance of the selected approach against the performance of existing methods such as Okapi BM25, pivoted length normalization and language models. Significant improvements are seen consistently across a variety of TREC data and topic sets, measured by the major retrieval performance metrics. This seems to be the first use of this statistic for relevance scoring. Greater retrieval improvements are likely to be possible using chronological term rank enhanced methods in future work.

Copyrights may apply

p. 607-614

Liu, Yang, Huang, Xiangji, An, Aijun and Yu, Xiaohui (2007): ARSA: a sentiment-aware model for predicting sales performance using blogs. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 607-614. Available online

Due to their high popularity, weblogs (or blogs for short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. In this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. Based on an analysis of the complex nature of sentiments, we propose Sentiment PLSA (S-PLSA), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. Training an S-PLSA model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. We then present ARSA, an autoregressive sentiment-aware model, to utilize the sentiment information captured by S-PLSA for predicting product sales performance. Extensive experiments were conducted on a movie data set. We compare ARSA with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. Experiments confirm the effectiveness and superiority of the proposed approach.

Copyrights may apply

p. 615-622

Mamou, Jonathan, Ramabhadran, Bhuvana and Siohan, Olivier (2007): Vocabulary independent spoken term detection. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 615-622. Available online

We are interested in retrieving information from speech data like broadcast news, telephone conversations and roundtable meetings. Today, most systems use large vocabulary continuous speech recognition tools to produce word transcripts; the transcripts are indexed and query terms are retrieved from the index. However, query terms that are not part of the recognizer's vocabulary cannot be retrieved, and the recall of the search is affected. In addition to the output word transcript, advanced systems also provide phonetic transcripts, against which query terms can be matched phonetically. Such phonetic transcripts suffer from lower accuracy and cannot be an alternative to word transcripts. We present a vocabulary independent system that can handle arbitrary queries, exploiting the information provided by having both word transcripts and phonetic transcripts. A speech recognizer generates word confusion networks and phonetic lattices. The transcripts are indexed for query processing and ranking purposes. The value of the proposed method is demonstrated by the relatively high performance of our system, which received the highest overall ranking for US English speech data in the recent NIST Spoken Term Detection evaluation.

Copyrights may apply

p. 623-630

Olsson, J. Scott and Oard, Douglas W. (2007): Improving text classification for oral history archives with temporal domain knowledge. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 623-630. Available online

This paper describes two new techniques for increasing the accuracy of topic label assignment to conversational speech from oral history interviews using supervised machine learning in conjunction with automatic speech recognition. The first, time-shifted classification, leverages local sequence information from the order in which the story is told. The second, temporal label weighting, takes the complementary perspective by using the position within an interview to bias label assignment probabilities. These methods, when used in combination, yield between 6% and 15% relative improvements in classification accuracy using a clipped R-precision measure that models the utility of label sets as segment summaries in interactive speech retrieval applications.

Copyrights may apply

p. 63-70

Büttcher, Stefan, Clarke, Charles L. A., Yeung, Peter C. K. and Soboroff, Ian (2007): Reliable information retrieval evaluation with incomplete and biased judgements. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 63-70. Available online

Information retrieval evaluation based on the pooling method is inherently biased against systems that did not contribute to the pool of judged documents. This may distort the results obtained about the relative quality of the systems evaluated and thus lead to incorrect conclusions about the performance of a particular ranking technique. We examine the magnitude of this effect and explore how it can be countered by automatically building an unbiased set of judgements from the original, biased judgements obtained through pooling. We compare the performance of this method with other approaches to the problem of incomplete judgements, such as bpref, and show that the proposed method leads to higher evaluation accuracy, especially if the set of manual judgements is rich in documents, but highly biased against some systems.

Copyrights may apply

p. 631-638

Turunen, Ville T. and Kurimo, Mikko (2007): Indexing confusion networks for morph-based spoken document retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 631-638. Available online

In this paper, we investigate methods for improving the performance of morph-based spoken document retrieval in Finnish by extracting relevant index terms from confusion networks. Our approach uses morpheme-like subword units ("morphs") for recognition and indexing. This alleviates the problem of out-of-vocabulary words, especially with inflectional languages like Finnish. Confusion networks offer a convenient representation of alternative recognition candidates by aligning mutually exclusive terms and by giving the posterior probability of each term. The rank of the competing terms and their posterior probabilities are used to estimate term frequency for indexing. Comparing against 1-best recognizer transcripts, we show that retrieval effectiveness is significantly improved. Finally, the effect of pruning in recognition is analyzed, showing that when recognition speed is increased, the reduction in retrieval performance due to the increase in the 1-best error rate can be compensated by using confusion networks.

Copyrights may apply

p. 639-646

Peng, Fuchun, Ahmed, Nawaaz, Li, Xin and Lu, Yumao (2007): Context sensitive stemming for web search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 639-646. Available online

Traditionally, stemming has been applied to Information Retrieval tasks by transforming words in documents to their root form before indexing, and applying a similar transformation to query terms. Although it increases recall, this naive strategy does not work well for Web Search since it lowers precision and requires a significant amount of additional computation. In this paper, we propose a context sensitive stemming method that addresses these two issues. Two unique properties make our approach feasible for Web Search. First, based on statistical language modeling, we perform context sensitive analysis on the query side. We accurately predict which of its morphological variants is useful to expand a query term with before submitting the query to the search engine. This dramatically reduces the number of bad expansions, which in turn reduces the cost of additional computation and improves the precision at the same time. Second, our approach performs a context sensitive document matching for those expanded variants. This conservative strategy serves as a safeguard against spurious stemming, and it turns out to be very important for improving precision. Using word pluralization handling as an example of our stemming approach, our experiments on a major Web search engine show that, by stemming only 29% of the query traffic, we can improve relevance as measured by average Discounted Cumulative Gain (DCG5) by 6.1% on these queries and 1.8% over all query traffic.
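
The DCG5 figure quoted in this abstract is Discounted Cumulative Gain truncated at rank 5. The sketch below shows one common formulation, in which the gain at rank i is divided by log2(i + 1); the abstract does not state which gain values or discount the authors used, so both are assumptions here, and the function names are illustrative.

```python
import math


def dcg_at_k(gains, k=5):
    """DCG over the top-k results: the gain at rank i is discounted by log2(i + 1)."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def mean_dcg_at_k(gain_lists, k=5):
    """Average DCG@k over a set of queries, one list of graded gains per query."""
    if not gain_lists:
        return 0.0
    return sum(dcg_at_k(gains, k) for gains in gain_lists) / len(gain_lists)
```

A relative improvement such as the reported 6.1% would then be the difference between the mean DCG5 with and without stemming, divided by the mean DCG5 without stemming.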

Copyrights may apply

p. 647-654

Li, Wenjie, Qian, Donglei, Lu, Qin and Yuan, Chunfa (2007): Detecting, categorizing and clustering entity mentions in Chinese text. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 647-654. Available online

The work presented in this paper is motivated by the practical need for content extraction, and the available data source and evaluation benchmark from the ACE program. The Chinese Entity Detection and Recognition (EDR) task is of particular interest to us. This task presents us with several language-independent and language-dependent challenges, e.g. arising from the complexity of extraction targets and the problem of word segmentation. In this paper, we propose a novel solution to alleviate the problems specific to the task. Mention detection takes advantage of machine learning approaches and character-based models. It handles the different types of entities being mentioned and the different constituent units (i.e. extents and heads) separately. Mentions referring to the same entity are linked together by integrating most-specific-first and closest-first rule based pairwise clustering algorithms. Types of mentions and entities are determined by head-driven classification approaches. The implemented system achieves an ACE value of 66.1 when evaluated on the EDR 2005 Chinese corpus, which is one of the top-tier results. Alternative approaches to mention detection and clustering are also discussed and analyzed.

Copyrights may apply

p. 655-662

Zhou, Wei, Yu, Clement, Smalheiser, Neil, Torvik, Vetle and Hong, Jie (2007): Knowledge-intensive conceptual retrieval and passage extraction of biomedical literature. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 655-662. Available online

This paper presents a study of incorporating domain-specific knowledge (i.e., information about concepts and relationships between concepts in a certain domain) in an information retrieval (IR) system to improve its effectiveness in retrieving biomedical literature. The effects of different types of domain-specific knowledge in performance contribution are examined. Based on the TREC platform, we show that appropriate use of domain-specific knowledge in a proposed conceptual retrieval model yields about 23% improvement over the best reported result in passage retrieval in the Genomics Track of TREC 2006.

Copyrights may apply

p. 663-670

Chaudhuri, Surajit, Church, Kenneth, König, Arnd Christian and Sui, Liying (2007): Heavy-tailed distributions and multi-keyword queries. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 663-670. Available online

Intersecting inverted indexes is a fundamental operation for many applications in information retrieval and databases. Efficient indexing for this operation is known to be a hard problem for arbitrary data distributions. However, text corpora used in Information Retrieval applications often have convenient power-law constraints (also known as Zipf's Law and long tails) that allow us to materialize carefully chosen combinations of multi-keyword indexes, which significantly improve worst-case performance without requiring excessive storage. These multi-keyword indexes limit the number of postings accessed when computing arbitrary index intersections. Our evaluation on an e-commerce collection of 20 million products shows that the indexes of up to four arbitrary keywords can be intersected while accessing less than 20% of the postings in the largest single-keyword index.

Copyrights may apply

p. 671-678

Bast, Holger, Chitea, Alexandru, Suchanek, Fabian and Weber, Ingmar (2007): ESTER: efficient search on text, entities, and relations. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 671-678. Available online

We present ESTER, a modular and highly efficient system for combined full-text and ontology search. ESTER builds on a query engine that supports two basic operations: prefix search and join. Both of these can be implemented very efficiently with a compact index, yet in combination provide powerful querying capabilities. We show how ESTER can answer basic SPARQL graph-pattern queries on the ontology by reducing them to a small number of these two basic operations. ESTER further supports a natural blend of such semantic queries with ordinary full-text queries. Moreover, the prefix search operation allows for a fully interactive and proactive user interface, which after every keystroke suggests to the user possible semantic interpretations of his or her query, and speculatively executes the most likely of these interpretations. As a proof of concept, we applied ESTER to the English Wikipedia, which contains about 3 million documents, combined with the recent YAGO ontology, which contains about 2.5 million facts. For a variety of complex queries, ESTER achieves worst-case query processing times of a fraction of a second, on a single machine, with an index size of about 4 GB.

Copyrights may apply

p. 679-686

Skobeltsyn, Gleb, Luu, Toan, Zarko, Ivana Podnar, Rajman, Martin and Aberer, Karl (2007): Web text retrieval with a P2P query-driven index. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 679-686. Available online

In this paper, we present a query-driven indexing/retrieval strategy for efficient full text retrieval from large document collections distributed within a structured P2P network. Our indexing strategy is based on two important properties: (1) the generated distributed index stores posting lists for carefully chosen indexing term combinations, and (2) the posting lists containing too many document references are truncated to a bounded number of their top-ranked elements. These two properties guarantee acceptable storage and bandwidth requirements, essentially because the number of indexing term combinations remains scalable and the transmitted posting lists never exceed a constant size. However, as the number of generated term combinations can still become quite large, we also use term statistics extracted from available query logs to index only such combinations that are frequently present in user queries. Thus, by avoiding the generation of superfluous indexing term combinations, we achieve an additional substantial reduction in bandwidth and storage consumption. As a result, the generated distributed index corresponds to a constantly evolving query-driven indexing structure that efficiently follows current information needs of the users. More precisely, our theoretical analysis and experimental results indicate that, at the price of a marginal loss in retrieval quality for rare queries, the generated index size and network traffic remain manageable even for web-size document collections. Furthermore, our experiments show that at the same time the achieved retrieval quality is fully comparable to the one obtained with a state-of-the-art centralized query engine.

Copyrights may apply

p. 687-688

Metzler, Donald (2007): Using gradient descent to optimize language modeling smoothing parameters. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 687-688. Available online

p. 689-690

Hu, Jiani, Deng, Weihong, Guo, Jun and Xu, Weiran (2007): Locality discriminating indexing for document classification. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 689-690. Available online

This paper introduces a locality discriminating indexing (LDI) algorithm for document classification. Based on the hypothesis that samples from different classes reside in class-specific manifold structures, LDI seeks a projection which best preserves the within-class local structures while suppressing the between-class overlap. Comparative experiments show that the proposed method is able to derive compact discriminating document representations for classification.

Copyrights may apply

p. 691-692

Kettunen, Kimmo (2007): Management of keyword variation with frequency based generation of word forms in IR. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 691-692. Available online

This paper presents a new management method for morphological variation of keywords. The method is called FCG, Frequent Case Generation. It is based on the skewed distributions of word forms in natural languages and is suitable for languages that either have a fair amount of morphological variation or are morphologically very rich. The proposed method has been evaluated so far with four languages, Finnish, Swedish, German and Russian, which show varying degrees of morphological complexity.

Copyrights may apply

p. 693-694

Wan, Xiaojun (2007): OMES: a new evaluation strategy using optimal matching for document clustering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 693-694. Available online

Existing measures for evaluating clustering results (e.g. F-measure) have the limitation of overestimating cluster quality because they usually adopt greedy matching between classes (reference clusters) and clusters (system clusters), which allows multiple classes to correspond to the same cluster and is in fact a locally optimal solution. This paper proposes a new evaluation strategy to overcome the limitation of existing evaluation measures by using optimal matching in graph theory. A weighted bipartite graph is built with classes and clusters as two disjoint sets of vertices, and the edge weight between any class and any cluster is computed using a basic metric. Then the total weight of the optimal matching in the graph is acquired and we use it to evaluate the quality of the clusters. The optimal matching allows only one-to-one matching between classes and clusters, and a globally optimal solution can be achieved. A preliminary study is performed to demonstrate the effectiveness of the proposed evaluation strategy.
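
The contrast drawn in this abstract between greedy and optimal matching can be illustrated with a small sketch. It assumes a weight matrix whose entry [i][j] is some basic metric (for example an F-measure) between class i and cluster j, and it finds the optimal one-to-one matching by brute force over permutations, which is only practical for small matrices; the paper's actual matching algorithm and normalization are not given in the abstract, and the function names are illustrative.

```python
from itertools import permutations


def greedy_total(weights):
    """Greedy matching: each class takes its best cluster, so clusters may be reused."""
    return sum(max(row) for row in weights)


def optimal_total(weights):
    """Maximum total weight of a one-to-one matching between classes and clusters."""
    n_rows, n_cols = len(weights), len(weights[0])
    if n_rows > n_cols:
        # Transpose so that every row can be assigned a distinct column.
        weights = [list(col) for col in zip(*weights)]
        n_rows, n_cols = n_cols, n_rows
    return max(
        sum(weights[row][col] for row, col in enumerate(perm))
        for perm in permutations(range(n_cols), n_rows)
    )
```

With weights [[0.9, 0.1], [0.8, 0.2]], greedy matching credits both classes to the first cluster, whereas the one-to-one matching must pair the second class with the second cluster and therefore yields a lower total.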

Copyrights may apply

p. 695-696

Maisonnasse, Loïc, Gaussier, Eric and Chevallet, Jean-Pierre (2007): Revisiting the dependence language model for information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 695-696. Available online

In this paper, we revisit the dependence language model for information retrieval proposed in [1], and show that this model is deficient from a theoretical point of view. We then propose a new model, well founded theoretically, for integrating dependencies between terms in the language model. This new model is simpler, yet more general, than the one proposed in [1], and yields similar results in our experiments, on both syntactic and semantic dependencies.

Copyrights may apply

p. 697-698

Qiu, Guang, Liu, Kangmiao, Bu, Jiajun, Chen, Chun and Kang, Zhiming (2007): Quantify query ambiguity using ODP metadata. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 697-698. Available online

Query ambiguity prevents existing retrieval systems from returning reasonable results for every query. As a large body of work already exists on resolving ambiguity, vague queries could be handled separately using the corresponding approaches if they can be identified in advance. Quantification of the degree of (lack of) ambiguity lays the groundwork for the identification. In this poster, we propose such a measure using query topics based on the topic structure selected from the Open Directory Project (ODP) taxonomy. We introduce a clarity score to quantify the lack of ambiguity with respect to data sets constructed from the TREC collections, and the rank correlation test results demonstrate a strong positive association between the clarity scores and retrieval precisions for queries.

Copyrights may apply

p. 699-700

Tan, Songbo and Wang, Yuefen (2007): Combining error-correcting output codes and model-refinement for text categorization. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 699-700. Available online

In this work, we explore the use of error-correcting output codes (ECOC) to enhance the performance of the centroid text classifier. The framework is to decompose one multi-class problem into multiple binary problems and then learn the individual binary classification problems with a centroid classifier. However, this kind of decomposition incurs considerable bias for the centroid classifier, which results in noticeable degradation of performance. To address this issue, we use Model-Refinement to adjust this so-called bias.

Copyrights may apply

p. 7-14

Chirita, Paul-Alexandru, Firan, Claudiu S. and Nejdl, Wolfgang (2007): Personalized query expansion for the web. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 7-14. Available online

The inherent ambiguity of short keyword queries demands enhanced methods for Web retrieval. In this paper we propose to improve such Web queries by expanding them with terms collected from each user's Personal Information Repository, thus implicitly personalizing the search output. We introduce five broad techniques for generating the additional query keywords by analyzing user data at increasing granularity levels, ranging from term and compound level analysis up to global co-occurrence statistics, as well as using external thesauri. Our extensive empirical analysis under four different scenarios shows that some of these approaches perform very well, especially on ambiguous queries, producing a very strong increase in the quality of the output rankings. Subsequently, we move this personalized search framework one step further and propose to make the expansion process adaptive to various features of each query. A separate set of experiments indicates that the adaptive algorithms bring an additional statistically significant improvement over the best static expansion approach.

Copyrights may apply

p. 701-702

Franz, Martin, McCarley, J. Scott and Xu, Jian-Ming (2007): User-oriented text segmentation evaluation measure. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 701-702. Available online

The paper describes a user oriented performance evaluation measure for text segmentation. Experiments show that the proposed measure differentiates well between error distributions with varying user impact.

Copyrights may apply

p. 703-704

Franz, Martin and Xu, Jian-Ming (2007): Story segmentation of broadcast news in Arabic, Chinese and English using multi-window features. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 703-704. Available online

The paper describes a maximum entropy based story segmentation system for Arabic, Chinese and English. In experiments with broadcast news data from TDT-3, TDT-4, and corpora collected in the DARPA GALE project we obtain a substantial performance gain using multiple overlapping windows for text-based features.

Copyrights may apply

p. 705-706

Strohman, Trevor, Croft, W. Bruce and Jensen, David (2007): Recommending citations for academic papers. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 705-706. Available online

We approach the problem of academic literature search by considering an unpublished manuscript as a query to a search system. We use the text of previous literature as well as the citation graph that connects it to find relevant related material. We evaluate our technique with manual and automatic evaluation methods, and find an order of magnitude improvement in mean average precision as compared to a text similarity baseline.

Copyrights may apply

p. 707-708

Cetintas, Suleyman and Si, Luo (2007): Exploration of the tradeoff between effectiveness and efficiency for results merging in federated search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 707-708. Available online

Federated search is the task of retrieving relevant documents from different information resources. One of the main research problems in federated search is to combine the results from different sources into a single ranked list. Recent work proposed a regression based method that downloads some documents from each ranked list of the different sources, calculates comparable scores for the documents and estimates mapping functions that transform source-specific scores into comparable scores. Experiments have shown that downloading more documents improves the accuracy of results merging. However, downloading more documents increases the computation and communication costs. This paper proposes a utility based optimization method that enables the system to automatically decide on the desired number of training documents to download according to the user's need for effectiveness and efficiency.

Copyrights may apply

p. 709-710

Phan, Nina, Bailey, Peter and Wilkinson, Ross (2007): Understanding the relationship of information need specificity to search query length. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 709-710. Available online

When searching, people's information needs flow through to expressing an information retrieval request posed to a search engine. We hypothesise that the degree of specificity of an IR request might correspond to the length of a search query. Our results show a strong correlation between decreasing query length and increasing broadness or generality of the IR request. We found an average cross-over point of specificity from broad to narrow of 3 words in the query. These results have implications for search engines in responding to queries of differing lengths.

Copyrights may apply

p. 71-78

Sakai, Tetsuya (2007): Alternatives to Bpref. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 71-78. Available online

Recently, a number of TREC tracks have adopted a retrieval effectiveness metric called bpref which has been designed for evaluation environments with incomplete relevance data. A graded-relevance version of this metric called rpref has also been proposed. However, we show that the application of Q-measure, normalised Discounted Cumulative Gain (nDCG) or Average Precision (AveP) to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref. Furthermore, we show that the use of graded relevance boosts the robustness of IR evaluation to incompleteness and therefore that Q-measure and nDCG based on condensed lists are the best choices. To this end, we use four graded-relevance test collections from NTCIR to compare ten different IR metrics in terms of system ranking stability and pairwise discriminative power.
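
The condensed-list idea in this abstract, removing every unjudged document from a ranked list before computing a standard metric, can be sketched as follows. The example uses binary Average Precision only; the judgments dictionary (document ID to relevance grade, with 0 meaning judged non-relevant and absence meaning unjudged) and the function names are assumptions made for illustration, not the paper's notation.

```python
def condense(ranked_docs, judgments):
    """Drop unjudged documents, keeping the original order of the judged ones."""
    return [doc for doc in ranked_docs if doc in judgments]


def average_precision(ranked_docs, judgments):
    """Binary AveP; documents missing from judgments count as non-relevant."""
    num_relevant = sum(1 for grade in judgments.values() if grade > 0)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if judgments.get(doc, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant


def condensed_average_precision(ranked_docs, judgments):
    """AveP computed on the condensed list rather than the original ranking."""
    return average_precision(condense(ranked_docs, judgments), judgments)
```

On the original list, unjudged documents push relevant ones to lower ranks and so lower the score; on the condensed list they are ignored entirely, which is the behaviour the abstract compares against bpref.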

Copyrights may apply

p. 711-712

Ko, Youngjoong, An, Hongkuk and Seo, Jungyun (2007): An effective snippet generation method using the pseudo relevance feedback technique. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 711-712. Available online

A (page or web) snippet is a document excerpt that allows a user to judge whether a document is indeed relevant without accessing it. This paper proposes an effective snippet generation method. The pseudo relevance feedback technique and text summarization techniques are applied to salient sentence extraction for generating good quality snippets. In the experimental results, the proposed method showed much better performance than other methods, including those of Google and Naver.

Copyrights may apply

p. 713-714

Wu, H. C., Luk, Robert W. P. and Wong, K. F. (2007): Probability ranking principle via optimal expected rank. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 713-714. Available online

This paper presents a new perspective of the probability ranking principle (PRP) by defining retrieval effectiveness in terms of our novel expected rank measure of a set of documents for a particular query. This perspective is based on preserving decision preferences, and it imposes weaker conditions on PRP than the utility-theoretic perspective of PRP.

Copyrights may apply

p. 715-716

Wiegand, Michael, Leidner, Jochen L. and Klakow, Dietrich (2007): Combining term-based and event-based matching for question answering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 715-716. Available online

In question answering, two main kinds of matching methods for finding answer sentences for a question are term-based approaches -- which are simple, efficient, effective, and yield high recall -- and event-based approaches that take syntactic and semantic information into account. The latter often sacrifice recall for increased precision, but actually capture the meaning of the events denoted by the textual units of a passage or sentence. We propose a robust, data-driven method that learns the mapping between questions and answers using logistic regression and show that combining term-based and event-based approaches significantly outperforms the individual methods.

Copyrights may apply

p. 717-718

Gyllstrom, Karl Anders, Soules, Craig and Veitch, Alistair (2007): Confluence: enhancing contextual desktop search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 717-718. Available online

We present Confluence, an enhancement to a desktop file search tool called Connections which extracts conceptual relationships between files by their temporal access patterns in the file system. A limitation of a purely file-based approach is that as file operations are increasingly abstracted by applications, their correlation to a user's activity weakens and thereby reduces the applicability of their temporal patterns. To deal with this problem, we augment the file event stream with a stream of window focus events from the UI layer. We present 3 algorithms that analyze this new stream, extracting the user's task information which informs the existing Connections algorithms. We present results and conclusions from a preliminary user study on Confluence.

Copyrights may apply

p. 719-720

Thomas, Paul and Rowlands, Tom (2007): Estimating the value of automatic disambiguation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 719-720. Available online

A common motivation for personalised search systems is the ability to disambiguate queries based on some knowledge of a user's interests. An analysis of log files from three search providers, covering a range of scenarios, suggests that this sort of disambiguation would be of marginal use for more specialised providers but may be of use for whole-of-Web search.

Copyrights may apply

p. 721-722

Kumaran, A. and Kellner, Tobias (2007): A generic framework for machine transliteration. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 721-722. Available online

p. 723-724

Kamps, Jaap, Koolen, Marijn and Lalmas, Mounia (2007): Where to start reading a textual XML document?. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 723-724. Available online

In structured information retrieval, the aim is to exploit document structure to retrieve relevant components, allowing the user to go straight to the relevant material. This paper looks at the so-called best entry points (BEPs), which are intended to give the user the best starting point to access the relevant information in the document. We examine the relationship between BEPs and relevant components in the INEX 2006 ad hoc assessments. Our main findings are the following: First, although documents are short, assessors often choose the best entry point some distance from the start of the document. Second, many of the best entry points coincide with the first relevant character in relevant documents, showing a strong relation between the BEP and relevant text. Third, we find browsing BEPs in articles with a single relevant passage, and container BEPs or context BEPs in articles with more relevant passages.

Copyrights may apply

p. 725-726

Fernández, Ronald T. and Losada, David E. (2007): Novelty detection using local context analysis. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 725-726. Available online

p. 727-728

Ruthven, Ian, Azzopardi, Leif, Baillie, Mark, Bierig, Ralf, Nicol, Emma, Sweeney, Simon and Yakici, Murat (2007): Intra-assessor consistency in question answering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 727-728. Available online

In this paper we investigate the consistency of answer assessment in a complex question answering task examining features of assessor consistency, types of answers and question type.

Copyrights may apply

p. 729-730

Winaver, Mattan, Kurland, Oren and Domshlak, Carmel (2007): Towards robust query expansion: model selection in the language modeling framework. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 729-730. Available online

We propose a language-model-based approach for addressing the performance robustness problem -- with respect to free-parameters' values -- of pseudo-feedback-based query-expansion methods. Given a query, we create a set of language models representing different forms of its expansion by varying the parameters' values of some expansion method; then, we select a single model using criteria originally proposed for evaluating the performance of using the original query, or for deciding whether to employ expansion at all. Experimental results show that these criteria are highly effective in selecting relevance language models that are not only significantly more effective than poorly performing ones, but that also yield performance that is almost indistinguishable from that of manually optimized relevance models.

Copyrights may apply

p. 731-732

Staff, Chris and Bugeja, Ian (2007): Automatic classification of web pages into bookmark categories. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 731-732. Available online

We describe a technique to automatically classify a web page into an existing bookmark category to help a user to bookmark a page. HyperBK compares a bag-of-words representation of the page to descriptions of categories in the user's bookmark file. Unlike default web browser dialog boxes in which the user may be presented with the category into which he or she saved the last bookmarked file, HyperBK also offers the category most similar to the page being bookmarked. The user can also opt to create a new category; or save the page elsewhere. In an evaluation, the user's preferred category was offered on average 61% of the time.
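The bag-of-words comparison described in this abstract might look roughly like the sketch below. The abstract does not say which similarity measure HyperBK uses; cosine similarity over raw term counts is an assumption here, and the category names and descriptions are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_category(page_text, categories):
    """Return the category whose description is most similar to the page."""
    page = bag_of_words(page_text)
    return max(categories, key=lambda name: cosine(page, bag_of_words(categories[name])))


# Hypothetical bookmark file: category name -> text describing the category.
categories = {
    "cooking": "recipe oven bake pasta sauce",
    "programming": "python code compiler bug function",
}
suggestion = suggest_category("a python function that triggers a compiler bug", categories)
```

A real system would likely add stop-word removal and term weighting; the sketch only shows the compare-and-pick-the-closest step.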

Copyrights may apply

p. 733-734

Lin, Kevin Hsin-Yih, Yang, Changhua and Chen, Hsinchun (2007): What emotions do news articles trigger in their readers?. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 733-734. Available online

We study the classification of news articles into the emotions they evoke in their readers. Our work differs from previous studies, which focused on the classification of documents into their authors' emotions instead of the readers'. We use various combinations of feature sets to find the best combination for identifying the emotional influences of news articles on readers.

Copyrights may apply

p. 735-736

Verberne, Suzan, Boves, Lou, Oostdijk, Nelleke and Coppen, Peter-Arno (2007): Evaluating discourse-based answer extraction for why-question answering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 735-736. Available online

p. 737-738

Sitbon, Laurianne and Bellot, Patrice (2007): Topic segmentation using weighted lexical links (WLL). In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 737-738. Available online

This paper presents two new approaches of lexical chains for topic segmentation using weighted lexical chains (WLC) or weighted lexical links (WLL) between repeated occurrences of lemmas along the text. The main advantage of using these new approaches is the suppression of the empirical parameter called hiatus in lexical chain processing. An evaluation according to the WindowDiff measure on a large automatically built corpus shows slight improvements in WLL compared to state-of-the-art methods based on lexical chains.

Copyrights may apply

p. 739-740

Bozzon, Alessandro, Chirita, Paul-Alexandru, Firan, Claudiu S. and Nejdl, Wolfgang (2007): Lexical analysis for modeling web query reformulation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 739-740. Available online

Modeling Web query reformulation processes is still an unsolved problem. In this paper we argue that lexical analysis is highly beneficial for this purpose. We propose to use the variation in Query Clarity, as well as the Part-Of-Speech pattern transitions, as indicators of users' search actions. Experiments with a log of 2.4 million queries showed our techniques to be more flexible than the current approaches, while also providing us with interesting insights into users' Web behavioral patterns.

Copyrights may apply

p. 741-742

Lalmas, Mounia, Bhat, Ramnath, Frank, Maxine, Frohlich, David M. and Jones, Matt (2007): Bridging the digital divide: understanding information access practices in an Indian village community. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 741-742. Available online

For digital library and information retrieval technologies to provide solutions for bridging the digital divide in developing countries, we need to understand the information access practices of remote and often poor communities in these countries. We must understand the information needs of these communities, and the best means to provide them access to relevant information. To this end, we investigated the current information access practices in an Indian village.

Copyrights may apply

p. 743-744

Sevillano, Xavier, Alias, Francesc and Socoro, Joan Claudi (2007): BordaConsensus: a new consensus function for soft cluster ensembles. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 743-744. Available online

Consensus clustering is the task of deriving a single labeling by applying a consensus function on a cluster ensemble. This work introduces BordaConsensus, a new consensus function for soft cluster ensembles based on the Borda voting scheme. In contrast to classic, hard consensus functions that operate on labelings, our proposal considers cluster membership information, thus being able to tackle multiclass clustering problems. Initial small scale experiments reveal that, compared to state-of-the-art consensus functions, BordaConsensus constitutes a good performance vs. complexity trade-off.
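As a rough illustration of Borda voting applied to soft cluster memberships, consider the sketch below. It is not the BordaConsensus function from the paper: it assumes the clusters are already aligned across the ensemble (cluster 0 means the same thing in every clustering), which a real consensus function would have to establish first. All names and numbers are hypothetical.

```python
def borda_consensus(ensemble):
    """ensemble: list of membership matrices; each matrix holds, per object, k membership scores.

    Each clustering ranks the k clusters for an object by membership; the lowest-ranked
    cluster earns 0 points and the highest k-1. The object is assigned to the cluster
    with the most points summed over the ensemble.
    """
    n_objects = len(ensemble[0])
    k = len(ensemble[0][0])
    labels = []
    for i in range(n_objects):
        points = [0] * k
        for memberships in ensemble:
            row = memberships[i]
            # Cluster indices in ascending membership order; the position is the Borda score.
            order = sorted(range(k), key=lambda c: row[c])
            for pts, cluster in enumerate(order):
                points[cluster] += pts
        labels.append(max(range(k), key=lambda c: points[c]))
    return labels


# Three hypothetical soft clusterings of two objects into three clusters.
ensemble = [
    [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]],
    [[0.5, 0.4, 0.1], [0.2, 0.1, 0.7]],
    [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]],
]
labels = borda_consensus(ensemble)
```

Because only rank positions are summed, a single clustering with an extreme membership value cannot dominate the vote, and the cost is one sort of k values per object and clustering.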

Copyrights may apply

p. 745-746

Bordogna, Gloria, Ghilardi, Luca, Milesi, Simone and Pagani, Marco (2007): A flexible retrieval system of shapes in binary images. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 745-746. Available online

This poster overviews the main characteristics of a flexible retrieval system for shapes present in binary images and discusses some evaluation results. The system applies multiple indexing criteria to the shapes, synthesizing distinct characteristics such as global features of the object's contour (Fourier Coefficients), boundary irregularities (Multifractal Spectrum), and the presence of concavities and convexities on the boundary (Contour Scale Space distribution). The system is flexible since it allows customizing the retrieval function to fit an application need. The query is a binary image containing the desired shape and a set of parameters specifying the distinct importance of the shape characteristics that must be taken into account to evaluate the relevance of the retrieved shapes. The retrieval function is then defined as a Flexible Multicriteria fusion Function producing ranked results. The evaluation experiments showed that this system can be suited to different retrieval purposes, and that generally the combination of the distinct shape indexing criteria increases both Recall and Precision with respect to the application of any single indexing criterion alone.

Copyrights may apply

p. 747-748

Zhang, Yi and Liu, Bing (2007): Semantic text classification of disease reporting. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 747-748. Available online

Traditional text classification studied in the IR literature is mainly based on topics. That is, each class or category represents a particular topic, e.g., sports, politics or sciences. However, many real-world text classification problems require more refined classification based on some semantic aspects. For example, in a set of documents about a particular disease, some documents may report the outbreak of the disease, some may describe how to cure the disease, some may discuss how to prevent the disease, and yet some others may include all the above information. To classify text at this semantic level, the traditional "bag of words" model is no longer sufficient. In this paper, we report a text classification study at the semantic level and show that sentence semantic and structure features are very useful for this kind of classification. Our experimental results based on a disease outbreak dataset demonstrated the effectiveness of the proposed approach.

Copyrights may apply

p. 749-750

Kamps, Jaap, Lalmas, Mounia and Pehcevski, Jovan (2007): Evaluating relevant in context: document retrieval with a twist. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 749-750. Available online

The Relevant in Context retrieval task is document or article retrieval with a twist, where not only the relevant articles should be retrieved but also the relevant information within each article (captured by a set of XML elements) should be correctly identified. Our main research question is: how to evaluate the Relevant in Context task? We propose a generalized average precision measure that meets two main requirements: i) the score reflects the ranked list of articles inherent in the result list, and at the same time ii) the score also reflects how well the retrieved information per article (i.e., the set of elements) corresponds to the relevant information. The resulting measure was used at INEX 2006.

Copyrights may apply

p. 751-752

Lee, Lillian (2007): IDF revisited: a simple new derivation within the Robertson-Spärck Jones probabilistic model. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 751-752. Available online

There have been a number of prior attempts to theoretically justify the effectiveness of the inverse document frequency (IDF). Those that take as their starting point Robertson and Spärck Jones's probabilistic model are based on strong or complex assumptions. We show that a more intuitively plausible assumption suffices. Moreover, the new assumption, while conceptually very simple, provides a solution to an estimation problem that had been deemed intractable by Robertson and Walker (1997).

Copyrights may apply

p. 753-754

Cormack, Gordon V. and Lynam, Thomas R. (2007): Validity and power of t-test for comparing MAP and GMAP. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 753-754. Available online

We examine the validity and power of the t-test, Wilcoxon test, and sign test in determining whether or not the difference in performance between two IR systems is significant. Empirical tests conducted on subsets of the TREC2004 Robust Retrieval collection indicate that the p-values computed by these tests for the difference in mean average precision (MAP) between two systems are very accurate for a wide range of sample sizes and significance estimates. Similarly, these tests have good power, with the t-test proving superior overall. The t-test is also valid for comparing geometric mean average precision (GMAP), exhibiting slightly superior accuracy and slightly inferior power than for MAP comparison.
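For readers unfamiliar with the quantities named here, the sketch below computes MAP, GMAP and a paired t statistic from per-topic average precision scores. It illustrates the standard definitions only, not the paper's experimental procedure; the per-topic scores are invented, and the small `eps` floor in `gmap` is an assumption added to avoid taking the logarithm of zero.

```python
import math
from statistics import mean, stdev


def mean_ap(ap_scores):
    """MAP: arithmetic mean of per-topic average precision."""
    return mean(ap_scores)


def gmap(ap_scores, eps=1e-5):
    """GMAP: geometric mean of per-topic AP, computed as exp of the mean log."""
    return math.exp(mean(math.log(max(ap, eps)) for ap in ap_scores))


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-topic differences (n - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-topic AP for two systems over the same four topics.
system_a = [0.5, 0.4, 0.3, 0.6]
system_b = [0.4, 0.2, 0.3, 0.5]

t_map = paired_t_statistic(system_a, system_b)
# For a GMAP comparison the same test would be run on log-transformed AP values.
t_gmap = paired_t_statistic([math.log(x) for x in system_a],
                            [math.log(x) for x in system_b])
```

The t statistic would then be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value.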

Copyrights may apply

p. 755-756

Efron, Miles (2007): Model-averaged latent semantic indexing. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 755-756. Available online

This poster introduces a novel approach to information retrieval that uses statistical model averaging to improve latent semantic indexing (LSI). Instead of choosing a single dimensionality $k$ for LSI, we propose using several models of differing dimensionality to inform retrieval. To manage this ensemble we weight each model's contribution to an extent inversely proportional to its AIC (Akaike information criterion). Thus each model contributes proportionally to its expected Kullback-Leibler divergence from the distribution that generated the data. We present results on three standard IR test collections, demonstrating significant improvement over both the traditional vector space model and single-model LSI.

Copyrights may apply

p. 757-758

Teevan, Jaime, Dumais, Susan and Horvitz, Eric (2007): Characterizing the value of personalizing search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 757-758. Available online

We investigate the diverse goals that people have when they issue the same query to a search engine, and the ability of current search engines to address such diversity. We quantify the potential value of personalizing search results based on this analysis. Great variance was found in the results that different individuals rated as relevant for the same query -- even when the same information goal was expressed. Our analysis suggests that while search engines do a good job of ranking results to maximize global happiness, they do not do a very good job for specific individuals.

Copyrights may apply

p. 759-760

Yeung, Peter C. K., Clarke, Charles L. A. and Büttcher, Stefan (2007): Improving retrieval accuracy by weighting document types with clickthrough data. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 759-760. Available online

For enterprise search, there exists a relationship between work task and document type that can be used to refine search results. In this poster, we adapt the popular Okapi BM25 scoring function to weight term frequency based on the relevance of a document type to a work task. Also, we use click frequency for each task-type pair to estimate a realistic weight. Using the W3C collection from the TREC Enterprise track for evaluations, our approach leads to significant improvements on search precision.

Copyrights may apply

p. 761-762

Jiang, Wei, Si, Luo and Li, Jing (2007): Protecting source privacy in federated search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 761-762. Available online

Many information sources contain information that can only be accessed through source-specific search engines. Federated search provides search solutions for this type of hidden information that cannot be searched by conventional search engines. In many scenarios of federated search, such as the search among health care providers or among intelligence agencies, an individual information source does not want to disclose the source of the search results to users or other sources. Therefore, this paper proposes a two-step federated search protocol that protects the privacy of information sources. As far as we know, this is the first attempt to address the research problem of protecting source privacy in federated text search.

Copyrights may apply

p. 763-764

Liao, Ciya and Chang, Thomas (2007): Applying ranking SVM in query relaxation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 763-764. Available online

We propose an approach, QRRS (Query Relaxative Ranking SVM), that divides a ranking function into different relaxation steps, so that only cheap features are used in the Ranking SVM of early steps for query efficiency. We show that search quality with this approach is improved compared to conventional Ranking SVM.

Copyrights may apply

p. 765-766

Xu, Jingfang and Li, Xing (2007): Learning to rank collections. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 765-766. Available online

Collection selection, that is, ranking collections according to a user query, is crucial in distributed search. However, few features are used to rank collections in the current collection selection methods, while hundreds of features are exploited to rank web pages in web search. The lack of features affects the efficiency of collection selection in distributed search. In this paper, we exploit some new features and learn to rank collections with them through SVM and RankingSVM respectively. Experimental results show that our features are beneficial to collection selection, and the learned ranking functions outperform the classical CORI algorithm.

Copyrights may apply

p. 767-768

Mei, Tao, Yang, Bo, Hua, Xian-Sheng, Yang, Linjun, Yang, Shi-Qiang and Li, Shipeng (2007): VideoReach: an online video recommendation system. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 767-768. Available online

This paper presents a novel online video recommendation system called VideoReach, which alleviates users' efforts on finding the most relevant videos according to current viewings without a sufficient collection of user profiles as required in traditional recommenders. In this system, video recommendation is formulated as finding a list of relevant videos in terms of multimodal relevance (i.e. textual, visual, and aural relevance) and user click-through. Since different videos have different intra-weights of relevance within an individual modality and inter-weights among different modalities, we adopt relevance feedback to automatically find optimal weights by user click-through, as well as an attention fusion function to fuse multimodal relevance. We use 20 clips as the representative test videos, which are searched by top 10 queries from more than 13k online videos, and report superior performance compared with an existing video site.

Copyrights may apply

p. 769-770

Yakici, Murat, Baillie, Mark, Ruthven, Ian and Crestani, Fabio (2007): Modelling epistemic uncertainty in IR evaluation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 769-770. Available online

Modern information retrieval (IR) test collections violate the completeness assumption of the Cranfield paradigm. In order to maximise the available resources, only a sample of documents (i.e. the pool) are judged for relevance by a human assessor(s). The subsequent evaluation protocol does not make any distinctions between assessed or unassessed documents, as documents that are not in the pool are assumed to be not relevant for the topic. This is beneficial from a practical point of view, as the relative performance can be compared with confidence if the experimental conditions are fair for all systems. However, given the incompleteness of relevance assessments, two forms of uncertainty emerge during evaluation. The first is Aleatory uncertainty, which refers to variation in system performance across the topic set, which is often addressed through the use of statistical significance tests. The second form of uncertainty is Epistemic, which refers to the amount of knowledge (or ignorance) we have about the estimate of a system's performance. Epistemic uncertainty is a consequence of incompleteness and is not addressed by the current evaluation protocol. In this study, we present a first attempt at modelling both aleatory and epistemic uncertainty associated with IR evaluation. We aim to account for both the variability associated with system performance and the amount of knowledge known about the performance estimate.

Copyrights may apply

p. 771-772

Schuldt, Arne, Gottfried, Björn, Osterhagen, Ole and Herzog, Otthein (2007): On the importance of preserving the part-order in shape retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 771-772. Available online

This paper discusses the importance of part-order-preservation in shape matching. A part descriptor is introduced that supports both preserving and abandoning the order of parts. The evaluation shows that retrieval results are improved by almost 38% if the original ordering is preserved.

Copyrights may apply

p. 773-774

Al-Maskari, Azzah, Sanderson, Mark and Clough, Paul (2007): The relationship between IR effectiveness measures and user satisfaction. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 773-774. Available online

This paper presents an experimental study of users assessing the quality of Google web search results. In particular we look at how users' satisfaction correlates with the effectiveness of Google as quantified by IR measures such as precision and the suite of Cumulative Gain measures (CG, DCG, NDCG). Results indicate strong correlation between users' satisfaction, CG and precision, moderate correlation with DCG, with perhaps surprisingly negligible correlation with NDCG. The reasons for the low correlation with NDCG are examined.

Copyrights may apply

p. 775-776

Pasi, Gabriella, Bordogna, Gloria and Villa, Robert (2007): A multi-criteria content-based filtering system. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 775-776. Available online

In this paper we present a novel filtering system, based on a new model which reshapes the aims of content-based filtering. The filtering system has been developed within the EC project PENG, aimed at providing news professionals, such as journalists, with a system supporting both filtering and retrieval capabilities. In particular, we suggest that in tackling the problem of information overload, it is necessary for filtering systems to take into account multiple aspects of incoming documents in order to estimate their relevance to a user's profile, and in order to help users better understand documents, as distinct from solely attempting to either select relevant material from a stream, or block inappropriate material. Aiming to do this, a filtering model based on multiple criteria has been defined, based on the ideas gleaned in the project requirements stage. The filtering model is briefly described in this paper.

Copyrights may apply

p. 777-778

Blanco, Roi and Barreiro, Alvaro (2007): Boosting static pruning of inverted files. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 777-778. Available online

This paper revisits the static term-based pruning technique presented in Carmel et al., SIGIR 2001 for ad-hoc retrieval, addressing different issues concerning its algorithmic design not yet taken into account. Although the original technique is able to retain precision when a considerable part of the inverted file is removed, we show that it is possible to improve precision in some scenarios if some key design features are properly selected.

Copyrights may apply

p. 779-780

Leidner, Jochen L. (2007): Resource monitoring in information extraction. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 779-780. Available online

It is often argued that in information extraction (IE), certain machine learning (ML) approaches save development time over others, or that certain ML methods (e.g. Active Learning) require less training data than others, thus saving development cost. However, such development cost claims are not normally backed up by controlled studies which show that such development cost savings actually occur. This situation in Language Engineering is contrasted with Software Engineering in general, where a lot of studies investigating system development cost have been carried out. We argue for the need of controlled studies that measure actual system development time in language engineering. To this end, we carry out an experiment in resource monitoring for an IE task: three named entity taggers for the same "surprise" domain are developed in parallel, using competing methods. Their human development time is accounted for using a logging facility. We report development cost results for parallel implementations of a named entity tagger and present a breakdown of the development time for the three alternative methods. We are not aware of detailed previous parallel studies that detail how system development time is spent when creating a named entity tagger.

Copyrights may apply

p. 781-782

Simeoni, Fabio, Crestani, Fabio and Bierig, Ralf (2007): The DILIGENT framework for distributed information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 781-782. Available online

p. 783-784

Beitzel, Steven M., Jensen, Eric C., Chowdhury, Abdur and Frieder, Ophir (2007): Varying approaches to topical web query classification. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 783-784. Available online

Topical classification of web queries has drawn recent interest because of the promise it offers in improving retrieval effectiveness and efficiency. However, much of this promise depends on whether classification is performed before or after the query is used to retrieve documents. We examine two previously unaddressed issues in query classification: pre- versus post-retrieval classification effectiveness, and the effect of training explicitly from classified queries versus bridging a classifier trained using a document taxonomy. Bridging classifiers map the categories of a document taxonomy onto those of a query classification problem to provide sufficient training data. We find that training classifiers explicitly from manually classified queries outperforms the bridged classifier by 48% in F1 score. Also, a pre-retrieval classifier using only the query terms performs merely 11% worse than the bridged classifier, which requires snippets from retrieved documents.

p. 785-786

Soboroff, Ian (2007): A comparison of pooled and sampled relevance judgments. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 785-786. Available online

Test collections are most useful when they are reusable, that is, when they can be reliably used to rank systems that did not contribute to the pools. Pooled relevance judgments for very large collections may not be reusable for two reasons: they will be very sparse and not sufficiently complete, and they may be biased in the sense that they will unfairly rank some class of systems. The TREC 2006 terabyte track judged both a pool and a deep random sample in order to measure the effects of sparseness and bias.

p. 787-788

Banerjee, Somnath, Ramanathan, Krishnan and Gupta, Ajay (2007): Clustering short texts using Wikipedia. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 787-788. Available online

Subscribers to popular news or blog feeds (RSS/Atom) often face the problem of information overload, as these feed sources usually deliver a large number of items periodically. One solution to this problem could be clustering similar items in the feed reader to make the information more manageable for a user. Clustering items at the feed reader end is a challenging task, as usually only a small part of the actual article is received through the feed. In this paper, we propose a method of improving the accuracy of clustering short texts by enriching their representation with additional features from Wikipedia. Empirical results indicate that this enriched representation of text items can substantially improve the clustering accuracy when compared to the conventional bag of words representation.

p. 789-790

Xu, Jingfang, Wu, Sheng and Li, Xing (2007): Estimating collection size with logistic regression. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 789-790. Available online

Collection size is an important feature of the content summary of a collection, and plays a vital role in collection selection for distributed search. In uncooperative environments, collection size estimation algorithms are adopted to estimate the sizes of collections through their search interfaces. This paper proposes the heterogeneous capture (HC) algorithm, in which the capture probabilities of documents are modeled with logistic regression. With heterogeneous capture probabilities, the HC algorithm estimates collection size through conditional maximum likelihood. Experimental results on real web data show that our HC algorithm outperforms both multiple capture-recapture and capture history algorithms.

p. 79-86

Raghavan, Hema and Allan, James (2007): An interactive algorithm for asking and incorporating feature feedback into support vector machines. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 79-86. Available online

Standard machine learning techniques typically require ample training data in the form of labeled instances. In many situations it may be too tedious or costly to obtain sufficient labeled data for adequate classifier performance. However, in text classification, humans can easily guess the relevance of features, that is, words that are indicative of a topic, thereby enabling the classifier to focus its feature weights more appropriately in the absence of sufficient labeled data. We will describe an algorithm for tandem learning that begins with a couple of labeled instances, and then at each iteration recommends features and instances for a human to label. Tandem learning using an "oracle" results in much better performance than learning on only features or only instances. We find that humans can emulate the oracle to an extent that results in performance (accuracy) comparable to that of the oracle. Our unique experimental design helps factor out system error from human error, leading to a better understanding of when and why interactive feature selection works.

p. 791-792

Haubold, Alexander (2007): Selection and ranking of text from highly imperfect transcripts for retrieval of video content. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 791-792. Available online

In the domain of video content retrieval, we present an approach for selecting words and phrases from highly imperfect automatically generated transcripts. Extracted terms are ranked according to their descriptiveness and presented to the user in a multimedia browser interface. We use sense querying from the WordNet lexical database for our method of text selection and ranking. Evaluation of 679 video summarization tasks from 442 users shows that the method of ranking and emphasizing terms according to descriptiveness results in higher accuracy responses in less time compared to the baseline of no ranking.

p. 793-794

Fujii, Atsushi (2007): Enhancing patent retrieval by citation analysis. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 793-794. Available online

This paper proposes a method to combine text-based and citation-based retrieval methods in the invalidity patent search. Using the NTCIR-6 test collection including eight years of USPTO patents, we show the effectiveness of our method experimentally.

p. 795-796

Cai, Keke, Chen, Chun, Liu, Kangmiao, Bu, Jiajun and Huang, Peng (2007): MRF based approach for sentence retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 795-796. Available online

This poster studies term context dependence in sentence retrieval. Based on the Markov Random Field (MRF) model, three forms of dependence among query terms are considered. Under different assumptions about the term dependence relationship, three feature functions are defined so that association features between query terms in a sentence can be used to evaluate the relevance of that sentence. Experimental results have shown the efficiency of the proposed retrieval models in improving the performance of sentence retrieval.

p. 797-798

Li, Yinghao, Luk, Wing Pong Robert, Ho, Kei Shiu Edward and Chung, Fu Lai Korris (2007): Improving weak ad-hoc queries using Wikipedia as external corpus. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 797-798. Available online

In an ad-hoc retrieval task, the query is usually short and the user expects to find the relevant documents in the first several result pages. We explored the possibilities of using Wikipedia's articles as an external corpus to expand ad-hoc queries. Results show promising improvements on measures that emphasize weak queries.

p. 799-800

Lee, Changki, Hwang, Yi-Gyu and Jang, Myung-Gil (2007): Fine-grained named entity recognition and relation extraction for question answering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 799-800. Available online

p. 801-802

Hagedorn, Bennett A., Ciaramita, Massimiliano and Atserias, Jordi (2007): World knowledge in broad-coverage information filtering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 801-802. Available online

p. 803-804

Trieschnigg, Dolf, Kraaij, Wessel and Jong, Franciska de (2007): The influence of basic tokenization on biomedical document retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 803-804. Available online

Tokenization is a fundamental preprocessing step in Information Retrieval systems in which text is turned into index terms. This paper quantifies and compares the influence of various simple tokenization techniques on document retrieval effectiveness in two domains: biomedicine and news. As expected, biomedical retrieval is more sensitive to small changes in the tokenization method. The tokenization strategy can make the difference between a mediocre and a well-performing IR system, especially in the biomedical domain.

p. 805-806

Kyriakopoulou, Antonia and Kalamboukis, Theodore (2007): Using clustering to enhance text classification. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 805-806. Available online

This paper addresses the problem of learning to classify texts by exploiting information derived from clustering both training and testing sets. The incorporation of knowledge resulting from clustering into the feature space representation of the texts is expected to boost the performance of a classifier. Experiments conducted on several widely used datasets demonstrate the effectiveness of the proposed algorithm especially for small training sets.

p. 807-808

Stepinski, Adam and Mittal, Vibhu O. (2007): A fact/opinion classifier for news articles. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 807-808. Available online

Many online news/blog aggregators like Google, Yahoo and MSN allow users to browse/search many hundreds of news sources. This results in dozens, often hundreds, of stories about the same event. While the news aggregators cluster these stories, allowing the user to efficiently scan the major news items at any given time, they do not currently allow alternative browsing mechanisms within the clusters. Furthermore, their intra-cluster ranking mechanisms are often based on a notion of authority/popularity of the source. In many cases, this leads to the classic power law phenomenon -- the popular stories/sources are the ones that are already popular/authoritative, thus reinforcing one dominant viewpoint. Ideally, these aggregators would exploit the availability of the tremendous number of sources to identify the various dominant threads or viewpoints about a story and highlight these threads for the users. This paper presents an initial limited approach to such an interface: it classifies articles into two categories: fact and opinion. We show that the combination of (i) a classifier trained on a small (140K) training set of editorials/reports and (ii) an interactive user interface that ameliorates classification errors by re-ordering the presentation can be effective in highlighting different underlying viewpoints in a story-cluster. We briefly discuss the classifier used here, the training set and the UI and report on some initial anecdotal user feedback and evaluation.

p. 809-810

Yi, Xing, Allan, James and Croft, W. Bruce (2007): Matching resumes and jobs based on relevance models. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 809-810. Available online

We investigate the difficult problem of matching semi-structured resumes and jobs in a large-scale real-world collection. We compare standard approaches to Structured Relevance Models (SRM), an extension of relevance-based language models for modeling and retrieving semi-structured documents. Preliminary experiments show that the SRM approach achieved promising performance and performed better than typical unstructured relevance models.

p. 811-812

Ding, Xiaowen and Liu, Bing (2007): The utility of linguistic rules in opinion mining. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 811-812. Available online

Online product reviews are one of the important opinion sources on the Web. This paper studies the problem of determining the semantic orientations (positive or negative) of opinions expressed on product features in reviews. Most existing approaches use a set of opinion words for the purpose. However, the semantic orientations of many words are context dependent. In this paper, we propose to use some linguistic rules to deal with the problem together with a new opinion aggregation function. Extensive experiments show that these rules and the function are highly effective. A system, called Opinion Observer, has also been built.

p. 813-814

Balasubramanian, Niranjan, Allan, James and Croft, W. Bruce (2007): A comparison of sentence retrieval techniques. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 813-814. Available online

Identifying redundant information in sentences is useful for several applications such as summarization, document provenance, detecting text reuse and novelty detection. The task of identifying redundant information in sentences is defined as follows: Given a query sentence the task is to retrieve sentences from a given collection that express all or some subset of the information present in the query sentence. Sentence retrieval techniques rank sentences based on some measure of their similarity to a query. The effectiveness of such techniques depends on the similarity measure used to rank sentences. An effective retrieval model should be able to handle low word overlap between query and candidate sentences and go beyond just word overlap. Simple language modeling techniques like query likelihood retrieval have outperformed TF-IDF and word overlap based methods for ranking sentences. In this paper, we compare the performance of sentence retrieval using different language modeling techniques for the problem of identifying redundant information.

p. 815-816

Magalhaes, Joao and Rueger, Stefan (2007): High-dimensional visual vocabularies for image retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 815-816. Available online

In this paper we formulate image retrieval by text query as a vector space classification problem. This is achieved by creating a high-dimensional visual vocabulary that represents the image documents in great detail. We show how the representation of these image documents enables the application of well known text retrieval techniques such as Rocchio tf-idf and naive Bayes to the semantic image retrieval problem. We tested these methods on a Corel images subset and achieved state-of-the-art retrieval performance using the proposed methods.

p. 817-818

Chibane, Idir and Doan, Bich-Lien (2007): A web page topic segmentation algorithm based on visual criteria and content layout. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 817-818. Available online

This paper presents experiments using an algorithm of web page topic segmentation that show significant precision improvement in the retrieval of documents issued from the Web track corpus of TREC 2001. Instead of processing the whole document, a web page is segmented into different semantic blocks according to visual criteria (such as horizontal lines, colors) and structural tags (such as heading and paragraph tags). We conclude that combining visual and content layout criteria gives the best results for increasing the precision: the ranking of the page is calculated for relevant segments of pages resulting from the segmentation algorithm.

p. 819-820

Feng, Ao (2007): Document clustering: an optimization problem. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 819-820. Available online

Clustering algorithms have been widely used in information retrieval applications. However, it is difficult to define an objective "best" result. This article analyzes some document clustering algorithms and illustrates that they are equivalent to the optimization problem of some global functions. Experiments show their good performance, but there are still counter-examples where they fail to return the optimal solution. We argue that Monte-Carlo algorithms in the global optimization framework have the potential to find better solutions than traditional clustering, and they are able to handle more complex structures.

p. 821-822

Balog, Krisztian and Rijke, Maarten de (2007): Finding similar experts. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 821-822. Available online

The task of finding people who are experts on a topic has recently received increased attention. We introduce a different expert finding task for which a small number of example experts is given (instead of a natural language query), and the system's task is to return similar experts. We define, compare, and evaluate a number of ways of representing experts, and investigate how the size of the initial example set affects performance. We show that more fine-grained representations of candidates result in higher performance, and larger sample sets as input lead to improved precision.

p. 823-824

Ertekin, Seyda, Huang, Jian and Giles, Clyde Lee (2007): Active learning for class imbalance problem. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 823-824. Available online

The class imbalance problem has been known to hinder the learning performance of classification algorithms. Various real-world classification tasks such as text categorization suffer from this phenomenon. We demonstrate that active learning is capable of solving the problem.