Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval
The annual BCS-IRSG European Conference on Information Retrieval is the main European forum for the presentation of new research results in the field of Information Retrieval. The conference encourages the submission of high quality research papers reporting original, previously unpublished results.
The following articles are from "Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval":
Hollink, Vera, He, Jiyin and Vries, Arjen de (2012): Explaining query modifications: an alternative interpretation of term addition and removal. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 1-12. Available online
In the course of a search session, searchers often modify their queries several times. In most previous work analyzing search logs, the addition of terms to a query is identified with query specification and the removal of terms with query generalization. By analyzing the result sets that motivated searchers to make modifications, we show that this interpretation is not always correct. In fact, our experiments indicate that in the majority of cases the modifications have the opposite functions. Terms are often removed to get rid of irrelevant results matching only part of the query and thus to make the result set more specific. Similarly, terms are often added to retrieve more diverse results. We propose an alternative interpretation of term additions and removals and show that it explains the deviant modification behavior that was observed.
Wu, Hao and Fang, Hui (2012): Relation based term weighting regularization. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 109-120. Available online
Traditional retrieval models compute term weights based only on information related to individual terms, such as TF and IDF. However, query terms are related. Intuitively, these relations could provide useful information about the importance of a term in the context of other query terms. For example, the query "perl tutorial" specifies that a user is looking for information relevant to both perl and tutorial. Thus, a document containing both terms should have a higher relevance score than one containing only one of them. However, if the IDF value of "tutorial" is much smaller than that of "perl", existing retrieval models may assign the document a lower score than those containing multiple occurrences of "perl". It is clear that the importance of a term should depend not only on collection statistics but also on its relations with other query terms. In this work, we study how to utilize semantic relations among query terms to regularize term weighting. Experimental results over TREC collections show that the proposed strategy is effective in improving retrieval performance.
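The "perl tutorial" situation described in this abstract can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch: the scoring rule is a plain sum of TF times IDF, and the IDF values and the two documents are hypothetical numbers chosen for the example, not data from the paper.

```python
def tf_idf_score(query_terms, doc_tf, idf):
    """Score a document as the sum of tf * idf over the query terms.

    This is a deliberately simple scoring rule used for illustration only.
    """
    return sum(doc_tf.get(term, 0) * idf[term] for term in query_terms)


# Hypothetical IDF values: "tutorial" is assumed to be much more common than "perl".
idf = {"perl": 5.0, "tutorial": 1.0}
query = ["perl", "tutorial"]

doc_both = {"perl": 1, "tutorial": 1}  # matches both query terms once
doc_perl_only = {"perl": 3}            # repeats "perl", never mentions "tutorial"

score_both = tf_idf_score(query, doc_both, idf)
score_perl_only = tf_idf_score(query, doc_perl_only, idf)

print(score_both, score_perl_only)
```

Under these assumed numbers the document that covers only "perl" outscores the document that covers both terms, which is the behaviour the abstract argues a relation-aware weighting should correct.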
Yan, Zhenlei and Zhou, Jie (2012): A new approach to answerer recommendation in community question answering services. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 121-132. Available online
Community Question Answering (CQA) services, which enable users to ask and answer questions, have become popular on the web. However, many questions are not resolved effectively by appropriate answerers. To address this problem, we present a novel approach to recommend users who are most likely to be able to answer a new question. Unlike previous methods, this approach utilizes the inherent semantic relations among asker, question and answerer simultaneously and performs the answerer recommendation task based on tensor factorization. Experimental results on two real-world CQA datasets show that the proposed method is able to recommend appropriate answerers for new questions and outperforms other state-of-the-art approaches.
Bortnikov, Edward, Donmez, Pinar, Kagian, Amit and Lempel, Ronny (2012): Modeling transactional queries via templates. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 13-24. Available online
Search queries have been roughly classified into three categories -- navigational, informational and transactional. The latter group includes queries that aim to perform some Web-mediated task, often by interacting with parameterized Web services. In order to assist users in completing tasks online, one of the first building blocks is identifying whether and which transactional use-case is associated with each query. This paper describes a framework and an algorithm for automatically generating compact representations of queries associated with transactional use cases. We mine search click logs for queries that lead to clicks on pages associated with a use-case, generalize the set of mined queries into templates by replacing query terms with taxonomy categories, and eliminate redundancies. This approach allows associating the use-case with queries unseen in the log sample, while keeping a concise model. Our methodology allows a business owner to select an appropriate operating point that balances the tradeoff between precision and recall. We report the results of an offline evaluation of our framework on three transactional domains, and demonstrate the viability of the approach.
Neumayer, Robert, Balog, Krisztian and Nørvåg, Kjetil (2012): On the modeling of entities for ad-hoc entity search in the web of data. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 133-145. Available online
The Web of Data describes objects, entities, or "things" in terms of their attributes and their relationships, using RDF statements. There is a need to make this wealth of knowledge easily accessible by means of keyword search. Despite recent research efforts in this direction, there is a lack of understanding of how structured semantic data is best represented for text-based entity retrieval. The task we are addressing in this paper is ad-hoc entity search: the retrieval of RDF resources that represent an entity described in the keyword query. We build upon and formalise existing entity modeling approaches within a generative language modeling framework, and compare them experimentally using a standard test collection, provided by the Semantic Search Challenge evaluation series. We show that these models outperform the current state-of-the-art in terms of retrieval effectiveness; however, this is done at the cost of abandoning a large part of the semantics behind the data. We propose a novel entity model capable of preserving the semantics associated with entities, without sacrificing retrieval effectiveness.
Berendsen, Richard, Kovachev, Bogomil, Nastou, Evangelia-Paraskevi, Rijke, Maarten de and Weerkamp, Wouter (2012): Result disambiguation in web people search. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 146-157. Available online
We study the problem of disambiguating the results of a web people search engine: given a query consisting of a person name plus the result pages for this query, find correct referents for all mentions by clustering the pages according to the different people sharing the name. While the problem has been studied extensively, we discover that the increasing availability of results retrieved from social media platforms causes state-of-the-art methods to break down. We analyze the problem and propose a dual strategy where we distinguish between results obtained from social media platforms and those obtained from other sources. In our dual strategy, the two types of documents are disambiguated separately, using different strategies, and their results are then merged. We study several instantiations for the different stages in our proposed strategy and manage to achieve state-of-the-art performance.
Robertson, Stephen (2012): On smoothing average precision. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 158-169. Available online
On the basis of a theoretical analysis of issues around populations and sampling, for both topics and documents, and parameters with which we hope to characterise the effectiveness of different systems, we propose a modification to the traditional average precision metric. This modification involves both transformation and (in the estimation of the parameter) smoothing. The modified version is shown to have certain distributional advantages, on a substantial dataset. In particular, the distribution of values of the modified metric, over topics for a given system/run, is approximately normal.
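For reference, the traditional (unmodified) average precision that this abstract takes as its starting point can be computed as below. This is a minimal sketch of the standard metric only; the abstract does not specify the transformation and smoothing it proposes, so they are not attempted here. The function name and the `num_relevant` parameter are illustrative choices.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Traditional average precision for one topic.

    ranked_relevance: list of 0/1 relevance flags in rank order.
    num_relevant: total number of relevant documents for the topic;
        defaults to the number of relevant documents in the ranking.
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / num_relevant


# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision([1, 0, 1, 0]))
```

Because the value is bounded between 0 and 1 and often piles up near the ends of that range, its distribution over topics is typically far from normal, which is the kind of distributional issue the abstract addresses.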
Eskevich, Maria, Magdy, Walid and Jones, Gareth J. F. (2012): New metrics for meaningful evaluation of informally structured speech retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 170-181. Available online
Search effectiveness for tasks where the retrieval units are clearly defined documents is generally evaluated using standard measures such as mean average precision (MAP). However, many practical speech search tasks focus on content within large spoken files lacking defined structure. These data must be segmented into smaller units for search which may only partially overlap with relevant material. We introduce two new metrics for the evaluation of search effectiveness for informally structured speech data: mean average segment precision (MASP) which measures retrieval performance in terms of both content segmentation and ranking with respect to relevance; and mean average segment distance-weighted precision (MASDWP) which takes into account the distance between the start of the relevant segment and the retrieved segment. We demonstrate the effectiveness of these new metrics on a retrieval test collection based on the AMI meeting corpus.
Hosseini, Mehdi, Cox, Ingemar J., Milic-Frayling, Nataša, Kazai, Gabriella and Vinay, Vishwa (2012): On aggregating labels from multiple crowd workers to infer relevance of documents. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 182-194. Available online
We consider the problem of acquiring relevance judgements for information retrieval (IR) test collections through crowdsourcing when no true relevance labels are available. We collect multiple, possibly noisy relevance labels per document from workers of unknown labelling accuracy. We use these labels to infer the document relevance based on two methods. The first method is the commonly used majority voting (MV) which determines the document relevance based on the label that received the most votes, treating all the workers equally. The second is a probabilistic model that concurrently estimates the document relevance and the workers' accuracy using expectation maximization (EM). We run simulations and conduct experiments with crowdsourced relevance labels from the INEX 2010 Book Search track to investigate the accuracy and robustness of the relevance assessments to the noisy labels. We observe the effect of the derived relevance judgments on the ranking of the search systems. Our experimental results show that the EM method outperforms the MV method in the accuracy of relevance assessments and IR systems ranking. The performance improvements are especially noticeable when the number of labels per document is small and the labels are of varied quality.
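The contrast between the two aggregation methods can be sketched in a few lines. The abstract does not give the details of its probabilistic model, so the EM variant below is an assumption: a simple "one-coin" model with binary labels in which each worker has a single accuracy. The function names, the clipping bounds and the toy labels are all hypothetical.

```python
from collections import defaultdict


def majority_vote(labels):
    """labels: list of (doc, worker, label) triples with label in {0, 1}.

    Returns {doc: 0 or 1}. Ties resolve to 1, an arbitrary choice.
    """
    votes = defaultdict(list)
    for doc, _, label in labels:
        votes[doc].append(label)
    return {doc: int(2 * sum(v) >= len(v)) for doc, v in votes.items()}


def clip(x, low=0.01, high=0.99):
    """Keep probabilities away from 0 and 1 so the likelihoods stay positive."""
    return min(max(x, low), high)


def em_relevance(labels, iterations=20):
    """One-coin EM (an assumed model): returns (P(relevant) per doc, accuracy per worker)."""
    votes = defaultdict(list)
    for doc, _, label in labels:
        votes[doc].append(label)
    # Initialise each posterior with the fraction of "relevant" votes.
    post = {doc: sum(v) / len(v) for doc, v in votes.items()}
    acc = {}
    for _ in range(iterations):
        # M-step: a worker's accuracy is the expected fraction of correct labels.
        correct = defaultdict(float)
        count = defaultdict(int)
        for doc, worker, label in labels:
            correct[worker] += post[doc] if label == 1 else 1.0 - post[doc]
            count[worker] += 1
        acc = {w: clip(correct[w] / count[w]) for w in count}
        prior = clip(sum(post.values()) / len(post))
        # E-step: posterior relevance given the current worker accuracies.
        like1 = {doc: prior for doc in post}
        like0 = {doc: 1.0 - prior for doc in post}
        for doc, worker, label in labels:
            like1[doc] *= acc[worker] if label == 1 else 1.0 - acc[worker]
            like0[doc] *= acc[worker] if label == 0 else 1.0 - acc[worker]
        post = {doc: like1[doc] / (like1[doc] + like0[doc]) for doc in post}
    return post, acc


# Toy data: workers A and B always agree; worker C disagrees on d1 and d2.
labels = [
    ("d1", "A", 1), ("d1", "B", 1), ("d1", "C", 0),
    ("d2", "A", 0), ("d2", "B", 0), ("d2", "C", 1),
    ("d3", "A", 1), ("d3", "B", 1), ("d3", "C", 1),
    ("d4", "A", 0), ("d4", "B", 0), ("d4", "C", 0),
]
mv = majority_vote(labels)
post, acc = em_relevance(labels)
print(mv)
print(post, acc)
```

On this toy input both methods agree on the labels, but the EM sketch additionally assigns worker C a lower estimated accuracy than A and B. Weighting votes by such estimates is what lets an EM approach differ from majority voting when labels are few or noisy.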
On-line photo sharing services allow users to share their touristic experiences. Tourists can publish photos of interesting locations or monuments visited, and they can also share comments, annotations, and even the GPS traces of their visits. By analyzing such data, it is possible to turn colorful photos into metadata-rich trajectories through the points of interest present in a city. In this paper we propose a novel algorithm for the interactive generation of personalized recommendations of touristic places of interest based on the knowledge mined from photo albums and Wikipedia. The distinguishing features of our approach are multiple. First, the underlying recommendation model is built fully automatically in an unsupervised way and it can be easily extended with heterogeneous sources of information. Moreover, recommendations are personalized according to the places previously visited by the user. Finally, such personalized recommendations can be generated very efficiently even on-line from a mobile device.
Nawab, Rao Muhammad Adeel, Stevenson, Mark and Clough, Paul (2012): Retrieving candidate plagiarised documents using query expansion. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 207-218. Available online
External plagiarism detection systems compare suspicious texts against a reference collection to identify the original one(s). The suspicious text may not contain a verbatim copy of the reference collection since plagiarists often try to disguise their behaviour by altering the text. For large reference collections, such as those accessible via the internet, it is not practical to compare the suspicious text with every document in the reference collection. Consequently many approaches to plagiarism detection begin by identifying a set of candidate documents from the reference collection. We report an IR-based approach to the candidate document selection problem that uses query expansion to identify candidates which have been altered. The reported system outperforms a previously reported approach and is also robust to changes in the reference collection text.
Sondhi, Parikshit, Vydiswaran, V. G. Vinod and Zhai, Cheng Xiang (2012): Reliability prediction of webpages in the medical domain. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 219-231. Available online
In this paper, we study how to automatically predict reliability of web pages in the medical domain. Assessing reliability of online medical information is especially critical as it may potentially influence vulnerable patients seeking help online. Unfortunately, there are no automated systems currently available that can classify a medical webpage as being reliable, while manual assessment cannot scale up to process the large number of medical pages on the Web. We propose a supervised learning approach to automatically predict reliability of medical webpages. We developed a gold standard dataset using the standard reliability criteria defined by the Health on Net Foundation and systematically experimented with different link and content based feature sets. Our experiments show promising results with prediction accuracies of over
Tam, Tony, Ferreira, Artur and Lourenço, André (2012): Automatic foldering of email messages: a combination approach. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 232-243. Available online
Automatic organization of email messages into folders is both an open problem and a challenge for machine learning techniques. Besides the effect of email overload, which affects many email users worldwide, there are some increasing difficulties caused by the semantics applied by each user. The varying number of folders and their meaning are personal and in many cases pose difficulties to learning methods. This paper addresses automatic organization of email messages into folders, based on supervised learning algorithms. The textual fields of the email message (subject and body) are considered for learning, with different representations, feature selection methods, and classifiers. The participant fields are embedded into a vector-space model representation. The classification decisions from the different email fields are combined by majority voting. Experiments on a subset of the Enron Corpus and on a private email data set show a significant improvement over single classifiers on these fields as well as over previous work.
Lv, Yuanhua and Zhai, Chengxiang (2012): A log-logistic model-based interpretation of TF normalization of BM25. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 244-255. Available online
The effectiveness of BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived based on the classic probabilistic retrieval model, it has been so far unclear how to interpret its parameter k1 probabilistically, making it hard to optimize the setting of this parameter. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1 based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to estimation of parameter k1 based solely on the current collection without requiring any training data, thus effectively eliminating one free parameter from BM25. Our experiment results show that the proposed approaches can accurately predict the optimal k1 without requiring training data and achieve better or comparable retrieval performance to a well-tuned BM25 where k1 is optimized based on training data.
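The sub-linear TF normalization discussed in this abstract is easy to see numerically. The sketch below implements the standard BM25 term-frequency component only (no IDF factor). The length-normalization parameter `b` and the document lengths are part of standard BM25 but are not mentioned in the abstract, and the default values used here are common conventions, not values from the paper.

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Standard BM25 term-frequency component (the IDF factor is omitted).

    k1 controls how quickly the contribution of repeated occurrences saturates;
    b controls document-length normalization.
    """
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


# For a document of average length, the component grows with tf
# but never reaches its upper bound of k1 + 1.
for tf in (1, 2, 3, 10, 100):
    print(tf, round(bm25_tf(tf), 3))
```

Each additional occurrence of a term adds less than the previous one, and the value stays below k1 + 1, so the choice of k1 determines how strongly repeated occurrences are rewarded.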
Harvey, Morgan and Elsweiler, David (2012): Exploring query patterns in email search. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 25-36. Available online
Despite email being the most popular communication medium currently in use, and despite people having been shown to regularly re-use messages, very little is known about how people actually search within email clients. In this paper we present a detailed analysis of email search behaviour obtained from a study of 47 users. We uncover a number of behavioral patterns that contrast with those previously observed in web search. From our findings, we describe ways in which email search could be improved and conclude with a short discussion of possible future work.
Gerani, Shima, Zhai, Chengxiang and Crestani, Fabio (2012): Score transformation in linear combination for multi-criteria relevance ranking. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 256-267. Available online
In many Information Retrieval (IR) tasks, documents should be ranked based on a combination of multiple criteria. Therefore, we would need to score a document in each criterion aspect of relevance and then combine the criteria scores to generate a final score for each document. Linear combination of these aspect scores has so far been the dominant approach due to its simplicity and effectiveness. However, such a strategy of combination requires that the scores to be combined are "comparable" to each other, an assumption that generally does not hold due to the different ways of scoring each criterion. Thus it is necessary to transform the raw scores for different criteria appropriately to make them more comparable before combination. In this paper we propose a new principled approach to score transformation in linear combination, in which we would learn a separate non-linear transformation function for each relevance criterion based on the Alternating Conditional Expectation (ACE) algorithm and BoxCox Transformation. Experimental results show that the proposed method is effective and is also robust against non-linear perturbations of the original scores.
Karimzadehgan, Maryam and Zhai, Chengxiang (2012): Axiomatic analysis of translation language model for information retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 268-280. Available online
Statistical translation models have been shown to outperform simple document language models which rely on exact matching of words in the query and documents. A main challenge in applying translation models to ad hoc information retrieval is to estimate a translation model without training data. In this paper, we perform axiomatic analysis of translation language model for retrieval in order to gain insights about how to optimize the estimation of translation probabilities. We propose a set of constraints that a reasonable translation language model should satisfy. We check these constraints on the state-of-the-art translation estimation method based on Mutual Information and find that it does not satisfy most of the constraints. We then propose a new estimation method that better satisfies the defined constraints. Experimental results on representative TREC data sets show that the proposed new estimation method outperforms the existing Mutual Information-based estimation, suggesting that the proposed constraints are indeed helpful for designing better estimation methods for translation language model.
Li, Bo and Gaussier, Eric (2012): An information-based cross-language information retrieval model. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 281-292. Available online
We present in this paper well-founded cross-language extensions of the recently introduced models in the information-based family for information retrieval, namely the LL (log-logistic) and SPL (smoothed power law) models. These extensions are based on (a) a generalization of the notion of information used in the information-based family, (b) a generalization of the random variables also used in this family, and (c) the direct expansion of query terms with their translations. We then review these extensions from a theoretical point-of-view, prior to assessing them experimentally. The results of the experimental comparisons between these extensions and existing CLIR systems, on three collections and three language pairs, reveal that the cross-language extension of the LL model provides a state-of-the-art CLIR system, yielding the best performance overall.
Dai, Keshi, Pavlu, Virgil, Kanoulas, Evangelos and Aslam, Javed A. (2012): Extended expectation maximization for inferring score distributions. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 293-304. Available online
Inferring the distributions of relevant and nonrelevant documents over a ranked list of scored documents returned by a retrieval system has a broad range of applications including information filtering, recall-oriented retrieval, metasearch, and distributed IR. Typically, the distribution of documents over scores is modeled by a mixture of two distributions, one for the relevant and one for the nonrelevant documents, and expectation maximization (EM) is run to estimate the mixture parameters. A large volume of work has focused on selecting the appropriate form of the two distributions in the mixture. In this work we consider the form of the distributions as a given and we focus on the inference algorithm. We extend the EM algorithm (a) by simultaneously considering the ranked lists of documents returned by multiple retrieval systems, and (b) by encoding in the algorithm the constraint that the same document retrieved by multiple systems should have the same, global, probability of relevance. We test the new inference algorithm using TREC data and we demonstrate that it outperforms the regular EM algorithm. It is better calibrated in inferring the probability of a document's relevance, and it is more effective when applied to the task of metasearch.
Zuccon, Guido, Azzopardi, Leif, Zhang, Dell and Wang, Jun (2012): Top-k retrieval using facility location analysis. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 305-316. Available online
The top-k retrieval problem aims to find the optimal set of k documents from a number of relevant documents given the user's query. The key issue is to balance the relevance and diversity of the top-k search results. In this paper, we address this problem using Facility Location Analysis taken from Operations Research, where the locations of facilities are optimally chosen according to some criteria. We show how this analysis technique is a generalization of state-of-the-art retrieval models for diversification (such as the Modern Portfolio Theory for Information Retrieval), which treat the top-k search results like "obnoxious facilities" that should be dispersed as far as possible from each other. However, Facility Location Analysis suggests that the top-k search results could be treated like "desirable facilities" to be placed as close as possible to their customers. This leads to a new top-k retrieval model where the best representatives of the relevant documents are selected. In a series of experiments conducted on two TREC diversity collections, we show that significant improvements can be made over the current state-of-the-art through this alternative treatment of the top-k retrieval problem.
Kreuzer, Roman, Springmann, Michael, Kabary, Ihab Al and Schuldt, Heiko (2012): An interactive paper and digital pen interface for query-by-sketch image retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 317-328. Available online
A major challenge when dealing with large collections of digital images is to find relevant objects, especially when no metadata on the objects is available. Content-based image retrieval (CBIR) addresses this problem but usually lacks query images that are good enough to express the user's information need. Therefore, in Query-by-Sketch, CBIR has been considered with user provided sketches as query objects -- but so far, this has suffered from the limitations of existing user interfaces. In this paper, we present a novel user interface for query by sketch that exploits emergent interactive paper and digital pen technology. Users can draw sketches on paper in a user-friendly way. Search can be started interactively from the paper front-end, due to a streaming interface from the digital pen to the underlying CBIR system. We present the implementation of the interactive paper/digital pen interface on top of QbS, our system for CBIR using sketches, and we present in detail the evaluation of the system on the basis of the MIRFLICKR-25000 image collection.
Coelho, Filipe and Ribeiro, Cristina (2012): Image abstraction in crossmedia retrieval for text illustration. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 329-339. Available online
Text illustration is a multimedia retrieval task that consists in finding suitable images to illustrate text fragments such as blog entries, news reports or children's stories. In this paper we describe a crossmedia retrieval system which, given a textual input, selects a short list of candidate images from a large media collection. This approach makes use of a recently proposed method to map metadata and visual features into a common textual representation that can be handled by traditional information retrieval engines. Content-based analysis is enhanced by visual abstraction, namely the Anisotropic Kuwahara Filter, which impacts feature information captured by the Joint Composite and Speeded Up Robust Features visual descriptors. For evaluation purposes, we used the well-established MIRFlickr photo collection, with 25,000 photos and user tags collected from Flickr as well as manual annotations provided as image retrieval groundtruth. Results show that image abstraction can improve visual retrieval as well as significantly reduce processing and storage requirements, even more when paired with Google's WebP image format. We conclude that applying a visual rerank after an initial text retrieval step improves the quality of results, and that the adopted text mapping method for visual descriptors provides an effective crossmedia approach for text illustration.
Quattoni, Ariadna, Carreras, Xavier and Torralba, Antonio (2012): A latent variable ranking model for content-based retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 340-351. Available online
Since their introduction, ranking SVM models have become a powerful tool for training content-based retrieval systems. All we need for training a model are retrieval examples in the form of triplet constraints, i.e. examples specifying that relative to some query, a database item a should be ranked higher than database item b. These types of constraints could be obtained from feedback of users of the retrieval system. Most previous ranking models learn either a global combination of elementary similarity functions or a combination defined with respect to a single database item. Instead, we propose a "coarse to fine" ranking model where given a query we first compute a distribution over "coarse" classes and then use the linear combination that has been optimized for queries of that class. These coarse classes are hidden and need to be induced by the training algorithm. We propose a latent variable ranking model that induces both the latent classes and the weights of the linear combination for each class from ranking triplets. Our experiments over two large image datasets and a text retrieval dataset show the advantages of our model over learning a global combination as well as a combination for each test point (i.e. transductive setting). Furthermore, compared to the transductive approach our model has a clear computational advantage since it does not need to be retrained for each test query.
Parapar, Javier and Barreiro, Alvaro (2012): Language modelling of constraints for text clustering. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 352-363. Available online
Constrained clustering is a recently presented family of semi-supervised learning algorithms. These methods use domain information to impose constraints over the clustering output. The way in which those constraints (typically pair-wise constraints between documents) are introduced is by designing new clustering algorithms that enforce the accomplishment of the constraints. In this paper we present an alternative approach for constrained clustering where, instead of defining new algorithms or objective functions, the constraints are introduced by modifying the document representation by means of their language modelling. More precisely, the constraints are modelled using the well-known Relevance Models successfully used in other retrieval tasks such as pseudo-relevance feedback. To the best of our knowledge this is the first attempt to try such an approach. The results show that the presented approach is an effective method for constrained clustering, even improving on the results of existing constrained clustering algorithms.
Bosma, Maarten, Meij, Edgar and Weerkamp, Wouter (2012): A framework for unsupervised spam detection in social networking sites. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 364-375. Available online
Social networking sites offer users the option to submit user spam reports for a given message, indicating this message is inappropriate. In this paper we present a framework that uses these user spam reports for spam detection. The framework is based on the HITS web link analysis framework and is instantiated in three models. The models subsequently introduce propagation between messages reported by the same user, messages authored by the same user, and messages with similar content. Each of the models can also be converted to a simple semi-supervised scheme. We test our models on data from a popular social network and compare the models to two baselines, based on message content and raw report counts. We find that our models outperform both baselines and that each of the additions (reporters, authors, and similar messages) further improves the performance of the framework.
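Since the framework in this abstract builds on HITS, a sketch of plain HITS may help as background. The code below is the textbook hub/authority iteration, not the paper's three models. Treating reporting users as hubs and reported messages as authorities is an assumption made only to build a toy example, and the node names are hypothetical.

```python
def hits(edges, iterations=50):
    """Plain HITS power iteration.

    edges: list of (source, target) pairs.
    Returns (hub, authority) dictionaries with L2-normalised scores.
    """
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of nodes pointing here.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {node: v / norm for node, v in auth.items()}
        # Hub update: sum of the authority scores of nodes pointed to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {node: v / norm for node, v in hub.items()}
    return hub, auth


# Toy report graph (assumed mapping): user u1 reports m1; user u2 reports m1 and m2.
edges = [("u1", "m1"), ("u2", "m1"), ("u2", "m2")]
hub, auth = hits(edges)
print(auth["m1"], auth["m2"])
```

In this toy graph the message reported by both users ends up with the higher authority score, and the user who reports more messages gets the higher hub score.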
Diriye, Abdigani, Kumaran, Giridhar and Huang, Jeff (2012): Interactive search support for difficult web queries. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 37-49. Available online
Short and common web queries are aptly supported by state-of-the-art search engines but performance and user experience are degraded when web queries are longer and less common. Extending previous solutions that automatically shorten queries, we introduce searchAssist: a novel search interface that provides interactive support for difficult web queries. The query logs and questionnaires from a naturalistic study of 90 web users' search behaviors show
Vitale, Daniele, Ferragina, Paolo and Scaiella, Ugo (2012): Classification of short texts by deploying topical annotations. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 376-387. Available online
We propose a novel approach to the classification of short texts based on two factors: the use of Wikipedia-based annotators that have been recently introduced to detect the main topics present in an input text, represented via Wikipedia pages, and the design of a novel classification algorithm that measures the similarity between the input text and each output category by deploying only their annotated topics and the Wikipedia link-structure. Our approach waives the common practice of expanding the feature-space with new dimensions derived either from explicit or from latent semantic analysis. As a consequence it is simple and maintains a compact intelligible representation of the output categories. Our experiments show that it is efficient in construction and query time, as accurate as state-of-the-art classifiers (see e.g. Phan et al. WWW '08), and robust with respect to concept drifts and input sources.
Tholpadi, Goutham, Das, Mrinal Kanti, Bhattacharyya, Chiranjib and Shevade, Shirish (2012): Cluster labeling for multilingual scatter/gather using comparable corpora. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 388-400. Available online
Scatter/Gather systems are increasingly becoming useful in browsing document corpora. Usability of the present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries/machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.
Alici, Sadiye, Altingovde, Ismail Sengor, Ozcan, Rifat, Cambazoglu, B. Barla and Ulusoy, Özgür (2012): Adaptive time-to-live strategies for query result caching in web search engines. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 401-412. Available online
An important research problem that has recently started to receive attention is the freshness issue in search engine result caches. In the current techniques in literature, the cached search result pages are associated with a fixed time-to-live (TTL) value in order to bound the staleness of search results presented to the users, potentially as part of a more complex cache refresh or invalidation mechanism. In this paper, we propose techniques where the TTL values are set in an adaptive manner, on a per-query basis. Our results show that the proposed techniques reduce the fraction of stale results served by the cache and also decrease the fraction of redundant query evaluations on the search engine backend compared to a strategy using a fixed TTL value for all queries.
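To make the idea of a per-query TTL concrete, here is a minimal Python sketch of a result cache whose TTL grows while a query's results stay unchanged and resets when they change. The class name, the doubling rule, and the `min_ttl`/`max_ttl` parameters are assumptions chosen for illustration; the abstract does not specify the adaptive strategies the authors actually propose.

```python
class AdaptiveTTLCache:
    """Result cache with a per-query TTL (hypothetical policy, not the paper's)."""

    def __init__(self, min_ttl=60.0, max_ttl=3600.0):
        self.min_ttl = min_ttl
        self.max_ttl = max_ttl
        self.entries = {}  # query -> (results, stored_at, ttl)

    def get(self, query, now):
        """Return cached results if the entry has not expired, else None."""
        entry = self.entries.get(query)
        if entry is None:
            return None
        results, stored_at, ttl = entry
        return results if now - stored_at < ttl else None

    def put(self, query, results, now):
        """Store fresh results, adapting the TTL to how stable the query has been."""
        old = self.entries.get(query)
        if old is None:
            ttl = self.min_ttl                    # first sighting: be cautious
        elif old[0] == results:
            ttl = min(old[2] * 2, self.max_ttl)   # unchanged: trust it longer
        else:
            ttl = self.min_ttl                    # changed: fall back to short TTL
        self.entries[query] = (results, now, ttl)
```

Passing `now` explicitly, rather than reading the system clock, keeps the sketch deterministic and easy to test.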
Jonassen, Simon and Bratsberg, Svein Erik (2012): Intra-query concurrent pipelined processing for distributed full-text retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 413-425. Available online
Pipelined query processing over a term-wise distributed inverted index has superior throughput at high query multiprogramming levels. However, due to long query latencies this approach is inefficient at lower levels. In this paper we explore two types of intra-query parallelism within the pipelined approach: parallel execution of a query on different nodes and concurrent execution on the same node. According to the experimental results, our approach reaches the throughput of the state-of-the-art method at about half of the latency. In the single-query case the observed latency improvement is up to 2.6 times.
Baeza-Yates, Ricardo and Jonassen, Simon (2012): Modeling static caching in web search engines. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 436-446. Available online
In this paper we model a two-level cache of a Web search engine, such that given memory resources, we find the optimal split fraction to allocate for each cache, results and index. The final result is very simple and requires computing just five parameters that depend on the input data and the performance of the search engine. The model is validated through extensive experimental results and is motivated by capacity planning and the overall optimization of the search architecture.
Hienert, Daniel, Sawitzki, Frank, Schaer, Philipp and Mayr, Philipp (2012): Integrating interactive visualizations in the search process of digital libraries and IR systems. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 447-450. Available online
Interactive visualizations for exploring and retrieval have not yet become an integral part of digital libraries and information retrieval systems. We have integrated a set of interactive graphics in a real world social science digital library. These visualizations support the exploration of search queries, results and authors, can filter search results, show trends in the database and can support the creation of new search queries. The use of weighted brushing supports the identification of related metadata for search facets. In a user study we verify that users can gain insights from statistical graphics intuitively and can adopt interaction techniques.
Cummins, Ronan and O'Riordan, Colm (2012): On theoretically valid score distributions in information retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 451-454. Available online
In this paper, we aim to investigate the practical usefulness of the Recall-Fallout Convexity Hypothesis (RFCH) for a number of document score distribution (SD) models. We compare SD models that do not automatically adhere to the RFCH to modified versions of the same SD models that do adhere to the RFCH. We compare these models using the inference of average precision as a measure of utility. For the three models studied in this paper, we conclude that adhering to the RFCH is practically useful for the two-normal model, makes no difference for the two-gamma model, and degrades the performance of the two-lognormal model.
Peetz, Maria-Hendrike, Meij, Edgar, Rijke, Maarten de and Weerkamp, Wouter (2012): Adaptive temporal query modeling. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 455-458. Available online
We present an approach to query modeling that uses the temporal distribution of documents in an initially retrieved set of documents. Such distributions tend to exhibit bursts, especially in news-related document collections. We hypothesize that documents in those bursts are more likely to be relevant and update the query model with the most distinguishing terms in high-quality documents sampled from bursts. We evaluate the effectiveness of our models on a test collection of blog posts.
Do, Trien V. and Ruddle, Roy A. (2012): The design of a visual history tool to help users refind information within a website. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 459-462. Available online
On the WWW users frequently revisit information they have previously seen, but "keeping found things found" is difficult when the information has not been visited frequently or recently, even if a user knows which website contained the information. This paper describes the design of a tool to help users refind information within a given website. The tool encodes data about a user's interest in webpages (measured by dwell time), the frequency and recency of visits, and navigational associations between pages, and presents navigation histories in list- and graph-based forms.
Chelaru, Sergiu, Altingovde, Ismail Sengor and Siersdorfer, Stefan (2012): Analyzing the polarity of opinionated queries. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 463-467. Available online
In this paper, we present an in-depth analysis of Web search queries for controversial topics, focusing on query sentiment. To this end, we conduct extensive user assessments as well as an automatic sentiment analysis using the SentiWordNet thesaurus.
Martinez-Alvarez, Miguel, Yahyaei, Sirvan and Roelleke, Thomas (2012): Semi-automatic document classification: exploiting document difficulty. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 468-471. Available online
There are circumstances where classification is required only if a certain condition, such as a specific level of quality, is met. This paper investigates a semi-automatic solution where only the predictions for the documents which are more likely to be correctly classified would be considered. This method provides high-quality automatic classification for large subsets of the collection and employs human expertise for the "most complicated" decisions. This research presents different approaches to measure document difficulty and it discusses the benefits of applying it for semi-automatic classification. In addition, experiments are carried out to show the results achieved for different subsets of the collection. Experiments prove that it is possible to improve quality significantly with large subsets (i.e. 13% micro-F1 increase with 70% of documents) of two different collections. Furthermore, it shows how it provides a flexible mechanism to apply automatic classification to specific subsets while specific constraints are met.
Aker, Ahmet, Fan, Xin, Sanderson, Mark and Gaizauskas, Robert (2012): Investigating summarization techniques for geo-tagged image indexing. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 472-475. Available online
Images with geo-tagging information are increasingly available on the Web. However, such images need to be annotated with additional textual information if they are to be retrievable, since users do not search by geo-coordinates. We propose to automatically generate such textual information by (1) generating toponyms from the geo-tagging information (2) retrieving Web documents using toponyms as queries (3) summarizing the retrieved documents. The summaries are then used to index the images. In this paper we investigate how various summarization techniques affect image retrieval performance and show significant improvements can be obtained when using the summaries for indexing.
Chheda, Parin, Faruqui, Manaal and Mitra, Pabitra (2012): Handling OOV words in Indian-language -- English CLIR. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 476-479. Available online
Because of the lack of resources, cross-lingual information retrieval is a difficult task for many Indian languages. Google Translate provides an easy way of translating from Indian languages to English, but due to lexicon limitations most of the out-of-vocabulary words get transliterated letter by letter along with their suffix, resulting in an unusually long string. The resulting string often does not match its intended translation, which hurts retrieval. We propose an approach to extract the correct word from such strings using word segmentation along with approximate string matching using the Soundex algorithm and Levenshtein distance. We evaluate our approach across three Indian languages and find an average improvement of 5.8% MAP on the FIRE-2010 dataset.
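As an illustration of the approximate string matching step named in the abstract above, the following minimal Python sketch computes Levenshtein distance and selects the closest word from a vocabulary. The function names, the toy vocabulary in the usage note, and the `max_distance` cutoff are assumptions for illustration and are not taken from the paper; the word segmentation and Soundex steps are omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_word(candidate, vocabulary, max_distance=2):
    """Return the vocabulary word nearest to candidate, or None if too far."""
    best = min(vocabulary, key=lambda w: levenshtein(candidate, w))
    return best if levenshtein(candidate, best) <= max_distance else None
```

For example, `closest_word("goverment", ["government", "garden"])` would select "government", which is one insertion away, while a string with no near neighbour in the vocabulary yields `None`.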
Boudin, Florian, Nie, Jian-Yun and Dawes, Martin (2012): Using a medical thesaurus to predict query difficulty. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 480-484. Available online
Estimating query performance is the task of predicting the quality of results returned by a search engine in response to a query. In this paper, we focus on pre-retrieval prediction methods for the medical domain. We propose a novel predictor that exploits a thesaurus to ascertain how difficult queries are. In our experiments, we show that our predictor outperforms the state-of-the-art methods that do not use a thesaurus.
Devezas, José, Coelho, Filipe, Nunes, Sérgio and Ribeiro, Cristina (2012): Studying a personality coreference network in a news stories photo collection. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 485-488. Available online
We build and analyze a coreference network based on entities from photo descriptions, where nodes represent personalities and edges connect people mentioned in the same photo description. We identify and characterize the communities in this network and propose taking advantage of the context provided by community detection methodologies to improve text illustration and general search.
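The network construction described in the abstract above (nodes for personalities, edges between people mentioned in the same photo description) can be sketched in a few lines of Python. This is a minimal sketch that assumes entity names have already been extracted from each description; the function name and the placeholder data are hypothetical, and the community detection step is not shown.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(descriptions):
    """Map each unordered pair of names to the number of descriptions mentioning both."""
    edges = Counter()
    for names in descriptions:
        # Deduplicate and sort so each pair has one canonical (a, b) key.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical input: names already extracted from each photo description.
photos = [
    ["Person A", "Person B"],
    ["Person A", "Person B", "Person C"],
    ["Person C"],
]
network = build_cooccurrence_network(photos)
```

The edge weights count how often two people appear together, which is one plausible input for a community detection method.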
Das, Sujatha, Mitra, Prasenjit and Giles, C. Lee (2012): Phrase pair classification for identifying subtopics. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 489-493. Available online
Automatic identification of subtopics for a given topic is desirable because it eliminates the need for manual construction of domain-specific topic hierarchies. In this paper, we design features based on corpus statistics to design a classifier for identifying the (subtopic, topic) links between phrase pairs. We combine these features along with the commonly-used syntactic patterns to classify phrase pairs from datasets in Computer Science and WordNet. In addition, we show a novel application of our is-a-subtopic-of classifier for query expansion in Expert Search and compare it with pseudo-relevance feedback.
Gallé, Matthias and Renders, Jean-Michel (2012): Full and mini-batch clustering of news articles with Star-EM. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 494-498. Available online
We present a new threshold-based clustering algorithm for news articles. The algorithm consists of two phases: in the first, a local optimum of a score function that captures the quality of a clustering is found with an Expectation-Maximization approach. In the second phase, the algorithm reduces the number of clusters and, in particular, is able to build non-spherical-shaped clusters. We also give a mini-batch version which allows an efficient dynamic processing of data points as they arrive in groups. Our experiments on the TDT5 benchmark collection show the superiority of both versions of this algorithm compared to other state-of-the-art alternatives.
Zhou, Ke, Cummins, Ronan, Halvey, Martin, Lalmas, Mounia and Jose, Joemon M. (2012): Assessing and predicting vertical intent for web queries. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 499-502. Available online
Aggregating search results from a variety of heterogeneous sources, i.e. so-called verticals, such as news, image, video and blog, into a single interface has become a popular paradigm in web search. In this paper, we present the results of a user study that collected more than 1,500 assessments of vertical intent over 320 web topics. Firstly, we show that users prefer diverse vertical content for many queries and that the level of inter-assessor agreement for the task is fair. Secondly, we propose a methodology to predict the vertical intent of a query using a search engine log by exploiting click-through data, and show that it outperforms traditional approaches.
The amount of news content on the Web is increasing, enabling users to access news articles coming from a variety of sources: from newswires, news agencies, blogs, and at various places, e.g. even within Web search engine result pages. However, it is still a challenge for current search engines to decide which news events are worth being shown to the user (either for a newsworthy query or in a news portal). In this paper we define the task of predicting the future impact of news events. Being able to predict event impact will, for example, enable a newspaper to decide whether to follow a specific event or not, or a news search engine which stories to display. We define a flexible framework that, given some definition of impact, can predict its future development at the beginning of the event. We evaluate several possible definitions of event impact and experimentally identify the best features for each of them.
Oghina, Andrei, Breuss, Mathias, Tsagkias, Manos and Rijke, Maarten de (2012): Predicting IMDB movie ratings using social media. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 503-507. Available online
We predict IMDb movie ratings and consider two sets of features: surface and textual features. For the latter, we assume that no social media signal is isolated and use data from multiple channels that are linked to a particular movie, such as tweets from Twitter and comments from YouTube. We extract textual features from each channel to use in our prediction model and we explore whether data from either of these channels can help to extract a better set of textual features for prediction. Our best performing model is able to rate movies very close to the observed values.
Toraman, Cagri and Can, Fazli (2012): Squeezing the ensemble pruning: faster and more accurate categorization for news portals. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 508-511. Available online
Recent studies show that ensemble pruning is as effective as a traditional ensemble of classifiers (EoC). In this study, we analyze how ensemble pruning can improve text categorization efficiency in time-critical real-life applications such as news portals. The two most crucial phases of text categorization are training classifiers and assigning labels to new documents, but the latter is more important for the efficiency of such applications. We conduct experiments on ensemble pruning-based news article categorization to measure its accuracy and time cost. The results show that our heuristics reduce the time cost of the second phase. Also we can make a trade-off between accuracy and time cost to improve both of them with appropriate pruning degrees.
Mantrach, Amin and Renders, Jean-Michel (2012): A general framework for people retrieval in social media with multiple roles. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 512-516. Available online
Internet users are more and more playing multiple roles when connected on the Web, such as "posting", "commenting", "tagging" and "sharing" different kinds of information on various social media. Despite the research interest in the field of social networks, little has been done up to now w.r.t. information access in multi-relational social networks where queries can be multifaceted queries (e.g. a mix of textual key-words and key-persons in some social context). We propose a unified and efficient framework to address such complex queries on multi-modal "social" collections, working in 3 distinct phases, namely: (I) aggregation of documents into modal profiles, (II) expansion of mono-modal subqueries to mono-modal and multi-modal subqueries, (III) relevance score computation through late fusion of the different similarities deduced from profiles and subqueries obtained during the first two phases. Experiments on the ENRON email collection for a recipient proposal task show that competitive results can be obtained using the proposed framework.
Albakour, M-Dyaa, Kruschwitz, Udo, Nanas, Nikolaos, Adeyanju, Ibrahim, Song, Dawei, Fasli, Maria and Roeck, Anne De (2012): Analysis of query reformulations in a search engine of a local web site. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 517-521. Available online
This study examines reformulations of queries submitted to a search engine of a university Web site with a focus on (implicitly derived) user satisfaction and the performance of the underlying search engine. Using a search log of a university Web site we examined all reformulations submitted in a 10-week period and studied the relation between the popularity of the reformulation and the performance of the search engine estimated using a number of clickthrough-based measures. Our findings are a step towards building better query recommendation systems and suggest a number of metrics to evaluate query recommendation systems.
Whiting, Stewart, Klampanos, Iraklis A. and Jose, Joemon M. (2012): Temporal pseudo-relevance feedback in microblog retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 522-526. Available online
Twitter has become a major outlet for news, discussion and commentary of on-going events and trends. Effective searching of Twitter collections poses a number of issues for traditional document-based information retrieval (IR) approaches, such as limited document term statistics and spam. In this paper we propose a novel approach to pseudo-relevance feedback, based upon the temporal profiles of n-grams extracted from the top N relevance feedback tweets. A weighted graph is used to model temporal correlation between n-grams, with a PageRank variant employed to combine both pseudo-relevant document term distribution and temporal collection evidence. Preliminary experiments with the TREC Microblogging 2011 Twitter corpus indicate that through parameter optimisation, retrieval effectiveness can be improved.
Lungley, Deirdre, Kruschwitz, Udo and Song, Dawei (2012): Learning adaptive domain models from click data to bootstrap interactive web search. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 527-530. Available online
Today, searchers exploring the World Wide Web have come to expect enhanced search interfaces -- query completion and related searches have become standard. Here we propose a Formal Concept Analysis lattice as an underlying domain model to provide a source of query refinements. The initial lattice is constructed using NLP. User clicks on documents, seen as implicit user feedback, are harnessed to adapt it. In this paper, we explore the viability of this adaptation process and the results we present demonstrate its promise and limitations for proposing initial effective refinements when searching the diverse WWW domain.
Diriye, Abdigani, Tombros, Anastasios and Blandford, Ann (2012): A little interaction can go a long way: enriching the query formulation process. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 531-534. Available online
This poster argues for a need for more dialogue and richer information and interaction during query formulation between the user and the system. We present two novel methods -- query previews and categorised Interactive Query Expansions -- that seek to do just this. Our method enriches a searcher's query formulation by leveraging semantic information to help identify the topicality of the term, and the outcomes of its selection. The initial findings are largely positive and suggest user preference.
Lubell-Doughtie, Peter and Hofmann, Katja (2012): Learning to rank from relevance feedback for e-discovery. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 535-539. Available online
In recall-oriented search tasks retrieval systems are privy to a greater amount of user feedback. In this paper we present a novel method of combining relevance feedback with learning to rank. Our experiments use data from the 2010 TREC Legal track to demonstrate that learning to rank can tune relevance feedback to improve result rankings for specific queries, even with limited amounts of user feedback.
Neumayer, Robert, Balog, Krisztian and Nørvåg, Kjetil (2012): When simple is (more than) good enough: effective semantic search with (almost) no semantics. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 540-543. Available online
Using keyword queries to find entities has emerged as one of the major search types on the Web. In this paper, we study the task of ad-hoc entity retrieval: keyword search in a collection of structured data. We start with a baseline retrieval system that constructs pseudo documents from RDF triples and introduce three extensions: preprocessing of URIs, using two-fielded retrieval models, and boosting popular domains. Using the query sets of the 2010 and 2011 Semantic Search Challenge, we show that our straightforward approach outperforms all previously reported results, some generated by far more complex systems.
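The baseline idea of building pseudo documents from RDF triples can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the function names, the rule of keeping only the last URI segment as readable text, and the example.org URIs. The abstract does not describe the authors' exact URI preprocessing, field definitions, or domain boosting, so none of those are reproduced.

```python
from collections import defaultdict


def local_name(uri):
    """Turn a URI into readable text by keeping its last segment (assumed preprocessing)."""
    tail = uri.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
    return tail.replace("_", " ")


def build_pseudo_documents(triples):
    """Group triples by subject and concatenate their object values into one text per entity."""
    docs = defaultdict(list)
    for subject, _predicate, obj in triples:
        # URI objects are reduced to their local name; literals are kept as-is.
        text = local_name(obj) if obj.startswith("http") else obj
        docs[subject].append(text)
    return {subject: " ".join(parts) for subject, parts in docs.items()}


# Hypothetical triples; the URIs are placeholders, not a real dataset.
triples = [
    ("http://example.org/resource/Berlin", "http://example.org/ns#label", "Berlin"),
    ("http://example.org/resource/Berlin", "http://example.org/ns#country",
     "http://example.org/resource/Germany"),
]
pseudo_docs = build_pseudo_documents(triples)
```

The resulting texts could then be indexed by any standard keyword retrieval system, with each subject URI acting as the document identifier.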
Kelly, Liadh, Bunbury, Paul and Jones, Gareth J. F. (2012): Evaluating personal information retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 544-547. Available online
Evaluation of personal search over an individual's personal information space on the desktop or elsewhere is problematic for reasons relating both to the personal and private nature of the data and the associated personal information needs of collection owners. Indeed challenges associated with evaluation in this space are recognised as one of the key factors hindering the development of research in personal information retrieval. We present the "personal information retrieval evaluation (PIRE)" tool, which provides a solution to this evaluation problem using a 'living laboratory' approach. This tool allows for the evaluation of retrieval techniques using 'real' individuals' personal collections, queries and result sets, in a cross-comparable repeatable way, while importantly maintaining an individual's informational privacy.
Bloom, Niels (2012): Applying power graph analysis to weighted graphs. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 548-551. Available online
We expanded Power Graph Analysis for use with weighted graphs, applying the technique to document categorisation with promising results. With the additional weight information we were able to create more accurate representations of the underlying data while maintaining a high level of edge reduction and improving visualisation of the graph.
Ferguson, Paul, O'Hare, Neil, Lanagan, James, Phelan, Owen and McCarthy, Kevin (2012): An investigation of term weighting approaches for microblog retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 552-555. Available online
The use of effective term frequency weighting and document length normalisation strategies has been shown over a number of decades to have a significant positive effect for document retrieval. When dealing with much shorter documents, such as those obtained from microblogs, it would seem intuitive that these would have less benefit. In this paper we investigate their effect on microblog retrieval performance using the Tweets2011 collection from the TREC 2011 Microblog Track.
Atilgan, Duygu, Altingovde, Ismail Sengor and Ulusoy, Özgür (2012): On the size of full element-indexes for XML keyword search. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 556-560. Available online
We show that a full element-index can be as space-efficient as a direct index with Dewey ids, after compression using typical techniques.
In this paper, we present a new methodology aimed at retrieving relevant product aspects from a collection of customer reviews, as well as the most salient sentiments expressed about them. Our proposal is both unsupervised and domain independent, and does not rely on NLP techniques such as parsing or dependence analysis. In our experiments, the proposed method achieves good values of precision. It is also shown that our approach is capable of properly retrieving the relevant aspects and their sentiments even from individual reviews.
Ozcan, Rifat, Altingovde, Ismail Sengor and Ulusoy, Özgür (2012): In praise of laziness: a lazy strategy for web information extraction. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 565-568. Available online
A large number of Web information extraction algorithms are based on machine learning techniques. For such extraction algorithms, we propose employing a lazy learning strategy to build a specialized model for each test instance to improve the extraction accuracy and avoid the disadvantages of constructing a single general model.
Alhadi, Arifah Che, Gottron, Thomas, Kunegis, Jérôme and Naveed, Nasir (2012): LiveTweet: monitoring and predicting interesting microblog posts. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 569-570. Available online
This paper describes the LiveTweet application, a system for automatically analysing and predicting the interestingness of microblog posts. Based on a stream of recent microblog posts the system tracks user interactions on Twitter that indicate interesting content. An incremental Naive Bayes model is trained to learn the characteristics of tweets which are considered interesting by the users. Finally, the probability of a microblog post being retweeted is used as a metric for its interestingness.
Giangreco, Ivan, Springmann, Michael, Kabary, Ihab Al and Schuldt, Heiko (2012): A user interface for query-by-sketch based image retrieval with color sketches. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 571-572. Available online
This demo will interactively show a system that exploits a novel user interface, running on Tablet PCs or graphic tablets, that provides query-by-sketch based image retrieval using color sketches. The system uses Angular Radial Partitioning (ARP) for the edge information in the sketches and color moments in the CIELAB space, combined with a distance metric that is robust to deviations in color as they usually need to be taken into account with user-generated color sketches.
Maxwell, David, Raue, Stefan, Azzopardi, Leif, Johnson, Chris and Oates, Sarah (2012): Crisees: real-time monitoring of social media streams to support crisis management. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 573-575. Available online
The Crisees demonstrator is a service that aggregates and collects social media streams to support Crisis Management.
Mantrach, Amin and Renders, Jean-Michel (2012): A mailbox search engine using query multi-modal expansion and community-based smoothing. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 576-577. Available online
This demo introduces a new tool (or plug-in) for any email client that automatically decomposes the (personal or shared) mailbox into new virtual folders, corresponding to topics and communities, in an unsupervised way to lighten end-user load. The proposed software implements a retrieval system where the user can search for emails but also for people by submitting a double-faceted query: "key words" and "key persons". The software is able to retrieve three kinds of documents that a matching search-based system would not retrieve. Firstly, by using person profiles, the software will rank documents related to the key persons without requiring them to be participants (i.e. being author or recipient). Secondly, the system will retrieve documents sharing the same topics as the key words but not necessarily containing them. Thirdly, the proposed solution will also retrieve other participants who are members of the communities associated to the key persons.
Strötgen, Jannik, Alonso, Omar and Gertz, Michael (2012): Retro: time-based exploration of product reviews. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 581-582. Available online
Most e-commerce websites organize and present product reviews around ratings with hardly any feature to view them in a time-oriented way. Often, there is a way to sort reviews by time but no further temporal analysis is possible. Thus, usually, only few reviews are part of a user's review analysis process, and there is no way to analyze all reviews of a product collectively. In this paper, we describe Retro, a search engine for exploring product reviews using temporal information.
Diriye, Abdigani and Golovchinsky, Gene (2012): Querium: a session-based collaborative search system. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 583-584. Available online
People's information-seeking can span multiple sessions, and can be collaborative in nature. Existing commercial offerings do not effectively support searchers to share, save, collaborate or revisit their information. In this demo paper we present Querium: a novel session-based collaborative search system that lets users search, share, resume and collaborate with other users. Querium provides a number of novel search features in a collaborative setting, including relevance feedback, query fusion, faceted search, and search histories.
Polajnar, Tamara, Glassey, Richard and Azzopardi, Leif (2012): Detection of news feeds items appropriate for children. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 63-72. Available online
Identifying child-appropriate web content is an important yet difficult classification task. This novel task is characterised by attempting to determine age/child appropriateness (which is not necessarily topic-based), despite the presence of unbalanced class sizes and the lack of quality training data with human judgements of appropriateness. Classification of feeds, a subset of web content, presents further challenges due to their temporal nature and short document format. In this paper, we discuss these challenges and present baseline results for this task through an empirical study that classifies incoming news stories as appropriate (or not) for children. We show that while the naïve Bayes approach produces a higher AUC it is vulnerable to the imbalanced data problem, and that support vector machine provides a more robust overall solution. Our research shows that classifying children's content is a non-trivial task that has greater complexities than standard text based classification. While the F-score values are consistent with other research examining age-appropriate text classification, we introduce a new problem with a new dataset.
Harvey, Morgan, Carman, Mark and Elsweiler, David (2012): Comparing tweets and tags for URLs. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 73-84. Available online
The free-form tags available from social bookmarking sites such as Delicious have been shown to be useful for a number of purposes and could serve as a cheap source of metadata about URLs on the web. Unfortunately, recent years have seen a reduction in the popularity of such sites; at the same time, however, microblogging sites such as Twitter have exploded in popularity. On these sites users submit short messages (or "tweets") about what they are currently reading, thinking and doing and often post URLs. In this work we look into the similarity between top tags drawn from Delicious and high-frequency terms from tweets to ascertain whether Twitter data could serve as a useful replacement for Delicious. We investigate how these terms compare with web page content, whether or not top Twitter terms converge and determine if the terms are mostly descriptive (and therefore useful) or if they are mostly expressing sentiment or emotion. We discover that, provided a large number of tweets referring to a chosen URL are available, the top terms drawn from these tweets are similar to Delicious tags and could therefore be used for similar purposes.
Hauff, Claudia and Houben, Geert-Jan (2012): Geo-Location estimation of flickr images: social web based enrichment. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 85-96. Available online
Estimating the geographic location of images is a task which has received a lot of attention in recent years. Large numbers of items uploaded to Flickr do not contain GPS-based latitude/longitude coordinates, although it would be beneficial to obtain such geographic information for a wide variety of potential applications such as travelogues and visual place descriptions. While most works in this area consider an image's textual meta-data to estimate its geo-location, we consider an additional textual dimension: the image owner's traces on the social Web, in particular on the micro-blogging platform Twitter. We investigate the following question: does enriching an image's available textual meta-data with a user's tweets improve the accuracy of the geographic location estimation process? The results show that this is indeed the case; in
Kim, Jin Young and Croft, W. Bruce (2012): A field relevance model for structured document retrieval. In: Proceedings of the 2012 BCS-IRSG European Conference on Information Retrieval 2012. pp. 97-108. Available online
Many search applications involve documents with structure or fields. Since query terms often are related to specific structural components, mapping queries to fields and assigning weights to those fields is critical for retrieval effectiveness. Although several field-based retrieval models have been developed, there has not been a formal justification of field weighting. In this work, we aim to improve the field weighting for structured document retrieval. We first introduce the notion of field relevance as the generalization of field weights, and discuss how it can be estimated using relevant documents, which effectively implements relevance feedback for field weighting. We then propose a framework for estimating field relevance based on the combination of several sources. Evaluation on several structured document collections shows that field weighting based on the suggested framework improves retrieval effectiveness significantly.
Page maintainer: The Editorial Team