Publication statistics

Publication period: 2005-2012
Publication count: 15
Number of co-authors: 28


Number of publications with 3 favourite co-authors:

Nick Craswell:
Lin Liu:
Weidong Geng:



Productive colleagues

Jun Xu's 3 most productive colleagues in number of publications:

Wei-Ying Ma: 95
Hang Li: 34
Nick Craswell: 24


Jun Xu


Publications by Jun Xu (bibliography)


Wang, Quan, Cao, Zheng, Xu, Jun and Li, Hang (2012): Group matrix factorization for scalable topic modeling. In: Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2012. pp. 375-384. Available online

Topic modeling can reveal the latent structure of text data and is useful for knowledge discovery, search relevance ranking, document classification, and so on. One of the major challenges in topic modeling is to deal with large datasets and large numbers of topics in real-world applications. In this paper, we investigate techniques for scaling up non-probabilistic topic modeling approaches such as Regularized Latent Semantic Indexing (RLSI) and Non-negative Matrix Factorization (NMF). We propose a general topic modeling method, referred to as Group Matrix Factorization (GMF), to enhance the scalability and efficiency of the non-probabilistic approaches. GMF assumes that the text documents have already been categorized into multiple semantic classes, and there exist class-specific topics for each of the classes as well as shared topics across all classes. Topic modeling is then formalized as a problem of minimizing a general objective function with regularizations and/or constraints on the class-specific topics and shared topics. In this way, the learning of class-specific topics can be conducted in parallel, and thus the scalability and efficiency can be greatly improved. We apply GMF to RLSI and NMF, obtaining Group RLSI (GRLSI) and Group NMF (GNMF) respectively. Experiments on a Wikipedia dataset and a real-world web dataset, each containing about 3 million documents, show that GRLSI and GNMF can greatly improve RLSI and NMF in terms of scalability and efficiency. The topics discovered by GRLSI and GNMF are coherent and have good readability. Further experiments on a search relevance dataset, containing 30,000 labeled queries, show that the use of topics learned by GRLSI and GNMF can significantly improve search relevance.

© All rights reserved Wang et al. and/or ACM Press
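The structure GMF assumes (shared topics plus class-specific topics, with per-class updates that can run in parallel) can be sketched with plain alternating least squares. This is a simplified illustration on made-up data: it holds the shared topics fixed and omits the regularizers and constraints the paper discusses.

```python
import numpy as np

# Toy sketch of the structure GMF assumes: each class's term-document
# matrix D_c is approximated by [U_shared | U_c] @ V_c. With U_shared
# fixed, the (U_c, V_c) updates touch only class c's documents, so the
# classes could be processed in parallel; here they run in a plain loop.
rng = np.random.default_rng(0)
m, k_s, k_c = 50, 3, 2                             # terms, shared, class topics
classes = [rng.random((m, 20)) for _ in range(3)]  # made-up D_c matrices

U_s = rng.random((m, k_s))                    # shared topics (held fixed here)
U = [rng.random((m, k_c)) for _ in classes]   # class-specific topics
V = [rng.random((k_s + k_c, D.shape[1])) for D in classes]

def err():
    return sum(np.linalg.norm(D - np.hstack([U_s, Uc]) @ Vc) ** 2
               for D, Uc, Vc in zip(classes, U, V))

before = err()
for _ in range(10):
    for c, D in enumerate(classes):
        # V_c-step: least squares given the current topics.
        V[c] = np.linalg.lstsq(np.hstack([U_s, U[c]]), D, rcond=None)[0]
        # U_c-step: fit the class topics to the residual that the shared
        # topics leave unexplained.
        R = D - U_s @ V[c][:k_s]
        U[c] = np.linalg.lstsq(V[c][k_s:].T, R.T, rcond=None)[0].T
    # A full implementation would also update U_s jointly from all classes.
after = err()
```

Both steps are exact least-squares solves, so the reconstruction error decreases monotonically; the per-class independence is what makes the parallelization the abstract describes possible.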


Li, Hang and Xu, Jun (2012): Beyond bag-of-words: machine learning for query-document matching in web search. In: Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2012. p. 1177. Available online


Xu, Jun, Wu, Wei, Li, Hang and Xu, Gu (2011): A kernel approach to addressing term mismatch. In: Proceedings of the 2011 International Conference on the World Wide Web 2011. pp. 153-154. Available online

This paper addresses the problem of dealing with term mismatch in web search using 'blending'. In blending, the input query as well as queries similar to it are used to retrieve documents, and the ranking results of the documents with respect to these queries are combined to generate a new ranking list. We propose a principled approach to blending, using a kernel method and click-through data. Our approach consists of three elements: a way of calculating query similarity using click-through data, a mixture model for combining rankings using relevance, query similarity, and document similarity scores, and an algorithm for learning the weights of the blending model based on the kernel method. Large-scale experiments on web search and enterprise search datasets show that our approach can effectively solve the term mismatch problem and significantly outperform the baseline methods of query expansion and heuristic blending.

© All rights reserved Xu et al. and/or ACM Press
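The combination step the abstract describes can be sketched in a few lines. This toy version simply sums similarity-weighted scores over the input query and its similar queries; the scores and similarity value are made up, and the authors' method additionally learns the combination weights with a kernel method.

```python
# Minimal sketch of blending: combine per-query ranking scores for an
# input query and queries similar to it, weighted by query similarity.

def blend(query, similar_queries, scores, similarity):
    """Return documents ranked by similarity-weighted combined score.

    scores[q][d]          -> ranking score of document d for query q
    similarity[(q1, q2)]  -> similarity between two queries
    """
    combined = {}
    for q in [query] + similar_queries:
        w = 1.0 if q == query else similarity[(query, q)]
        for doc, s in scores[q].items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined, key=combined.get, reverse=True)

# Toy data: two queries that should retrieve overlapping documents.
scores = {
    "ny times": {"d1": 0.9, "d2": 0.4},
    "new york times": {"d2": 0.8, "d3": 0.7},
}
similarity = {("ny times", "new york times"): 0.6}
ranking = blend("ny times", ["new york times"], scores, similarity)
# -> ['d1', 'd2', 'd3']: d2 rises because both queries retrieve it.
```

Document d2, which the original query alone scores poorly, is boosted by the similar query's result list, which is exactly how blending mitigates term mismatch.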


Wang, Quan, Xu, Jun, Li, Hang and Craswell, Nick (2011): Regularized latent semantic indexing. In: Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2011. pp. 685-694. Available online

Topic modeling can boost the performance of information retrieval, but its real-world application is limited due to scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method which is designed for parallelization. It is as effective as existing topic models, and scales to larger datasets without reducing input vocabulary. RLSI formalizes topic modeling as a problem of minimizing a quadratic loss function regularized by l1 and/or l2 norms. This formulation allows the learning process to be decomposed into multiple sub-optimization problems which can be optimized in parallel, for example via MapReduce. In particular, we propose adopting the l1 norm on topics and the l2 norm on document representations, creating a model with compact, readable topics that is also useful for retrieval. Relevance ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, and the improvements are sometimes statistically significant. Experiments on a web dataset, containing about 1.6 million documents and 7 million terms, demonstrate a similar boost in performance on a larger corpus and vocabulary than in previous studies.

© All rights reserved Wang et al. and/or ACM Press
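The kind of objective RLSI minimizes can be illustrated with a small alternating scheme. The closed-form ridge V-step and the ISTA-style soft-thresholding U-step below are one common way to optimize such an l1/l2-regularized factorization, not necessarily the paper's exact updates, and the data is random.

```python
import numpy as np

# Sketch of an RLSI-style objective: factor a term-document matrix D
# (terms x docs) as U @ V with an l1 penalty on the topics U (sparsity,
# hence readability) and an l2 penalty on document representations V.
rng = np.random.default_rng(0)
m, n, k = 40, 30, 5          # terms, documents, topics
D = rng.random((m, n))
U = rng.random((m, k))
V = rng.random((k, n))
lam1, lam2 = 0.1, 0.1

def loss(U, V):
    return (np.linalg.norm(D - U @ V) ** 2
            + lam1 * np.abs(U).sum()
            + lam2 * np.linalg.norm(V) ** 2)

before = loss(U, V)
for _ in range(50):
    # V-step: ridge regression, closed form. Each document's column of V
    # is an independent subproblem (hence parallelizable, e.g. MapReduce).
    V = np.linalg.solve(U.T @ U + lam2 * np.eye(k), U.T @ D)
    # U-step: one proximal-gradient (ISTA) step; soft-thresholding keeps
    # topics sparse. Each term's row of U is independent too.
    G = 2.0 * (U @ V - D) @ V.T
    step = 0.9 / (2.0 * np.linalg.norm(V @ V.T, 2))   # < 1/Lipschitz
    U = U - step * G
    U = np.sign(U) * np.maximum(np.abs(U) - step * lam1, 0.0)
after = loss(U, V)
```

The V-step exactly minimizes its subproblem and the ISTA step is a descent step, so the objective decreases monotonically; the per-column and per-row independence is what the paper exploits for parallelization.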


Li, Xiang, Xu, Jun, Ren, Yangchun and Geng, Weidong (2010): Animating cartoon faces by multi-view drawings. In Journal of Visualization and Computer Animation, 21 (3) pp. 193-201. Available online


Liu, Lin, Zhang, Hongyu, Peng, Fei, Ma, Wenting, Shan, Yuhui, Xu, Jun and Burda, Tomas (2009): Understanding Chinese Characteristics of Requirements Engineering. In: RE 2009, 17th IEEE International Requirements Engineering Conference, Atlanta, Georgia, USA, August 31 - September 4, 2009 2009. pp. 261-266. Available online


Xu, Jun, Liu, Tie-Yan, Lu, Min, Li, Hang and Ma, Wei-Ying (2008): Directly optimizing evaluation measures in learning to rank. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2008. pp. 107-114. Available online

One of the central issues in learning to rank for information retrieval is to develop algorithms that construct ranking models by directly optimizing evaluation measures used in information retrieval such as Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). Several such algorithms including SVMmap and AdaRank have been proposed and their effectiveness has been verified. However, the relationships between the algorithms are not clear, and furthermore no comparisons have been conducted between them. In this paper, we conduct a study on the approach of directly optimizing evaluation measures in learning to rank for Information Retrieval (IR). We focus on the methods that minimize loss functions upper bounding the basic loss function defined on the IR measures. We first provide a general framework for the study and analyze the existing algorithms of SVMmap and AdaRank within the framework. The framework is based on upper bound analysis and two types of upper bounds are discussed. Moreover, we show that we can derive new algorithms on the basis of this analysis and create one example algorithm called PermuRank. We have also conducted comparisons between SVMmap, AdaRank, PermuRank, and conventional methods of Ranking SVM and RankBoost, using benchmark datasets. Experimental results show that the methods based on direct optimization of evaluation measures can always outperform conventional methods of Ranking SVM and RankBoost. However, no significant difference exists among the performances of the direct optimization methods themselves.

© All rights reserved Xu et al. and/or ACM Press
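The two evaluation measures the abstract targets, MAP and NDCG, are standard and easy to compute. A minimal per-query implementation (MAP uses binary relevance, NDCG graded relevance):

```python
import math

# Minimal implementations of the two IR evaluation measures named in
# the abstract: (Mean) Average Precision and NDCG, computed per query.

def average_precision(ranked_rels):
    """ranked_rels: 0/1 relevance labels in ranked order."""
    hits, total = 0, 0.0
    for i, r in enumerate(ranked_rels, start=1):
        if r:
            hits += 1
            total += hits / i          # precision at each relevant hit
    return total / max(sum(ranked_rels), 1)

def ndcg(ranked_gains, k=None):
    """ranked_gains: graded relevance labels in ranked order; the ideal
    ranking used for normalization is the labels sorted descending."""
    def dcg(gains):
        return sum((2 ** g - 1) / math.log2(i + 1)
                   for i, g in enumerate(gains, start=1))
    k = k if k is not None else len(ranked_gains)
    ideal = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0

ap = average_precision([1, 0, 1, 0])   # (1/1 + 2/3) / 2 = 0.8333...
nd = ndcg([2, 0, 1])                   # 3.5 / (3 + 1/log2(3)) = 0.9639...
```

Both measures depend on the full ranked order rather than on individual pairs, which is why directly optimizing them (the subject of the paper) is harder than minimizing pairwise classification losses.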


Ni, Weijian, Xu, Jun, Li, Hang and Huang, Yalou (2008): Group-based learning: a boosting approach. In: Shanahan, James G., Amer-Yahia, Sihem, Manolescu, Ioana, Zhang, Yi, Evans, David A., Kolcz, Aleksander, Choi, Key-Sun and Chowdhury, Abdur (eds.) Proceedings of the 17th ACM Conference on Information and Knowledge Management - CIKM 2008 October 26-30, 2008, Napa Valley, California, USA. pp. 1443-1444. Available online


Xu, Jun and Li, Hang (2007): AdaRank: a boosting algorithm for information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 391-398. Available online

In this paper we address the issue of learning to rank for document retrieval. In the task, a model is automatically created with some training data and then is utilized for ranking of documents. The goodness of a model is usually evaluated with performance measures such as MAP (Mean Average Precision) and NDCG (Normalized Discounted Cumulative Gain). Ideally a learning algorithm would train a ranking model that could directly optimize the performance measures with respect to the training data. Existing methods, however, are only able to train ranking models by minimizing loss functions loosely related to the performance measures. For example, Ranking SVM and RankBoost train ranking models by minimizing classification errors on instance pairs. To deal with the problem, we propose a novel learning algorithm within the framework of boosting, which can minimize a loss function directly defined on the performance measures. Our algorithm, referred to as AdaRank, repeatedly constructs 'weak rankers' on the basis of reweighted training data and finally linearly combines the weak rankers for making ranking predictions. We prove that the training process of AdaRank is exactly that of enhancing the performance measure used. Experimental results on four benchmark datasets show that AdaRank significantly outperforms the baseline methods of BM25, Ranking SVM, and RankBoost.

© All rights reserved Xu and Li and/or ACM Press
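The training loop the abstract outlines (reweight queries, fit a weak ranker, combine linearly) can be sketched on toy data. The weak rankers here are single features and the measure is average precision; both choices, and all the data, are illustrative simplifications rather than the paper's exact setup.

```python
import math

# Toy sketch of AdaRank-style boosting over queries.
# Each query: (feature rows, one per document; binary relevance labels).
queries = [
    ([[0.9, 0.1], [0.2, 0.8], [0.4, 0.3]], [1, 0, 0]),
    ([[0.8, 0.1], [0.7, 0.9], [0.3, 0.4]], [0, 1, 0]),
    ([[0.5, 0.2], [0.2, 0.9], [0.6, 0.3]], [1, 1, 0]),
]
n_feat = 2

def avg_prec(scores, labels):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / max(sum(labels), 1)

def rank_scores(X, weights):
    return [sum(w * f for w, f in zip(weights, row)) for row in X]

alphas = [0.0] * n_feat                  # linear combination of weak rankers
P = [1.0 / len(queries)] * len(queries)  # per-query weights
for _ in range(5):
    # Choose the feature that performs best on the weighted queries.
    best = max(range(n_feat), key=lambda j: sum(
        p * avg_prec([row[j] for row in X], y)
        for p, (X, y) in zip(P, queries)))
    # Boosting weight for the chosen weak ranker.
    E = [avg_prec([row[best] for row in X], y) for X, y in queries]
    num = sum(p * (1 + e) for p, e in zip(P, E))
    den = sum(p * (1 - e) for p, e in zip(P, E))
    alphas[best] += 0.5 * math.log(num / max(den, 1e-12))
    # Re-weight: queries the combined ranker still handles poorly
    # (low measure) get more weight in the next round.
    Ef = [avg_prec(rank_scores(X, alphas), y) for X, y in queries]
    Z = sum(math.exp(-e) for e in Ef)
    P = [math.exp(-e) / Z for e in Ef]

final = sum(avg_prec(rank_scores(X, alphas), y)
            for X, y in queries) / len(queries)
```

The key difference from AdaBoost is that the reweighting is driven directly by the IR measure on each query, not by classification error on instances.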


Xu, Jun and Quaddus, Mohammed (2007): Exploring the Factors Influencing End Users' Acceptance of Knowledge Management Systems: Development of a Research Model of Adoption and Continued Use. In JOEUC, 19 (4) pp. 57-79. Available online


Xu, Jun, Cao, Yunbo, Li, Hang, Craswell, Nick and Huang, Yalou (2007): Searching Documents Based on Relevance and Type. In: Amati, Giambattista, Carpineto, Claudio and Romano, Giovanni (eds.) Advances in Information Retrieval - 29th European Conference on IR Research - ECIR 2007 April 2-5, 2007, Rome, Italy. pp. 629-636. Available online


Cao, Yunbo, Xu, Jun, Liu, Tie-Yan, Li, Hang, Huang, Yalou and Hon, Hsiao-Wuen (2006): Adapting ranking SVM to document retrieval. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2006. pp. 186-193. Available online

The paper is concerned with applying learning to rank to document retrieval. Ranking SVM is a typical method of learning to rank. We point out that there are two factors one must consider when applying Ranking SVM, in general a "learning to rank" method, to document retrieval. First, correctly ranking documents on the top of the result list is crucial for an Information Retrieval system. One must conduct training in a way that such ranked results are accurate. Second, the number of relevant documents can vary from query to query. One must avoid training a model biased toward queries with a large number of relevant documents. Previously, when existing methods, including Ranking SVM, were applied to document retrieval, neither of the two factors was taken into consideration. We show it is possible to make modifications in conventional Ranking SVM, so it can be better used for document retrieval. Specifically, we modify the "Hinge Loss" function in Ranking SVM to deal with the problems described above. We employ two methods to conduct optimization on the loss function: gradient descent and quadratic programming. Experimental results show that our method, referred to as Ranking SVM for IR, can outperform the conventional Ranking SVM and other existing methods for document retrieval on two datasets.

© All rights reserved Cao et al. and/or ACM Press
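The pairwise hinge loss underlying Ranking SVM can be sketched as follows. The per-query normalization here is one simple way to keep queries with many relevant documents from dominating the objective; it illustrates the kind of bias the paper addresses, not the paper's exact modification, and the data is synthetic.

```python
import numpy as np

# Pairwise hinge loss behind Ranking SVM: for each query, every
# (relevant, irrelevant) pair (x+, x-) should satisfy w.x+ >= w.x- + 1.

def pairwise_hinge(w, queries):
    """queries: list of (X, y), X = doc feature rows, y = 0/1 labels.
    Returns (loss, subgradient) with per-query pair normalization."""
    loss, grad = 0.0, np.zeros_like(w)
    for X, y in queries:
        pos, neg = X[y == 1], X[y == 0]
        pairs = len(pos) * len(neg)
        if pairs == 0:
            continue
        for xp in pos:
            for xn in neg:
                margin = w @ xp - w @ xn
                if margin < 1.0:
                    loss += (1.0 - margin) / pairs
                    grad += (xn - xp) / pairs
    return loss, grad

# Synthetic data where relevance roughly follows the first feature.
rng = np.random.default_rng(1)
queries = []
for _ in range(4):
    X = rng.normal(size=(6, 3))
    y = (X[:, 0] + 0.1 * rng.normal(size=6) > 0).astype(int)
    queries.append((X, y))

w = np.zeros(3)
start_loss, _ = pairwise_hinge(w, queries)
for _ in range(200):
    _, g = pairwise_hinge(w, queries)
    w -= 0.1 * g                       # subgradient descent
end_loss, _ = pairwise_hinge(w, queries)
```

Dividing by the pair count gives every query equal total weight regardless of how many relevant documents it has; the paper's modification additionally emphasizes correctness at the top of the ranked list.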


Li, Hang, Cao, Yunbo, Xu, Jun, Hu, Yunhua, Li, Shenjie and Meyerzon, Dmitriy (2005): A new approach to intranet search based on information extraction. In: Herzog, Otthein, Schek, Hans-Jorg and Fuhr, Norbert (eds.) Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management October 31 - November 5, 2005, Bremen, Germany. pp. 460-468. Available online


Quaddus, Mohammed, Xu, Jun and Hoque, Zohurul (2005): Factors of adoption of online auction: a China study. In: Li, Qi and Liang, Ting-Peng (eds.) Proceedings of the 7th International Conference on Electronic Commerce - ICEC 2005 August 15-17, 2005, Xian, China. pp. 93-100. Available online


Xu, Jun, Cao, Yunbo, Li, Hang and Zhao, Min (2005): Ranking definitions with supervised learning methods. In: Proceedings of the 2005 International Conference on the World Wide Web 2005. pp. 811-819. Available online

This paper is concerned with the problem of definition search. Specifically, given a term, we are to retrieve definitional excerpts of the term and rank the extracted excerpts according to their likelihood of being good definitions. This is in contrast to the traditional approaches of either generating a single combined definition or simply outputting all retrieved definitions. Definition ranking is essential for the task. Methods for performing definition ranking are proposed in this paper, which formalize the problem as either classification or ordinal regression. A specification for judging the goodness of a definition is given. We employ SVM as the classification model and Ranking SVM as the ordinal regression model respectively, such that they rank definition candidates according to their likelihood of being good definitions. Features for constructing the SVM and Ranking SVM models are defined. An enterprise search system based on this method has been developed and has been put into practical use. Experimental results indicate that the use of SVM and Ranking SVM can significantly outperform the baseline methods of using heuristic rules or employing the conventional information retrieval method of Okapi. This is true both when the answers are paragraphs and when they are sentences. Experimental results also show that SVM or Ranking SVM models trained in one domain can be adapted to another domain, indicating that generic models for definition ranking can be constructed.

© All rights reserved Xu et al. and/or ACM Press

