Yiming Yang
Publications by Yiming Yang (bibliography)
» 2008 «
Harpale, Abhay S. and Yang, Yiming (2008): Personalized active learning for collaborative filtering. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2008. pp. 91-98. Available online
Collaborative Filtering (CF) requires user-rated training examples for statistical inference about the preferences of new users. Active learning strategies identify the most informative set of training examples through minimum interactions with the users. Current active learning approaches in CF make an implicit and unrealistic assumption that a user can provide a rating for any queried item. This paper introduces a new approach to the problem which does not make such an assumption. We personalize active learning for the user, and query only for those items which the user can provide a rating for. We propose an extended form of Bayesian active learning and use the Aspect Model for CF to illustrate and examine the idea. A comparative evaluation of the new method and a well-established baseline method on benchmark datasets shows statistically significant improvements with our method over the performance of the baseline method, which is representative of existing approaches that do not take personalization into account.
Copyrights may apply
Rogati, Monica, Yang, Yiming and Carbonell, Jaime G. (2008): Corpus microsurgery: criteria optimization for medical cross-language IR. In: Shanahan, James G., Amer-Yahia, Sihem, Manolescu, Ioana, Zhang, Yi, Evans, David A., Kolcz, Aleksander, Choi, Key-Sun and Chowdhury, Abdur (eds.) Proceedings of the 17th ACM Conference on Information and Knowledge Management - CIKM 2008 October 26-30, 2008, Napa Valley, California, USA. pp. 1365-1366. Available online
» 2007 «
Yang, Yiming, Lad, Abhimanyu, Lao, Ni, Harpale, Abhay, Kisiel, Bryan and Rogati, Monica (2007): Utility-based information distillation over temporally sequenced documents. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007. pp. 31-38. Available online
This paper examines a new approach to information distillation over temporally ordered documents, and proposes a novel evaluation scheme for such a framework. It combines the strengths of and extends beyond conventional adaptive filtering, novelty detection and non-redundant passage ranking with respect to long-lasting information needs ("tasks" with multiple queries). Our approach supports fine-grained user feedback via highlighting of arbitrary spans of text, and leverages such information for utility optimization in adaptive settings. For our experiments, we defined hypothetical tasks based on news events in the TDT4 corpus, with multiple queries per task. Answer keys (nuggets) were generated for each query and a semi-automatic procedure was used for acquiring rules that allow automatically matching nuggets against system responses. We also propose an extension of the NDCG metric for assessing the utility of ranked passages as a combination of relevance and novelty. Our results show encouraging utility enhancements using the new approach, compared to the baseline systems without incremental learning or the novelty detection components.
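As background for the NDCG extension mentioned in this abstract, the following is a minimal sketch of the standard NDCG metric only. It does not implement the paper's relevance-plus-novelty extension, and the function names `dcg` and `ndcg` as well as the log2 discount are assumptions chosen for illustration.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Gains are listed in the order the system ranked the passages.
perfect = ndcg([3, 2, 1])   # already in ideal order
reversed_ = ndcg([1, 2, 3])  # best passage ranked last
```

A ranking that places the highest-gain passages first scores higher than one that buries them, which is the property any utility-oriented extension would build on.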
Lad, Abhimanyu and Yang, Yiming (2007): Generalizing from relevance feedback using named entity wildcards. In: Silva, Mario J., Laender, Alberto H. F., Baeza-Yates, Ricardo A., McGuinness, Deborah L., Olstad, Bjørn, Olsen, Øystein Haug and Falcão, André O. (eds.) Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management - CIKM 2007 November 6-10, 2007, Lisbon, Portugal. pp. 721-730. Available online
» 2005 «
Yang, Yiming, Yoo, Shinjae, Zhang, Jian and Kisiel, Bryan (2005): Robustness of adaptive filtering methods in a cross-benchmark evaluation. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2005. pp. 98-105. Available online
This paper reports a cross-benchmark evaluation of regularized logistic regression (LR) and incremental Rocchio for adaptive filtering. Using four corpora from the Topic Detection and Tracking (TDT) forum and the Text Retrieval Conferences (TREC), we evaluated these methods with non-stationary topics at various granularity levels, and measured performance with different utility settings. We found that LR performs strongly and robustly in optimizing T11SU (a TREC utility function) while Rocchio is better for optimizing Ctrk (the TDT tracking cost), a high-recall oriented objective function. Using systematic cross-corpus parameter optimization with both methods, we obtained the best results ever reported on TDT5, TREC10 and TREC11.
Li, Fan and Yang, Yiming (2005): Analysis of recursive feature elimination methods. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2005. pp. 633-634. Available online
Liu, Tie-Yan, Yang, Yiming, Wan, Hao, Zhou, Qian, Gao, Bin, Zeng, Hua-Jun, Chen, Zheng and Ma, Wei-Ying (2005): An experimental study on large-scale web categorization. In: Proceedings of the 2005 International Conference on the World Wide Web 2005. pp. 1106-1107. Available online
Taxonomies of the Web typically have hundreds of thousands of categories and skewed category distribution over documents. It is not clear whether existing text classification technologies can perform well on and scale up to such large-scale applications. To understand this, we conducted the evaluation of several representative methods (Support Vector Machines, k-Nearest Neighbor and Naive Bayes) with Yahoo! taxonomies. In particular, we evaluated the effectiveness/efficiency tradeoff in classifiers with hierarchical setting compared to conventional (flat) setting, and tested popular threshold tuning strategies for their scalability and accuracy in large-scale classification problems.
» 2004 «
Rogati, Monica and Yang, Yiming (2004): Resource selection for domain-specific cross-lingual IR. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2004. pp. 154-161. Available online
An under-explored question in cross-language information retrieval (CLIR) is to what degree the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods -- with different training corpora -- on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN), versus a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We start exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. By using cosine similarity between training and target corpora as resource weights we obtained an average of 5.6% improvement over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
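The resource-weighting criterion described in this abstract can be illustrated with a minimal sketch. It assumes bag-of-words term counts as the corpus representation and normalizes the similarities to sum to 1; both choices, along with the names `term_counts`, `cosine` and `resource_weights`, are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def term_counts(texts):
    """Bag-of-words term counts for a corpus given as a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def resource_weights(resources, target):
    """Weight each training resource by its similarity to the target corpus.

    Normalizing the weights to sum to 1 is an assumption of this sketch.
    """
    target_vec = term_counts(target)
    sims = {name: cosine(term_counts(texts), target_vec)
            for name, texts in resources.items()}
    total = sum(sims.values())
    if total == 0:
        # No overlap with any resource: fall back to uniform weights.
        return {name: 1.0 / len(sims) for name in sims}
    return {name: s / total for name, s in sims.items()}

# Toy corpora, invented for illustration only.
resources = {
    "medical": ["patient heart surgery", "heart disease treatment"],
    "news": ["election vote senate"],
}
weights = resource_weights(resources, ["heart surgery patient"])
```

In this toy setup the medical resource receives the larger weight because it shares vocabulary with the target collection, while the news resource shares none.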
» 2003 «
Yang, Yiming, Zhang, Jian and Kisiel, Bryan (2003): A scalability analysis of classifiers in text categorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2003. pp. 96-103. Available online
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor, ridge regression, linear least square fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation on the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we are able to obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported on, as concrete examples.
Zhang, Jian and Yang, Yiming (2003): Robustness of regularized linear classification methods in text categorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2003. pp. 190-197. Available online
Real-world applications often require the classification of documents in situations with a small number of features, mislabeled documents and rare positive examples. This paper investigates the robustness of three regularized linear classification methods (SVM, ridge regression and logistic regression) under these situations. We compare these methods in terms of their loss functions and score distributions, and establish the connection between their optimization problems and generalization error bounds. Several sets of controlled experiments on the Reuters-21578 corpus are conducted to investigate the robustness of these methods. Our results show that ridge regression seems to be the most promising candidate for rare class problems.
Yang, Yiming and Kisiel, Bryan (2003): Margin-based local regression for adaptive filtering. In: Proceedings of the 2003 ACM CIKM International Conference on Information and Knowledge Management November 2-8, 2003, New Orleans, Louisiana, USA. pp. 191-198. Available online
» 2002 «
Liu, Yan, Yang, Yiming and Carbonell, Jaime G. (2002): Boosting to correct inductive bias in text classification. In: Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management November 4-9, 2002, McLean, VA, USA. pp. 348-355. Available online
Rogati, Monica and Yang, Yiming (2002): High-performing feature selection for text classification. In: Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management November 4-9, 2002, McLean, VA, USA. pp. 659-661. Available online
» 2001 «
Yang, Yiming (2001): A study of thresholding strategies for text categorization. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2001. pp. 137-145. Available online
Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the test bed, three common thresholding methods were investigated, including rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut); in addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly.
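To make two of the thresholding strategies named in this abstract concrete, here is a minimal sketch of rank-based and score-based assignment. It assumes classifier scores arrive as one category-to-score dictionary per document, and it only applies given thresholds rather than tuning them; the function names `rcut` and `scut` and the data layout are assumptions for illustration, not the paper's implementation.

```python
def rcut(scores, t=1):
    """Rank-based cut: for each document, assign its t highest-scoring categories."""
    decisions = []
    for doc_scores in scores:
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
        decisions.append(set(ranked[:t]))
    return decisions

def scut(scores, thresholds):
    """Score-based cut: assign a category when its score reaches that category's threshold."""
    return [
        {cat for cat, s in doc_scores.items() if s >= thresholds[cat]}
        for doc_scores in scores
    ]

# Toy scores for two documents, invented for illustration only.
scores = [
    {"sports": 0.9, "politics": 0.4, "health": 0.1},
    {"sports": 0.2, "politics": 0.3, "health": 0.25},
]
by_rank = rcut(scores, t=1)
by_score = scut(scores, {"sports": 0.5, "politics": 0.35, "health": 0.2})
```

The rank-based cut always assigns exactly t categories per document, which is why it suits online decisions but is coarse-grained. The score-based cut can assign none or several, with per-category thresholds that would normally be tuned on held-out data.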
» 2000 «
Yang, Yiming, Ault, Tom, Pierce, Thomas and Lattimer, Charles W. (2000): Improving Text Categorization Methods for Event Tracking. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2000. pp. 65-72. Available online
» 1999 «
Yang, Yiming and Liu, Xin (1999): A Re-Examination of Text Categorization Methods. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 1999. pp. 42-49. Available online
» 1998 «
Yang, Yiming, Pierce, Tom and Carbonell, Jaime (1998): A Study on Retrospective and On-Line Event Detection. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 1998. pp. 28-36. Available online
This paper investigates the use and extension of text retrieval and clustering techniques for event detection. The task is to automatically detect novel events from a temporally-ordered stream of news stories, either retrospectively or as the stories arrive. We applied hierarchical and non-hierarchical document clustering algorithms to a corpus of 15,836 stories, focusing on the exploitation of both content and temporal information. We found the resulting cluster hierarchies highly informative for retrospective detection of previously unidentified events, effectively supporting both query-free and query-driven retrieval. We also found that temporal distribution patterns of document clusters provide useful information for improvement in both retrospective detection and on-line detection of novel events. In an evaluation using manually labelled events to judge the system-detected events, we obtained a result of 82% in the F1 measure for retrospective detection, and an F1 value of 42% for on-line detection.
» 1997 «
Carbonell, Jaime and Yang, Yiming (1997): Crosslingual Information Retrieval. In: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 1997. p. 346.
Crosslingual Information Retrieval (aka "translingual" or "multilingual" IR) is a rapidly growing area of IR, driven in part by the ease of information access across national and linguistic boundaries afforded by the internet and the web. The 1996 crosslingual (CIR) SIGIR workshop helped establish this new field, and there has been considerable progress since then in the context of TREC and in a number of new CIR techniques and comparative evaluations. This workshop offers a forum for discussion of developments and emerging issues in CIR. In particular, we expect to address:
* New methods for CIR (beyond dictionary-based query translation)
* The role of query expansion in CIR
* The role of bilingual corpora in CIR
* Can MT help in CIR, and if so how?
* How should CIR performance be evaluated?
* Can we set some common benchmarks and/or corpora?
* What message(s) should we carry to TREC wrt CIR?
* What are the greatest challenges for CIR?
» 1996 «
Yang, Yiming and Wilbur, John (1996): Using Corpus Statistics to Remove Redundant Words in Text Categorization. In JASIST - Journal of the American Society for Information Science and Technology, 47 (5) pp. 357-369
» 1995 «
Yang, Yiming (1995): Noise Reduction in a Statistical Approach to Text Categorization. In: Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 1995. pp. 256-263. Available online
This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of "non-informative words" from texts before training; the use of a truncated singular value decomposition to cut off noisy "latent semantic structures" during training; the elimination of non-influential components in the LLSF solution (a word-concept association matrix) after training. Text collections in different domains were used for evaluation. Significant improvements in computational efficiency without losing categorization accuracy were evident in the testing results.
» 1994 «
Yang, Yiming and Chute, Christopher G. (1994): An Example-Based Mapping Method for Text Categorization and Retrieval. In ACM Transactions on Information Systems, 12 (3) pp. 252-277
A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used for studies on the effectiveness of our approach, and on how much the effectiveness depends on the choices of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these data collections for comparison. It is evident that the LLSF approach uses the relevance information effectively within human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval, compared to alternative approaches.
Yang, Yiming (1994): Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In: Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 1994. pp. 13-22. Available online
Expert Network (ExpNet) is our new approach to automatic categorization and retrieval of natural language texts. We use a training set of texts with expert-assigned categories to construct a network which approximately reflects the conditional probabilities of categories given a text. The input nodes of the network are words in the training texts, the nodes on the intermediate level are the training texts, and the output nodes are categories. The links between nodes are computed based on statistics of the word distribution and the category distribution over the training set. ExpNet is used for relevance ranking of candidate categories of an arbitrary text in the case of text categorization, and for relevance ranking of documents via categories in the case of text retrieval. We have evaluated ExpNet in categorization and retrieval on a document collection of the MEDLINE database, and observed a performance in recall and precision comparable to the Linear Least Squares Fit (LLSF) mapping method, and significantly better than other methods tested. Computationally, ExpNet has an O(NlogN) time complexity which is much more efficient than the cubic complexity of the LLSF method. The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
» 1993 «
Yang, Yiming and Chute, Christopher G. (1993): An Application of Least Squares Fit Mapping to Text Information Retrieval. In: Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 1993. pp. 281-290. Available online
This paper describes a unique example-based mapping method for document retrieval. We discovered that the knowledge about relevance among queries and documents can be used to obtain empirical connections between query terms and the canonical concepts which are used for indexing the content of documents. These connections do not depend on whether there are shared terms among the queries and documents; therefore, they are especially effective for a mapping from queries to the documents where the concepts are relevant but the terms used by article authors happen to be different from the terms of database users. We employ a Linear Least Squares Fit (LLSF) technique to compute such connections from a collection of queries and documents where the relevance is assigned by humans, and then use these connections in the retrieval of documents where the relevance is unknown. We tested this method on both retrieval and indexing with a set of MEDLINE documents which has been used by other information retrieval systems for evaluations. The effectiveness of the LLSF mapping and the significant improvement over alternative approaches was evident in the tests.