The World Wide Web Conference is the global event that brings together key researchers, innovators, decision-makers, technologists, businesses, and standards bodies working to shape the Web. Organized by IW3C2 since 1994, the WWW conference is the annual opportunity for the International community to discuss and debate the evolution of the Web.
The key property of the WWWW is its universality: One must be able to access it whatever the hardware device, software platform, and network one is using, and despite the disabilities one might have, and whether oner is in a "developed" or "developing" country; it must support information of any language, culture, quality, medium, and field without discrimination so that a hypertext link can go anywhere; it must support information intended for people, and that intended for machine processing. The Web architecture incorporated various choices which support these axes of universality. Currently the architecture and the principles are being exploited in the recent Mobile Web initiative in W3C to promote content which can be accessed optimally from conventional computers and mobile devices. New exciting areas arise every few months as possible Semantic Web flagship applications. As new areas burst forth, the fundamental principles remain important and are extended and adjusted. At the same time, the principles of openness and consensus among international stakeholders which the WWW consortium employs for new technology are adjusted, but ever-important.
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
Describing XML Namespaces is an open issue for many users of XML technologies, and even though namespaces are one of the foundations of XML, there is no generally accepted and widely used format for namespace descriptions. We present a framework for describing namespaces based on GRDDL using a controlled vocabulary. Using this frame-work, namespace descriptions can be easily generated, harvested and published in human- or machine-readable form.
In this short paper we introduce Helios, a flexible and efficient open source meta-search engine. Helios currently runs on the top of 18 search engines (in Web, Books, News, and Academic publication domains), but additional search engines can be easily plugged in. We also report some performance mesured during its development.
Proportional slowdown differentiation (PSD) aims to maintain slowdown ratios between different classes of clients according to their pre-specified differentiation parameters. In this paper, we design a feedback controller to allocate processing rate on Internet servers for PSD. In this approach, the processing rate of a class is adjusted by an integral feedback controller according to the difference between the target slowdown ratio and the achieved one. The initial rate class is estimated based on predicted workload using queueing theory. We implement the feedback controller in an Apache Web server. The experimental results under various environments demonstrate the controller's effectiveness and robustness.
In this paper, we propose a Web based agent system called MiSpider, which provides intelligent web services on web browsers. MiSpider enables users to use agents on existing browsers. Users can use MiSpider all over the world only to access the Internet. MiSpider Agent has persistency, and agents condition doesn't change if users change a browsing page. Moreover, agents have a message passing skill to communicate among the agents.
While several hierarchical classification methods have been applied to web content, such techniques invariably rely on a pre-defined taxonomy of documents. We propose a new technique that extracts a suitable hierarchical structure automatically from a corpus of labeled documents. We show that our technique groups similar classes closer together in the tree and discovers relationships among documents that are not encoded in the class labels. The learned taxonomy is then used along with binary SVMs for multi-class classification. We demonstrate the efficacy of our approach by testing it on the 20-Newsgroup dataset.
In this paper, we propose a web browsing support system, called WPM, which provides marking and anchoring functions on ordinary web browsers. WPM users can mark words and phrases on web pages by using their browsers without any extra plug-ins like similar systems, and can anchor words to refer them later. WPM makes it possible to carry out marking to the existing Web page so that marking carried out to paper. By changing character decoration partially, the text is indicated by emphasis and improve readability. WPM is implemented using proxy agent. This system can be used in everyday browsing, without a user being conscious of a system by using a proxy.
We describe a simple approach to "point-in-time" file sharing based on time expiring web links and personal webservers. This approach to file sharing is useful in environments where instant messaging clients are varied and don't necessarily support (compatible) file transfer protocols. We discuss the features of such an approach along with a successfully deployed implementation now in wide use throughout the IBM corporation.
Some recent initiatives try to take profit from RDF to make XML documents interoperate at the semantic level. Ontologies are used to establish semantic connections among XML languages, and some mechanisms have been defined to query them with natural XML query languages like XPath and XML Query. Generally structure-mapping approaches define a simple translation between trivial XPath expressions and some RDF query language like RDQL; however some XPath constructs cannot be covered in a structure-mapping strategy. In contrast, our work takes the model-mapping approach, respectful with node order, that allows mapping all XPath axis. The obtained XPath implementation has the properties of schema-awareness and IDREF-awareness, so it can be used to exploit inheritance hierarchies defined in one or more XML schemas.
Being able to determine the provenience of statements is a fundamental step in any SW trust modeling. We propose a methodology that allows signing of small groups of RDF statements. Groups of statements signed with this methodology can be safely inserted into any existing triple store without the loss of provenance information since only standard RDF semantics and constructs are used. This methodology has been implemented and is both available as open source library and deployed in a SW P2P project.
The semantic web is expected to have an impact at least as big as that of the existing HTML based web, if not greater. However, the challenge lays in creating this semantic web and in converting existing web information into the semantic paradigm. One of the core technologies that can help in migration process is automatic markup, the semantic markup of content, providing the semantic tags to describe the raw content. This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labor by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (Trainable Extraction Grammar), can be adapted to any IE domain by writing a suitable set of rules in a SCFG (Stochastic Context Free Grammar) based extraction language, and training them using an annotated corpus. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amount of training data. We also demonstrate the robustness of our system under conditions of poor training data quality. This makes the system very suitable for converting legacy web pages to semantic web pages.
We describe GalaTex, the first complete implementation of XQuery Full-Text, a W3C specification that extends XPath 2.0 and XQuery 1.0 with full-text search. XQuery Full-Text provides composable full-text search primitives such as keyword search, Boolean queries, and keyword-distance predicates. GalaTex is intended to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research problems such as scoring full-text query results, optimizing XML queries over both structure and text, and evaluating top-k queries on scored results. GalaTex is an all-XQuery implementation initially focused on completeness and conformance rather than on efficiency. We describe its implementation on top of Galax, a complete XQuery implementation.
How do people decide which health websites to trust and which to reject? Thirteen participants all diagnosed with hypertension were invited to search for information and advice relating to hypertension. Participants took part in a four-week study engaging in both free and directed web searches. A content analysis of the group discussions revealed support for a staged model of trust in which mistrust or rejection of websites is based on design factors and trust or selection of websites is based on content factors such as source credibility and personalization. A number of guidelines for developing trust in health websites are proposed.
Previous work on structural joins mostly focuses on maintaining offline indexes on disks. Most of them also require the elements in both sets to be sorted. In this paper, we study an on-the-fly, in-memory indexing approach to structural joins. There is no need to sort the elements or maintain indexes on disks. We identify the similarity between the structural join problem and the stabbing query problem, and extend a main memory-based indexing technique for stabbing queries to structural joins.
Hyperlinks are an essential feature of the World Wide Web, highly responsible for its success. XLink improves on HTML's linking capabilities in several ways. In particular, links after XLink can be "out-of-line" (i.e., not defined at a link source) and collected in (possibly several) linkbases, which considerably ease building complex link structures. Modeling of link structures and processing of linkbases under the Web's "open world linking" are aspects neglected by XLink. Adding a notion of "interface" to XLink, as suggested in this work, considerably improves modeling of link structures. When a link structure is traversed, the relevant linkbase(s) might become ambiguous. We suggest three linkbase management modes governing the binding of a linkbase to a document to resolve this ambiguity.
Term weighting scheme, which has been used to convert the documents as vectors in the term space, is a vital step in automatic text categorization. In this paper, we conducted comprehensive experiments to compare various term weighting schemes with SVM on two widely-used benchmark data sets. We also presented a new term weighting scheme tf-rf to improve the term's discriminating power. The controlled experimental results showed that this newly proposed tf-rf scheme is significantly better than other widely-used term weighting schemes. Compared with schemes related with tf factor alone, the idf factor does not improve or even decrease the term's discriminating power for text categorization.
This paper proposes a model for short-term content adaptation whose aim is to satisfy the contingent needs of users by adjusting the information a web-application provides on the basis of a short-term user profile. The mathematical model results in the design of an adaptive filter that profiles users by observing their queries to the application and that adjusts the answers of the application according to the inferred user needs. Also, the mathematical model ensures the correctness of the filter, that is, the filter is guaranteed to exhibit a coherent short-term adaptive behaviour.
Today's Virtual Environment (VE) systems share a number of issues with the HTML-based World Wide Web. Their content is usually designed for presentation to humans, and thus is not suitable for machine access. This is complicated by the large number of different data models and network protocols in use. Accordingly, it is difficult to develop VE software, such as agents, services, and tools. In this paper we adopt the Semantic Web idea to the field of virtual environments. Using the Resource Description Framework (RDF) we establish a machine-understandable abstraction of existing VE systems -- the Semantic Virtual Environments (SVE). On this basis it is possible to develop system-independent software, which could even operate across VE system boundaries.
Feature models are widely used in domain engineering to capture common and variant features among systems in a particular domain. However, the lack of a widely-adopted means of precisely representing and formally verifying feature models has hindered the development of this area. This paper presents an approach to modeling and verifying feature diagrams using Semantic Web ontologies.
Ontology mapping is the task of finding semantic relationships between entities (i.e. concept, attribute and relation) of two ontologies. In the existing literatures, many (semi-)automatic approaches have found considerable interest by combining several mapping strategies (namely multi-strategy mapping). However, experiments show that multi-strategy based mapping does not always outperform its single-strategy counterpart. We here mainly consider the following questions: For a new, unseen mapping task, should one use a multi-strategy or a single-strategy? And if the task is suitable for multi-strategy, then which strategies should be selected in the combined scenario? This paper proposes an approach of multiple strategies detection for ontology mapping. The results obtained so far show that multi-strategy detection improves both on precision and recall significantly.
Some work showed that segmenting web pages into "semantic independent" blocks could help to improve the whole page retrieval. One key and unexplored issue is how to combine the block importance and relevance to a given query. In this poster, we first propose an automatic way to measure block importance to improve retrieval. After that, user information need is also concerned to refine block importance for different users.
Autonomics or self-reorganization becomes pertinent for web-sites serving a large number of users with highly varying workloads. An important component of self-adaptation is to model the behaviour of users and adapt accordingly. This paper proposes a learning-automata based technique for model discovery. User access patterns are used to construct an FSM model of user behaviour that in turn is used for prediction and prefetching. The proposed technique uses a generalization algorithm to classify behaviour patterns into a small number of generalized classes. It has been tested on both synthetic and live data-sets and has shown a prediction hit-rate of up to 89% on a real web-site.
Web based applications offer a mainstream channel for businesses to manage their activities. We model such business activity in a grammar-based framework. The Backus Naur form notation is used to represent the syntax of a regular grammar corresponding to Web log patterns of interest. Then, a deterministic finite state machine is used to parse Web logs against the grammar. Detected tasks are associated with metadata such as time taken to perform the activity, and aggregated along relevant corporate dimensions.
The correctness of the Z semantics of OWL is the theoretical foundation of using software engineering techniques to verify Web ontologies. As OWL and Z are based on different logical systems, we use institutions to represent their underlying logical systems and use institution morphisms to prove the correctness of the Z semantics for OWL DL.
In this paper, we propose a simple framework to characterize the switching behavior between search engines based on click streams. We segment users into a number of categories based on their search engine usage during two adjacent time periods and construct the transition probability matrix across these usage categories. The principal eigenvector of the transposed transition probability matrix represents the limiting probabilities, which are proportions of users in each usage category at steady state. We experiment with this framework using click streams focusing on two search engines: one with a large market share and the other with a small market share. The results offer interesting insights into search engine switching. The limiting probabilities provide empirical evidence that small engines can still retain its fair share of users over time.
We evaluate three different relevance feedback (RF)algorithms, Rocchio, Robertson/Sparck-Jones (RSJ) and Bayesian, in the context of Web search. We use a target-testing experimental procedure whereby a user must locate a specific document. For user relevance feedback, we consider all possible user choices of indicating zero or more relevant documents from a set of 10 displayed documents. Examination of the effects of each user choice permits us to compute an upper-bound on the performance of each RF algorithm. We ind that there is a significant variation in the upper-bound performance o the three RF algorithms and that the Bayesian algorithm approaches the best possible.
In this paper, we present a novel association-based method called SAT-MOD for text classification. SAT-MOD views a sentence rather than a document as a transaction, and uses a novel heuristic called MODFIT to select the most significant itemsets for constructing a category classifier. The effectiveness of SAT-MOD has been demonstrated comparable to well-known alternatives such as LinearSVM and much better than current document-level words association based methods on the Reuters corpus.
A major factor in the effectiveness of the interaction which users have with Web applications is the ease with which they can locate information and functionality which they are seeking. Effective design is however complicated by the multiple design purposes and diverse users which Web applications typically support. In this paper we describe a navigational design method aimed at optimising designs through minimizing navigational entropy. The approach uses a theoretical navigational depth for the various information and service components to moderate a nested hierarchical clustering of the content.
The Adaptive Web is a new research area addressing the personalization of the Web experience for each user. In this paper we propose a new high-level model for the specification of Web applications that take into account the manner users interact with the application for supplying appropriate contents or gathering profile data. We therefore consider entire processes (rather than single properties) as smallest information units, allowing for automatic restructuring of application components. For this purpose, a high-level Event-Condition-Action (ECA) paradigm is proposed, which enables capturing arbitrary (and timed) clicking behaviors.
An approach to detection of phishing webpages based on visual similarity is proposed, which can be utilized as a part of an enterprise solution for anti-phishing. A legitimate webpage owner can use this approach to search the Web for suspicious webpages which are visually similar to the true webpage. A webpage is reported as a phishing suspect if the visual similarity is higher than its corresponding preset threshold. Preliminary experiments show that the approach can successfully detect those phishing webpages for online use.
We examine the difference and similarities between two on-line computer science citation databases DBLP and CiteSeer. The database entries in DBLP are inserted manually while the CiteSeer entries are obtained autonomously. We show that the CiteSeer database contains considerably fewer single author papers. This bias can be modeled by an exponential process with intuitive explanation. The model permits us to predict that the DBLP database covers approximately 30% of the entire literature of Computer Science.
Organizing large document collections for finding information easily and quickly has always been an important user requirement. This paper describes a flexible and powerful dynamic folder technology, called Hubble, which exploits XML semantics to precisely categorize XML documents into categories or folders.
This paper proposes an extension of the XSL-FO standard which allows the specification of an unlimited number of arbitrarily shaped page regions. These extensions are built on top of XSL-FO 1.1 to enable flow content to be laid out into arbitrary shapes and allowing for page layouts currently available only to desktop publishing software. Such a proposal is expected to leverage XSL-FO towards usage as an enabling technology in the generation of content intended for personalized printing.
How to effectively allocate system resource to meet the Service Level Agreement (SLA) of Web servers is a challenging problem. In this paper, we propose an improved scheme for autonomous timing performance control in Web servers under highly dynamic traffic loads. We devise a novel delay regulation technique called Queue Length Model Based Feedback Control utilizing server internal state information to reduce response time variance in presence of bursty traffic. Both simulation and experimental studies using synthesized workloads and real-world Web traces demonstrate the effectiveness of the proposed approach.
Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata -- namely topical directories -- to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.
As more and more Web services are deployed, Web service's discovery mechanisms become essential. Similar services can have quite different QoS behaviors. For service selection and management purpose, it is necessary to clearly specify QoS constraints and metrics definitions for Web services. We investigate on the semantic QoS specification and introduce our design principles on it. Based on the specification refinement and conformance, we introduce the QoS matchmaking algorithm with multiple matching degrees. The matchmaking prototype is designed to prove the feasibility. Well-defined Metrics can be further utilized by measurement organizations to monitor and evaluate the promised service level objectives.
By far, the support vector machines (SVM) achieve the state-of-the-art performance for the text classification (TC) tasks. Due to the complexity of the TC problems, it becomes a challenge to systematically develop classifiers with better performance. We try to attack this problem by ensemble methods, which are often used for boosting weak classifiers, such as decision tree, neural networks, etc., and whether they are effective for strong classifiers is not clear.
An unstructured peer network application was proposed to address the query forwarding problem of distributed search engines and scalability limitations of centralized search engines. Here we present novel techniques to improve local adaptive routing, showing they perform significantly better than a simple learning scheme driven by query response interactions among neighbors. We validate prototypes of our peer network application via simulations with 500 model users based on actual Web crawls. We finally compare the quality of the results with those obtained by centralized search engines, suggesting that our application can draw advantages from the context and coverage of the peer collective.
This paper describes a medium, called Interactive e-Hon, for helping children to understand contents from the Web. It works by transforming electronic contents into an easily understandable "storybook world." In this world, easy-to-understand contents are generated by creating 3D animations that include contents and metaphors, and by using a child-parent model with dialogue expression and a question-answering style comprehensible to children.
In this paper a TV recommender system called AVATAR (AdVAnce Telematic search of Audiovisual contents by semantic Reasoning) is presented. This tool uses the experience gained in the field of the Semantic Web to personalize the TV programs shown to the end users. The main contribution of our system is a process of semantic reasoning carried out on the descriptions of the TV contents -- provided by means of metainformation -- and on the viewer preferences -- contained in personal profiles. Such process allows to diversify the offered suggestions maintaining the personalization, given that the aim is to find contents appealing for the users, which are related semantically to their programs of interest. Here the framework proposed for this reasoning is introduced, by including (i) the OWL ontology chosen to represent the knowledge of our application domain, (ii) the organization of the user profiles, (iii) the query language LIKO, which is intended to browse the ontology and (iv) the semantic relations inferred from the system knowledge base.
WAND is a meta-data management system that provides a file-system tree for users of an internet based P2P network. The tree is robust and retains its structure even when nodes (peers) enter and leave the network. The robustness is based on a concept of virtual folders that are automatically created to retain paths to lower level folders whenever a node hosting a higher-level folder moves away. Other contributions of the WAND system include its novel approach towards managing root directory information and handling network partitions.
Reactivity on the Web is an emerging issue. The capability to automatically react to events (such as updates to Web resources) is essential for both Web services and Semantic Web systems. Such systems need to have the capability to detect and react to complex, real life situations. This presentation gives flavours of the high-level language XChange, for programming reactive behaviour on the Web.
Learning Web contents requires learners not only to navigate the Web pages to construct their own knowledge from the contents learned at and between the pages, but also to control their own navigation and knowledge construction processes. However, it is not so easy to control the learning processes. The main issue addressed is how to help learners learn how to learn with Web contents. This paper discusses how to design a meta-learning tool.
In this paper, we describe a user-centric Internet usage data processing platform. Raw usage data is collected using a software probe installed on a panel of Internet users' workstations. It is then processed by our platform. The transformation of raw usage data into qualified and usable information by Internet usage sociology researchers means setting up a series of relatively complex processes using quite a wide variety of resources. We use a combination of ad hoc rule-based systems and external resources to qualify the visited Web pages. We also implemented topological and temporal indicators in order to describe the dynamics of Web sessions.
Multichannel publication of multimedia presentations poses a significant challenge on the generic description of the presentation content and the system necessary to convert these descriptions into final-form presentations. We present a solution based on the XiMPF document model and a component based system architecture.
Currently, fault management in Web Services orchestrating multiple suppliers relies on a local analysis, that does not span across individual services, thus limiting the effectiveness of recovery strategies. We propose to address this limitation by employing Model-Based Diagnosis to enhance fault analysis. In our approach, a Diagnostic Web Service is added to the set of Web Services providing the overall service, and acts as a supervisor of their execution, by identifying anomalies and explaining them in terms of faults to be repaired.
In the paper, we present an approach to mining a directed social network from a message board on the Internet where vertices denote individuals and directed links denote the flow of influence. The influence is measured based on propagating terms among individuals via messages. The distance with respect to contextual similarity between individuals is acquired since the influence indicates the degree of their shared interest represented as terms.
A profiling adversary is an adversary whose goal is to classify a population of users into categories according to messages they exchange. This adversary models the most common privacy threat against web based communication. We propose a new encryption scheme, called stealth encryption, that protects users from profiling attacks by concealing the semantic content of plaintext while preserving its grammatical structure and other non-semantic linguistic features, such as word frequency distribution. Given English plaintext, stealth encryption produces ciphertext that cannot efficiently be distinguished from normal English text (our techniques apply to other languages as well).
This paper proposes a method of generating XSLT scripts, which support the fast transformation of XML documents, given one-to-one matching relationships between leaf nodes of XML schemas. The proposed method enhances the transformation speed of generated XSLT scripts through reducing template calls. Experimental results show that the proposed method has generated XSLT scripts that support the faster transformation of XML documents, compared with previous works.
We report on a study of topic dynamics for pages visited by a sample of people using MSN Search. We examine the predictive accuracies of probabilistic models of topic transitions for individuals and groups of users. We explore temporal dynamics by comparing the accuracy of the models for predicting topic transitions at increasingly distant times in the future. Finally, we discuss directions for applying models of search topic dynamics.
Based on the type of collaborative objects, a collaborative filtering (CF) system falls into one of two categories: item-based CF and user-based CF. Clustering is the basic idea in both cases, where users or items are classified into user groups where users share similar preference or item groups where items have similar attributes or characteristics. Observing the fact that in user-based CF each user community is characterized by a Gaussian distribution on the ratings for each item and the fact that in item-based CF the ratings of each user in item community satisfy a Gaussian distribution, we propose a method of probabilistic model estimation for CF, where objects (user or items) are classified into groups based on the content information and ratings at the same time and predictions are made considering the Gaussian distribution of ratings. Experiments on a real-world data set illustrate that our approach is favorable.
Taxonomies of the Web typically have hundreds of thousands of categories and skewed category distribution over documents. It is not clear whether existing text classification technologies can perform well on and scale up to such large-scale applications. To understand this, we conducted the evaluation of several representative methods (Support Vector Machines, k-Nearest Neighbor and Naive Bayes) with Yahoo! taxonomies. In particular, we evaluated the effectiveness/efficiency tradeoff in classifiers with hierarchical setting compared to conventional (flat) setting, and tested popular threshold tuning strategies for their scalability and accuracy in large-scale classification problems.
Automatically classifying the Web directories is an effective way to manage Web information. However, our experiments showed that the state-of-the-art text classification technologies could not lead to acceptable performance in this task. Due to our analysis, the main problem is the lack of effective training data in rare categories of Web directories. To tackle this problem, we proposed a novel technology named Site Abstraction to synthesize new training examples from the website of the existing training document. The main idea is to propagate features through parent-child relationship in the sitemap tree. Experiments showed that our method significantly improved the classification performance.
The need for e-learning systems that support a diverse set of pedagogical requirements has been identified as an important issue in web-based education. Until now, significant R&D effort has been devoted aiming towards web-based educational systems tailored to specific pedagogical approaches. The most advanced of them are based on the IEEE Learning Technology Systems Architecture and use standardized content structuring based on the ADL Sharable Content Object Reference Model in order to enable sharing and reusability of the learning content. However, sharing of learning activities among different web-based educational systems still remains an open issue. The open question is how web-based educational systems should be designed in order to enable reusing and repurposing of learning activities. In this paper we propose an authoring system, refered to as ASK-LDT that utilizes the Learning Design principles to provide the means for designing activity-based learning services and systems.
Previous work on content extraction utilized various heuristics such as link to text ratio, prominence of tables, and identification of advertising. Many of these heuristics were associated with "settings", whereby some heuristics could be turned on or off and others parameterized by minimum or maximum threshold values. A given collection of settings -- such as removing table cells with high linked to non-linked text ratios and removing all apparent advertising -- might work very well for a news website, but leave little or no content left for the reader of a shopping site or a web portal We present a new technique, based on incrementally clustering websites using search engine snippets, to associate a newly requested website with a particular "genre", and then employ settings previously determined to be appropriate for that genre, with dramatically improved content extraction results overall.
Constructing and maintaining semantic mappings are necessary but troublesome in data sharing systems. While most current work focuses on seeking automated techniques to solve this problem, this paper proposes a combination model for constructing extensible mappings between XML schemas. In our model, complex global mappings are constructed by first defining simple atomic mappings for each target schema element, and then combining them using a few basic operators. At the same time, we provide automated support for constructing such combined mappings.
Finding all the occurrences of a twig pattern in an XML database is a core operation for efficient evaluation of XML queries. A number of algorithms have been proposed to process a twig query based on region encoding. In this paper, based on a novel labeling scheme: extended Dewey, we propose a novel and efficient holistic twig join algorithm, namely TJFast. Compared to previous work, our algorithm only needs to access the labels of leaf query nodes. We report our experimental results to show that our algorithms are superior to previous approaches in terms of the number of elements scanned and query performance.
Security is one of the major concerns when developing mission-critical business applications, and this concern motivated the Web Services Security specifications. However, the existing tools to configure the security properties of Web Services give a technology-oriented view; only assisting in choosing data to encrypt and the encryption algorithms to use. A user must manually bridge the gap between the security requirements and the configuration, which could cause extra configuration costs and lead to potential misconfiguration hazards. To ease this situation, we came up with refining security requirements from business to technology, leveraging the concepts of Service-Oriented Architecture (SOA) and Model-Driven Architecture (MDA). Security requirements are gradually transformed to more detailed ones or countermeasures by bridging the gap between them by using best practice patterns.
This paper proposes to extend a previous work, The Effect of the Back Button in a Random Walk: Application for PageRank [5]. We introduce an enhanced version of the PageRank algorithm using a realistic model for the Back button, thus improving the random surfer model. We show that in the special case where the history is bound to an unique page (you cannot use the Back button twice in a row), we can produce an algorithm that does not need much more resources than a standard PageRank. This algorithm, BackRank, can converge up to 30% faster than a standard PageRank and suppress most of the drawbacks induced by the existence of pages without links.
In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages or aggregation of web pages into web communities. Using these logical information units has been shown to improve the performance of many web algorithms. In this paper, we focus on a type of logical information units called compound documents. We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former algorithm supervising the latter one. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.
In an environment of distributed text collections, the first step in the information retrieval process is to identify which of all available collections are more relevant to a given query and which should thus be accessed to answer the query. We address the challenge of collection selection when there is full or partial overlap between the available text collections, a scenario which has not been examined previously despite its real-world applications. To that end, we present COSCO, a collection selection approach which uses collection-specific coverage and overlap statistics. We describe our experimental results which show that the presented approach displays the desired behavior of retrieving more new results early on in the collection order, and performs consistently and significantly better than CORI, previously considered to be one of the best collection selection systems.
This paper proposes an effective Web services (WS) transaction management framework to automatically manage inconsistencies occurred by relaxing isolation of WS transactions.
WS-* specifications cover a variety of issues ranging from security and reliability to transaction support in web services. However, these specifications do not address web service compositions. On the other hand, BPEL as the future standard web service composition language allows the specification of the functional part of the composition as a business process but it fails short in expressing non-functional properties such as security, reliability and persistence. In this paper, we propose an approach for the transparent integration of technical concerns in web service compositions. Our approach is driven by the analogy between web services and software components and is inspired from server-side component models such as Enterprise Java Beans. The main components of our framework are the process container, the middleware services and the deployment descriptor.
This paper proposes the AN.P2P architecture to facilitate efficient peer-to-peer content delivery with heterogeneous presentation requirements. In general, the AN.P2P enables a peer to deliver the original content objects and an associated workflow to other peers. The workflow is composed of content adaptation tasks. Hence, the recipient can reuse the original object to generate appropriate presentations for other peers.
With the page explosion of WWW, how to cover more useful information with limited storage and computation resources becomes more and more important in web IR research. Using web page non-content feature analysis, we proposed a clustering-based method to select high quality pages from the whole page set. Although the result page set contains only 44.3% of the whole collection, it is related with more than 98% of links and covers about 90% of key information. Link property and retrieval affects are also observed and experiment results show that key resource selection method is more suitable for the job of data cleansing and the result page set outperforms the whole collection by smaller size and better retrieval performance.
Rapid pervasion of the web into users' daily lives has put much importance on capturing location-specific information on the web, due to the fact that most human activities occur locally around where a user is located. This is especially true in the increasingly popular mobile and local search environments. Thus, how to correctly and effectively detect locations from web resources has become a key challenge to location-based web applications. In this paper, we first explicitly distinguish the locations of web resources into three types to cater to different application needs: 1) provider location; 2) content location; and 3) serving location. Then we describe a unified system that computes each of the three locations, employing a set of algorithms and different geographic sources.
This paper investigates basic research issues that need to be addressed for developing an architecture that enables repurposing of learning objects in a flexible way. Currently, there are a number of Learning Object Content Models (e.g. the SCORM Content Aggregation Model) that define learning objects and their components in a more or less precise way. However, these models do not allow repurposing of fine-grained components (sentences, images). We developed an ontology-based solution for content repurposing. The ontology is a solid basis for an architecture that will enable on-the-fly access to learning object components and that will facilitate repurposing these components.
Nowadays, Web activities have become daily practice for people. It is therefore essential to organize and present this continuously increasing Web information in a more usable manner. In this paper, we developed a novel approach to reorganize personal Web information as a topic-oriented interface. In our approach, we proposed to utilize anchor, title and URL information to represent content information for the browsed Web pages rather than the content body. Furthermore, we explored three methods to organize personal Web information: 1) top-down statistical clustering; 2) salience phrase based clustering; and 3) support vector machine (SVM) based classification. Finally, we conducted a usability study to verify the effectiveness of our proposed solution. The experimental results demonstrated that users could visit the pages that have been browsed previously more easily with our approach than existing solutions.
We propose a new browsing system called "Web2Talkshow". It transforms declarative-based web content into humorous dialog-based TV-program-like content that is presented through cartoon animation and synthesized speech. The system does this based on keywords in the original web content. Web2Talkshow enable users to get desired web content easily, pleasantly, and in a user-friendly way while being able to continue working on other tasks. Thus, using it will be much like watching TV.
Web accessibility consists on a set of checkpoints which are rather expensive to evaluate or to spot. However, using W3C technologies, this cost can be clearly minimized. This article presents a W3C formalized rule-set version for automatable checkpoints from WCAG 1.0.
In this paper, we describe a method for understanding the function of web elements. It classifies web elements into five functional categories: Content (C), Related Links (R), Navigation and Support (N), Advertisement (A) and Form (F). We construct five graphs for a web page, and each graph is designed such that most of the probability mass of the stationary distribution is concentrated in nodes belong to its corresponding category. We perform random walks on these graphs until convergence and classify based on its rank value in different graphs. Our experiment shows that the new method performed very well comparing to basic machine learning methods.
It has been a few years since the semantic Web was initiated by W3C, but its status has not been quantitatively measured. It is crucial to understand the status at this early stage, for researchers, developers and administrators to gain insight into what will come in this field. The objective of our work is to quantitatively measure and present the status of the semantic Web. We conduct a longitudinal study on the semantic Web pages to track trends in the use of semantic markup languages. This paper presents early results of this study with two historical data sets from October 2003 and October 2004. Our results show that while it is very early stage of semantic Web adoption, its growth outpaces that of the entire Web for the period. Also, RDF (Resource Description Framework) has dominated among semantic markup languages, taking about 98% of all semantic pages on the Web. It has been used in a variety of metadata annotation applications. This study shows that the most popular application is RSS (RDF Site Summary) for syndicating news and blogs, which takes more than 60% of all semantic Web pages. It also shows that the use of OWL (Web Ontology Language) which was recommended by W3C in early 2004 has been increased 900% for the period.
As the Web is increasingly used as a platform for heterogeneous applications, we are faced with new requirements to authentication, authorization and identity management. Modern architectures have to control access not only to single, isolated systems, but to whole business-spanning federations of applications and services. This task is complicated by the diversity of today's specifications concerning e.g. privacy, system integrity and distribution in the web. As an approach to such problems, in this paper, we introduce a solution catalogue of reusable building blocks for Identity and Access Management (IAM). The concepts of these blocks have been realized in a configurable system that supports IAM solutions for Web-based applications.
XQBE (XQuery By Example, [1]), a visual dialect of XQuery, uses hierarchical structures to express transformations between XML documents. XSLT, the standard transformation language for XML, is increasingly popular among programmers and Web developers for separating the application and presentation layers of Web applications. However, its syntax and its rule-based execution paradigm are rather intricate, and the number of XSLT experts is limited; the availability of easier "dialects" could be extremely valuable and may contribute to the adoption of XML for developing data-centered Web applications and services. With this motivation in mind, we adapted XQBE to serve as a visual interface for expressing XML-to-XML transformations and generate the XSLT code that performs such transformations.
We exploit the recently proposed Concept Abduction inference service in Description Logics to solve Concept Covering problems. We propose a framework and polynomial greedy algorithm for semantic based automated Web service orchestration, fully compliant with Semantic Web technologies. We show the proposed approach is able to deal with not exact solutions, computing an approximate orchestration with respect to an agent request modeled a subset of OWL-DL.
Order-based queries over XML data include XPath navigation axes such as following-sibling and following. In this paper, we present holistic algorithms that evaluate such order-based queries. An experimental comparison with previous approaches shows the performance benefits of our algorithms.
Markup languages, representations, schemas, and tools have significantly increased the ability for organizations to share their information. Languages, such as the Extensible Markup Language (XML), provide a vehicle for organizations to represent information in a common, machine-interpretable format. Although these approaches facilitate the collaboration and integration of inter-organizational information, the reality is that the schema representations behind these languages are reasonably difficult to learn, and automated schema integration (without semantics or ontology mappings) is currently an open problem. In this paper, we introduce an architecture and service-oriented infrastructure to facilitate organizational collaboration that combines the push features of the publish/subscribe protocol with storage of distributed registry capabilities.
The capability to change user agent while working is starting to appear in state of the art mobile computing due to the proliferation of different kinds of devices, ranging from personal wireless devices to desktop computers, and to the consequent necessity of migrating working sessions from a device to a more apt one. Research results related to the hand-off at low level are not sufficient to solve the problem at application level. The paper presents a scheme for session hand-off in Web applications which, by exploiting a proxy-based architecture, is able to work without interventions on existing code.
While the idea that querying mechanisms for complex relationships (otherwise known as Semantic Associations) should be integral to Semantic Web search technologies has recently gained some ground, the issue of how search results will be ranked remains largely unaddressed. Since it is expected that the number of relationships between entities in a knowledge base will be much larger than the number of entities themselves, the likelihood that Semantic Association searches would result in an overwhelming number of results for users is increased, therefore elevating the need for appropriate ranking schemes. Furthermore, it is unlikely that ranking schemes for ranking entities (documents, resources, etc.) may be applied to complex structures such as Semantic Associations. In this paper, we present an approach that ranks results based on how predictable a result might be for users. It is based on a relevance model SemRank, which is a rich blend of semantic and information-theoretic techniques with heuristics that supports the novel idea of modulative searches, where users may vary their search modes to effect changes in the ordering of results depending on their need. We also present the infrastructure used in the SSARK system to support the computation of SemRank values for resulting Semantic Associations and their ordering.
A novel framework for eliciting a user's conceptualization based on an ontology-driven dialog is presented here. It has been integrated in an RDF/OWL-based architecture of an adaptive learning content management system. The implemented framework is illustrated with an application scenario to deal with the cold start problem and to enable tailoring the system's behavior to the needs of each individual user.
We present a system that gathers and analyzes online discussion as it relates to consumer products. Weblogs and online message boards provide forums that record the voice of the public. Woven into this discussion is a wide range of opinion and commentary about consumer products. Given its volume, format and content, the appropriate approach to understanding this data is large-scale web and text data mining. By using a wide variety of state-of-the-art techniques including crawling, wrapping, text classification and computational linguistics, online discussion is gathered and annotated within a framework that provides for interactive analysis that yields marketing intelligence for our customers.
We present the design of Dynabot, a guided Deep Web discovery system. Dynabot's modular architecture supports focused crawling of the Deep Web with an emphasis on matching, probing, and ranking discovered sources using two key components: service class descriptions and source-biased analysis. We describe the overall architecture of Dynabot and discuss how these components support effective exploitation of the massive Deep Web data available.
We describe a framework of bootstrapped hypothesis testing for estimating the confidence in one web search engine outperforming another over any randomly sampled query set of a given size. To validate this framework, we have constructed and made available a precision-oriented test collection consisting of manual binary relevance judgments for each of the top ten results of ten web search engines across 896 queries and the single best result for each of those queries. Results from this bootstrapping approach over typical query set sizes indicate that examining repeated statistical tests is imperative, as a single test is quite likely to find significant differences that do not necessarily generalize. We also find that the number of queries needed for a repeatable evaluation in a dynamic environment such as the web is much higher than previously studied.
We propose a set of optimization techniques, collectively called Wireless SOAP (WSOAP), to compress SOAP messages transmitted across a wireless link. The Name Space Equivalency technique rests on the observation that exact recovery of compressed messages is not required at the receiver; an equivalent form suffices. The WSDL Aware Encoding technique obtains further savings by utilizing knowledge of the underlying WSDL by means of an offline protocol we define. We summarize the design, implementation and performance of our Wireless SOAP prototype, and show that Wireless SOAP can reduce message sizes by 3x-12x compared to SOAP.
The Web has established itself as the largest public data repository ever available. Even though the vast majority of information on the Web is formatted to be easily readable by the human eye, "meaningful information" is still largely inaccessible for the computer applications. In this paper we present the METEOR system which utilizes various presentation and linkage regularities from referral lists of various sorts to automatically separate and extract metadata and instance information. Experimental results for the university domain with 12 computer science department Web sites, comprising 361 individual faculty and course home pages indicate that the performance of the metadata and
We propose extensions to existing web protocols that allow proofs of authenticity of HTTP server responses, whether or not the HTTP server is under the control of the publisher. These extensions protect users from content that may be substituted by malicious servers, and therefore have immediate applications in improving the security of web caching, mirroring, and relaying systems that rely on untrusted machines [2,4]. Our proposal relies on Merkle trees to support 200 and 404 response authentication while requiring only a single cryptographic hash of trusted data per repository. While existing web protocols such as HTTPS can provide authenticity guarantees (in addition to confidentiality), HTTPS consumes significantly more computational resources, and requires that the hosting server act without malice in generating responses and in protecting the publisher's private key.
We propose a Web search site called "Cyclone", in which a user can retrieve encyclopedic term descriptions on the Web. Cyclone searches the Web for headwords and page fragments describing the headwords. High-quality page fragments are selected as term descriptions and are classified into domains. The number of current headwords is over 700,000.
We propose a technique for the automated synthesis of new composite web services. Given a set of abstract bpel4ws descriptions of component services, and a composition requirement, we automatically generate a concrete bpel4ws process that, when executed, interacts with the components and satisfies the requirement. We implement the proposed approach exploiting efficient representation techniques, and we show its scalability over case studies taken from a real world application and over a parameterized domain.
With the fast increase in Web activities, Web data mining has recently become an important research topic. However, most previous studies of mining path traversal patterns are based on the model of a uniform support threshold without taking into consideration such important factors as the length of a pattern, the positions of Web pages, and the importance of a particular pattern, etc. In view of this, we study and apply the Markov chain model to provide the determination of support threshold of Web documents. Furthermore, by properly employing some techniques devised for joining reference sequences, a new mining procedure of Web traversal patterns is proposed in this paper.
Focused crawlers are considered as a promising way to tackle the scalability problem of topic-oriented or personalized search engines. To design a focused crawler, the choice of strategy for prioritizing unvisited URLs is crucial. In this paper, we propose a method using a decision tree on anchor texts of hyperlinks. We conducted experiments on the real data sets of four Japanese universities and verified our approach.
This talk first gives an overview of an XML project for e-Local Governments, which is under the auspices of MIAC (Ministry of Internal Affairs and Communications) of Japan. This talk then focuses on schema authoring and user interfaces. In particular, the use of four schema languages, namely RELAX NG, W3C XML Schema, DTD, and Schematron, is highlighted.
We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is utilized in various applications including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in both the contexts of sliding, and landmark stream windows. We run a comprehensive set of experiments, using both real and synthetic click streams, to evaluate the performance of the proposed solution. The results demonstrate that the proposed solution yields extremely low error rates.
The demand for quickly delivering new applications is increasingly becoming a business imperative today. Application development is often done in an ad hoc manner, without standard frameworks or libraries, thus resulting in poor reuse of software assets. Web services have received much interest in industry due to their potential in facilitating seamless business-to-business or enterprise application integration. A web services composition tool can help automate the process, from creating business process functionality, to developing executable workflows, to deploying them on an execution environment. However, we find that the main approaches taken thus far to standardize and compose web services are piecemeal and insufficient. The business world has adopted a (distributed) programming approach in which web service instances are described using WSDL, composed into flows with a language like BPEL and invoked with the SOAP protocol. Academia has propounded the AI approach of formally representing web service capabilities in ontologies, and reasoning about their composition using goal-oriented inferencing techniques from planning. We present the first integrated work in composing web services end to end from specification to deployment by synergistically combining the strengths of the above approaches. We describe a prototype service creation environment along with a use-case scenario, and demonstrate how it can significantly speed up the time-to-market for new services.
The recent evolution of Internet, driven by the Web services technology, is extending the role of the Web from a support of information interaction to a middleware for B2B interactions. Indeed, the Web services technology allows enterprises to outsource parts of their business processes using Web services. And it also provides the opportunity to dynamically offer new value-added services through the composition of pre-existing Web services. In spite of the growing interest in Web services, current technologies are found lacking efficient transactional support for composite Web services (CSs). In this paper, we propose a transactional approach to ensure the failure atomicity, of a CS, required by partners. We use the Accepted Termination States (ATS) property as a mean to express the required failure atomicity. Partners specify their CS, mainly its control flow, and the required ATS. Then, we use a set of transactional rules to assist designers to compose a valid CS with regards to the specified ATS.
We present a language for specifying web service interfaces. A web service interface puts three kinds of constraints on the users of the service. First, the interface specifies the methods that can be called by a client, together with types of input and output parameters; these are called signature constraints. Second, the interface may specify propositional constraints on method calls and output values that may occur in a web service conversation; these are called consistency constraints. Third, the interface may specify temporal constraints on the ordering of method calls; these are called protocol constraints. The interfaces can be used to check, first, if two or more web services are compatible, and second, if a web service A can be safely substituted for a web service B. The algorithm for compatibility checking verifies that two or more interfaces fulfill each others' constraints. The algorithm for substitutivity checking verifies that service A demands fewer and fulfills more constraints than service B.
We present an approach in which the semantics of an XML language is defined by means of a transformation from an XML document model (an XML schema) to an application specific model. The application specific model implements the intended behavior of documents written in the language. A transformation is specified in a model transformation language used in the Model Driven Architecture (MDA) approach for software development. Our approach provides a better separation of three concerns found in XML applications: syntax, syntax processing logic and intended meaning of the syntax. It frees the developer of low-level syntactical details and improves the adaptability and reusability of XML applications. Declarative transformation rules and the explicit application model provide a finer control over the application parts affected by adaptations. Transformation rules and the application model for an XML language may be composed with the corresponding rules and application models defined for other XML languages. In that way we achieve reuse and composition of XML applications.
As the Web becomes a platform for implementing B2B applications, the need arises of Web conceptual models for describing Web oriented workflow applications implementing business processes. In this context, new problems about process correctness arise, due to the loose control of Web applications upon the behavior of their Web clients. Indeed, incoherent user's behavior can lead to inconsistent processes. This paper presents a high level approach to the management of exceptions that occur during the execution of processes on the Web. We present a classification of exceptions that can be raised inside workflow-driven Web applications, and recovery policies to retrieve coherent status and data after an exception. We devise these concepts at high level and then we exploit them using a Web modeling language (WebML) that in turn provides development facilities like automatic code generation, validation of hypertext models, and so on. An industrial implementation experience is briefly presented too.
WebDAV needs awareness support in order to be a full-fledged collaboration system, This paper introduces AwareDAV, a new WebDAV extension framework enabling shared awareness through event notification. By extending the WebDAV protocol with seven new request-methods and an extensible XML based event subscription scheme, AwareDAV supports fine grained event subscriptions over a range of transport mechanisms and enables a wide range of collaboration scenarios. This paper describes the design of AwareDAV, its API, experiences with its initial implementation, as well as a comparison with Microsoft Exchange and WebDAV-notify.
The reasoning tasks that can be performed with semantic web service descriptions depend on the quality of the domain ontologies used to create these descriptions. However, building such domain ontologies is a time consuming and difficult task. We describe an automatic extraction method that learns domain ontologies for web service descriptions from textual documentations attached to web services. We conducted our experiments in the field of bioinformatics by learning an ontology from the documentation of the web services used in {sup:my}Grid, a project that supports biology experiments on the Grid. Based on the evaluation of the extracted ontology in the context of the project, we conclude that the proposed extraction method is a helpful tool to support the process of building domain ontologies for web service descriptions.
This paper discusses generating document structure from annotated media repositories in a domain-independent manner. This approaches the vision of a universal RDF browser. We start by applying the search-and-browse paradigm established for the WWW to RDF presentation. Furthermore, this paper adds to this paradigm the clustering-based derivation of document structure from search returns, providing simple but domain-independent hypermedia generation from RDF stores. While such generated presentations hardly meet the standards of those written by humans, they provide quick access to media repositories when the required document has not yet been written. The resulting system allows a user to specify a topic for which it generates a hypermedia document providing guided navigation through virtually any RDF repository. The impact for content providers is that as soon as one adds new media items and their annotations to a repository, they become immediately available for automatic integration into subsequently requested presentations.
We investigate the idea of finding semantically related search engine queries based on their temporal correlation; in other words, we infer that two queries are related if their popularities behave similarly over time. To this end, we first define a new measure of the temporal correlation of two queries based on the correlation coefficient of their frequency functions. We then conduct extensive experiments using our measure on two massive query streams from the MSN search engine, revealing that this technique can discover a wide range of semantically similar queries. Finally, we develop a method of efficiently finding the highest correlated queries for a given input query using far less space and time than the naive approach, making real-time implementation possible.
The interoperability among distributed and autonomous systems is the ultimate challenge facing the semantic web. Heterogeneity of data representation is the main source of problems. This paper proposes an innovative solution that combines lexical approaches and language games. The benefits for distributed annotation systems on the web are twofold: firstly, it will reduce the complexity of the semantic problem by moving the focus from the full-featured ontology level to the simpler lexicon level; secondly, it will avoid the drawback of a centralized third party mediator that may become a single point of failure. The main contributions of this work are concerned with: * providing a proof of concept that language games can be an effective solution to creating and managing a distributed process of agreement on a shared lexicon, * describing a fully distributed service oriented architecture for language games, * providing empirical evidence on a real world case study in the domain of ski mountaineering.
The difficulty of developing and deploying commercial web applications increases as the number of technologies they use increases and as the interactions between these technologies become more complex. This paper describes a way to avoid this increasing complexity by re-examining the basic requirements of web applications. Our approach is to first separate client concerns from server concerns, and then to reduce the interaction between client and server to its most elemental: parameter passing. We define a simplified programming model for form-based web applications and we use XForms and a subset of J2EE as enabling technologies. We describe our implementation of an MVC-based application builder for this model, which automatically generates the code needed to marshal input and output data between clients and servers. This marshalling uses type checking and other forms of validation on both clients and servers. We also show how our programming model and application builder support the customization of web applications for different execution targets, including, for example, different client devices.
In this work we present topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests. Though being detrimental to average accuracy, we show that our method improves user satisfaction with recommendation lists, in particular for lists generated using the common item-based collaborative filtering algorithm. Our work builds upon prior research on recommender systems, looking at properties of recommendation lists as entities in their own right rather than specifically focusing on the accuracy of individual recommendations. We introduce the intra-list similarity metric to assess the topical diversity of recommendation lists and the topic diversification approach for decreasing the intra-list similarity. We evaluate our method using book recommendation data, including offline analysis on 361,349 ratings and an online study involving more than 2,100 subjects.
The Rich News system, that can automatically annotate radio and television news with the aid of resources retrieved from the World Wide Web, is described. Automatic speech recognition gives a temporally precise but conceptually inaccurate annotation model. Information extraction from related web news sites gives the opposite: conceptual accuracy but no temporal data. Our approach combines the two for temporally accurate conceptual semantic annotation of broadcast news. First low quality transcripts of the broadcasts are produced using speech recognition, and these are then automatically divided into sections corresponding to individual news stories. A key phrases extraction component finds key phrases for each story and uses these to search for web pages reporting the same event. The text and meta-data of the web pages is then used to create index documents for the stories in the original broadcasts, which are semantically annotated using the KIM knowledge management platform. A web interface then allows conceptual search and browsing of news stories, and playing of the parts of the media files corresponding to each news story. The use of material from the World Wide Web allows much higher quality textual descriptions and semantic annotations to be produced than would have been possible using the ASR transcript directly. The semantic annotations can form a part of the Semantic Web, and an evaluation shows that the system operates with high precision, and with a moderate level of recall.
The unarguably fast, and continuous, growth of the volume of indexed (and indexable) documents on the Web poses a great challenge for search engines. This is true regarding not only search effectiveness but also time and space efficiency. In this paper we present an index pruning technique targeted for search engines that addresses the latter issue without disconsidering the former. To this effect, we adopt a new pruning strategy capable of greatly reducing the size of search engine indices. Experiments using a real search engine show that our technique can reduce the indices' storage costs by up to 60% over traditional lossless compression methods, while keeping the loss in retrieval precision to a minimum. When compared to the indices size with no
We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as: * Determining the set of categories in a given taxonomy spanned by the search results; * Finding the range of metadata values associated to the result set in order to enable "multi-faceted search;" * Estimating the size of the result set; * Data mining associations to the query terms. We present and analyze an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and show how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
This talk presents NTT's approach for realizing a Human-Centered Network. Last November, we announced the NTT Group's Medium-Term Management Strategy, which consists of three management objectives: (1) building the ubiquitous broadband market and helping achieve the e-Japan Strategy and the u-Japan Initiative; (2) building a safe, secure, and convenient communications network environment and broadband access infrastructure, while achieving a seamless migration from the legacy telephone network to the next generation network; and (3) striving to increase corporate value and achieve sustainable growth. Since the management strategy takes account of Japan's future social issues such as declining birthrate and aging population, the need to reduce the environmental load, etc, we believe that the R&D activities directed towards accomplishing these objectives consequently lead to the realization of a Human-Centered Network.
This paper describes XSQirrel, a new XML query language that transforms a document into a sub-document, i.e. a tree where the root-to-leaf paths are a subset of the root-to-leaf paths from the original document. We show that this type of queries is extremely useful for various applications (e.g. web services) and that the currently existing query languages are poorly equipped to express, reason and evaluate such queries. In particular, we emphasize the need to be able to compose such queries. We present the XSQirrel language with its syntax, semantics and two language specific operators, union and composition. For the evaluation of the language, we leverage well established query technologies by translating XSQirrel expressions into XPath programs, XQuery queries or XSLT stylesheets. We provide some experimental results that compare our various evaluation strategies. We also show the runtime benefits of query composition over sequential evaluation.
The increased importance of XML as a data representation format has led to several proposals for facilitating the development of applications that operate on XML data. These proposals range from runtime API-based interfaces to XML-based programming languages. The subject of this paper is XJ, a research language that proposes novel mechanisms for the integration of XML as a first-class construct into Java. The design goals of XJ distinguish it from past work on integrating XML support into programming languages -- specifically, the XJ design adheres to the XML Schema and XPath standards. Moreover, it supports in-place updates of XML data thereby keeping with the imperative nature of Java. We have built a prototype compiler for XJ, and our preliminary experiments demonstrate that the performance of XJ programs can approach that of traditional low-level API-based interfaces, while providing a higher level of abstraction.
Semantic caching is an important technology for improving the response time of future user queries specified over remote servers. This paper deals with the fundamental query containment problem in an XQuery-based semantic caching system. To our best knowledge, the impact of subtle differences in XQuery semantics caused by different ways of specifying variables on query containment has not yet been studied. We introduce the concept of variable binding dependencies for representing the hierarchical element dependencies preserved by an XQuery. We analyze the problem of XQuery containment in the presence of such dependencies. We propose a containment mapping technique for nested XQuery in presence of variable binding dependencies. The implication of the nested block structure on XQuery containment is also considered. We mention the performance gains achieved by a semantic caching system we build based on the proposed technique.
This paper describes the eBag infrastructure, which is a generic infrastructure inspired from work with school children who could benefit from a electronic schoolbag for collaborative handling of their digital material. The eBag infrastructure is utilizing the Context-aware HyCon framework and collaborative web services based on WebDAV. A ubiquitous login and logout mechanism has been built based on BlueTooth sensor networks. The eBag infrastructure has been tried out in field tests with school kids. In this paper we discuss experiences and design issues for ubiquitous Web integration in interactive school environments with multiple interactive whiteboards and workstations. This includes proposals for specialized and adaptive XLink structures for organizing school materials as well as issues in login/logout based on proximity of different display surfaces.
Online Curriculum Portals aim to support networks of instructors and learners by providing a space of convergence for enhancing peer-to-peer learning interactions among individuals of an educational institution. To this end, effective, open and scalable e-learning systems are required to acquire, store, and share knowledge under the form of learning objects (LO). In this paper, we are interested in exploiting the semantic relationships that characterize these LOs (e.g., prerequisite, part-of or see-also) in order to capture and access individual and group knowledge in conjunction with the learning processes supported by educational institutions. To achieve this functionality, Semantic Web (e.g., RDF/s) and declarative query languages (e.g., RQL) are employed to represent LOs and their relationships (e.g., LOM), as well as, to support navigation at the conceptual e-learning Portal space. In this way, different LOs could be presented to the same learners, according to the traversed schema navigation paths (i.e., learning paths). Using the Apache Jetspeed framework we are able to generate and assemble at run-time portlets (i.e., pluggable web components) for visualizing personalized views as dynamic web pages. Last but not least, both learners and instructors can employ the same Portal GUI for updating semantically described LOs and thus support an open-ended continuum of learning. To the best of our knowledge, the work presented in this paper is the first Online Curriculum Portal platform supporting the aforementioned functionality.
Whereas schools typically record mounds of data regarding student performance, attendance, and other behaviors over the course of a school year, rarely is that data consulted and used to inform day-to-day instructional practice in the classroom. As teachers come under increasing pressure to ensure success for all of their students, we are attempting to provide tools to help teachers make sense of what is happening in their classrooms and take appropriate proactive and/or remedial action. One such tool is a Web service we've dubbed the Classroom Sentinel. The Classroom Sentinel mines electronic gradebook and other student information system data sources to detect critical teaching and learning patterns and bring those patterns to the attention of the teacher in the form of timely alerts. In this paper, we introduce the notion of classroom patterns, present some examples, and describe a framework for alert generation and delivery.
Message hierarchies in web discussion boards grow with new postings. Threads of messages evolve as new postings focus within or diverge from the original themes of the threads. Thus, just by investigating the subject headings or contents of earlier postings in a message thread, one may not be able to guess the contents of the later postings. The resulting navigation problem is further compounded for blind users who need the help of a screen reader program that can provide only a linear representation of the content. We see that, in order to overcome the navigation obstacle for blind as well as sighted users, it is essential to develop techniques that help identify how the content of a discussion board grows through generalizations and specializations of topics. This knowledge can be used in segmenting the content in coherent units and guiding the users through segments relevant to their navigational goals. Our experimental results showed that the segmentation algorithm described in this paper provides up to 80-85% success rate in labeling messages. The algorithm is being deployed in a software system to reduce the navigational load of blind students in accessing web-based electronic course materials; however, we note that the techniques are equally applicable for developing web indexing and summarization tools for users with sight.
We present GlobeDB, a system for hosting Web applications that performs autonomic replication of application data. GlobeDB offers data-intensive Web applications the benefits of low access latencies and reduced update traffic. The major distinction in our system compared to existing edge computing infrastructures is that the process of distribution and replication of application data is handled by the system automatically with very little manual administration. We show that significant performance gains can be obtained this way. Performance evaluations with the TPC-W benchmark over an emulated wide-area network show that GlobeDB reduces latencies by a factor of 4 compared to non-replicated systems and reduces update traffic by a factor of 6 compared to fully replicated systems.
Without the proliferation of formal semantic annotations, the Semantic Web is certainly doomed to failure. In earlier work we presented a new paradigm to avoid this: the 'Self Annotating Web', in which globally available knowledge is used to annotate resources such as web pages. In particular, we presented a concrete method instantiating this paradigm, called PANKOW (Pattern-based ANnotation through Knowledge On the Web). In PANKOW, a named entity to be annotated is put into several linguistic patterns that convey competing semantic meanings. The patterns that are matched most often on the Web indicate the meaning of the named entity -- leading to automatic or semi-automatic annotation. In this paper we present C-PANKOW (Context-driven PANKOW), which alleviates several shortcomings of PANKOW. First, by downloading abstracts and processing them off-line, we avoid the generation of large number of linguistic patterns and correspondingly large number of Google queries. Second, by linguistically analyzing and normalizing the downloaded abstracts, we increase the coverage of our pattern matching mechanism and overcome several limitations of the earlier pattern generation process. Third, we use the annotation context in order to distinguish the significance of a pattern match for the given annotation task. Our experiments show that C-PANKOW inherits all the advantages of PANKOW (no training required etc.), but in addition it is far more efficient and effective.
The Web has become an excellent source for gathering consumer opinions. There are now numerous Web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. This paper focuses on online customer reviews of products. It makes two contributions. First, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. A prototype system called Opinion Observer is also implemented. The system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. This comparison is useful to both potential customers and product manufacturers. For a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. For a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. Second, a new technique based on language pattern mining is proposed to extract product features from Pros and Cons in a particular type of reviews. Such features form the basis for the above comparison. Experimental results show that the technique is highly effective and outperform existing methods significantly.
Internet users now rely on a whole arsenal of tools to protect their security and privacy. Experts recommend that computer users install personal firewalls, anti-virus software, spyware blockers, spam filters, cookie managers, and a variety of other tools to keep themselves safe. Users are told to pick hard-to-guess passwords, use a different password at every Web site, and not to write any of their passwords down. They are told to read privacy policies before providing personal information to Web sites, look for lock icons before typing in a credit card number, refrain from opening email attachments from people they don't know, and even to think twice about opening email attachments from people they do know. With so many do's and don'ts, it is not surprising that much of this advice is ignored. In this talk I will highlight usability problems that make it difficult for people to protect their privacy and security on the Web, and I will discuss a number of approaches to addressing these problems.
Currently, the vast majority of web sites do not support accessibility for visually impaired users. Usually, these users have to rely on screen readers: applications that sequentially read the content of a web page in audio. Unfortunately, screen readers are not able to detect the meaning of the different page objects, and thus the implicit semantic knowledge conveyed in the presentation of the page is lost. One approach described in literature to tackle this problem, is the Dante approach, which allows semantic annotation of web pages to provide screen readers with extra (semantic) knowledge to better facilitate the audio presentation of a web page. Until now, such annotations were done manually, and failed for dynamic pages. In this paper, we combine the Dante approach with a web design method, WSDM, to fully automate the generation of the semantic annotation for visually impaired users. To do so, the semantic knowledge gathered during the design process is exploited, and the annotations are generated as a by-product of the design process, requiring no extra effort from the designer.
We present a usage consultation tool, based on Internet searching, for language learners. When a user enters a string of words for which he wants to find usages, the system sends this string as a query to a search engine and obtains search results about the string. The usages are extracted by performing statistical analysis on snippets and then fed back to the user. Unlike existing tools, this usage consultation tool is multi-lingual, so that usages can be obtained even in a language for which there are no well-established analytical methods. Our evaluation has revealed that usages can be obtained more effectively than by only using a search engine directly. Also, we have found that the resulting usage does not depend on the search engine for a prominent usage when the amount of data downloaded from the search engine is increased.
Portlets (i.e. multi-step, user-facing applications to be syndicated within a portal) are currently supported by most portal frameworks. However, there is not yet a definitive answer to portlet interoperation whereby data flows smoothly from one portlet to a neighbouring one. Both data-based and API-based approaches exhibit some drawbacks in either the limitation of the sharing scope or the standardization effort required. We argue that these limitations can be overcome by using deep annotation techniques. By providing additional markup about the background services, deep annotation strives to interact with these underlying services rather than with the HTML surface that conveys the markup. In this way, the portlet producer can extend a portlet markup, a fragment, with data about the processes whose rendering this fragment supports. Then, the portlet consumer (e.g. a portal) can use deep annotation to map an output process in fragment A to an input process in fragment B. This mapping results in fragment B having its input form (or other "input" widget) filled up. We consider deep annotation as particularly valid for portlet interoperation due to the controlled and cooperative environment that characterizes the portal setting.
As the competition of Web search market increases, there is a high demand for personalized Web search to conduct retrieval incorporating Web users' information needs. This paper focuses on utilizing clickthrough data to improve Web search. Since millions of searches are conducted everyday, a search engine accumulates a large volume of clickthrough data, which records who submits queries and which pages he/she clicks on. The clickthrough data is highly sparse and contains different types of objects (user, query and Web page), and the relationships among these objects are also very complicated. By performing analysis on these data, we attempt to discover Web users' interests and the patterns that users locate information. In this paper, a novel approach CubeSVD is proposed to improve Web search. The clickthrough data is represented by a 3-order tensor, on which we perform 3-mode analysis using the higher-order singular value decomposition technique to automatically capture the latent factors that govern the relations among these multi-type objects: users, queries and Web pages. A tensor reconstructed based on the CubeSVD analysis reflects both the observed interactions among these objects and the implicit associations among them. Therefore, Web search activities can be carried out based on CubeSVD analysis. Experimental evaluations using a real-world data set collected from an MSN search engine show that CubeSVD achieves encouraging search results in comparison with some standard methods.
There has been recent interests in studying the "goal" behind a user's Web query, so that this goal can be used to improve the quality of a search engine's results. Previous studies have mainly focused on using manual query-log investigation to identify Web query goals. In this paper we study whether and how we can automate this goal-identification process. We first present our results from a human subject study that strongly indicate the feasibility of automatic query-goal identification. We then propose two types of features for the goal-identification task: user-click behavior and anchor-link distribution. Our experimental evaluation shows that by combining these features we can correctly identify the goals for 90% of the queries studied.
Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web. In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
A fair contract signing protocol allows two potentially mistrusted parities to exchange their commitments (i.e., digital signatures) to an agreed contract over the Internet in a fair way, so that either each of them obtains the other's signature, or neither party does. Based on the RSA signature scheme, a new digital contract signing protocol is proposed in this paper. Like the existing RSA-based solutions for the same problem, our protocol is not only fair, but also optimistic, since the third trusted party is involved only in the situations where one party is cheating or the communication channel is interrupted. Furthermore, the proposed protocol satisfies a new property, i.e., it is abuse-free. That is, if the protocol is executed unsuccessfully, none of the two parties can show the validity of intermediate results to others. Technical details are provided to analyze the security and performance of the proposed protocol. In summary, we present the first abuse-free fair contract signing protocol based on the RSA signature, and show that it is both secure and efficient.
Reputation systems have been popular in estimating the trustworthiness and predicting the future behavior of nodes in a large-scale distributed system where nodes may transact with one another without prior knowledge or experience. One of the fundamental challenges in distributed reputation management is to understand vulnerabilities and develop mechanisms that can minimize the potential damages to a system by malicious nodes. In this paper, we identify three vulnerabilities that are detrimental to decentralized reputation management and propose TrustGuard -- a safeguard framework for providing a highly dependable and yet efficient reputation system. First, we provide a dependable trust model and a set of formal methods to handle strategic malicious nodes that continuously change their behavior to gain unfair advantages in the system. Second, a transaction based reputation system must cope with the vulnerability that malicious nodes may misuse the system by flooding feedbacks with fake transactions. Third, but not least, we identify the importance of filtering out dishonest feedbacks when computing reputation-based trust of a node, including the feedbacks filed by malicious nodes through collusion. Our experiments show that, comparing with existing reputation systems, our framework is highly dependable and effective in countering malicious nodes regarding strategic oscillating behavior, flooding malevolent feedbacks with fake transactions, and dishonest feedbacks.
While overall bandwidth in the internet has grown rapidly over the last few years, and an increasing number of clients enjoy broadband connectivity, many others still access the internet over much slower dialup or wireless links. To address this issue, a number of techniques for optimized delivery of web and multimedia content over slow links have been proposed, including protocol optimizations, caching, compression, and multimedia transcoding, and several large ISPs have recently begun to widely promote dialup acceleration services based on such techniques. A recent paper by Rhea, Liang, and Brewer proposed an elegant technique called value-based caching that caches substrings of files, rather than entire files, and thus avoids repeated transmission of substrings common to several pages or page versions. We propose and study a hierarchical substring caching technique that provides significant savings over this basic approach. We describe several additional techniques for minimizing overheads and perform an evaluation on a large set of real web access traces that we collected. In the second part of our work, we compare our approach to a widely studied alternative approach based on delta compression, and show how to integrate the two for best overall performance. The studied techniques are typically employed in a client-proxy environment, with each proxy serving a large number of clients, and an important aspect is how to conserve resources on the proxy while exploiting the significant memory and CPU power available on current clients.
Server-side programming is one of the key technologies that support today's WWW environment. It makes it possible to generate Web pages dynamically according to a user's request and to customize pages for each user. However, the flexibility obtained by server-side programming makes it much harder to guarantee validity and security of dynamically generated pages. To check statically the properties of Web pages generated dynamically by a server-side program, we develop a static program analysis that approximates the string output of a program with a context-free grammar. The approximation obtained by the analyzer can be used to check various properties of a server-side program and the pages it generates. To demonstrate the effectiveness of the analysis, we have implemented a string analyzer for the server-side scripting language PHP. The analyzer is successfully applied to publicly available PHP programs to detect cross-site scripting vulnerabilities and to validate pages they generate dynamically.
Many modern natural language-processing applications utilize search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries -- they are not well suited to support such applications. As a result, these applications are forced to issue millions of successive queries resulting in unnecessary search engine load and in slow applications with limited scalability. In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query "powerful " BE will return all the nouns in its index that immediately follow the word "powerful", sorted by frequency. In response to the query "Cities such as ProperNoun(Head())", BE will return a list of proper nouns likely to be city names. BE's novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE's space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries.
Semantic Portal is the next generation of web portals that are powered by Semantic Web technologies for improved information sharing and exchange for a community of users. Current methods of searching in Semantic Portals are limited to keyword-based search using information retrieval (IR) techniques, ontology-based formal query and reasoning, or a simple combination of the two. In this paper, we propose an enhanced model that tightly integrates IR with formal query and reasoning to fully utilize both textual and semantic information for searching in Semantic Portals. The model extends the search capabilities of existing methods and can answer more complex search requests. The ideas in a fuzzy description logic (DL) IR model and a formal DL query method are employed and combined in our model. Based on the model, a semantic search service is implemented and evaluated. The evaluation shows very large improvements over existing methods.
Say you are looking for information about a particular person. A search engine returns many pages for that person's name but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? This paper presents two unsupervised frameworks for solving this problem: one based on link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering (A/CDC) -- an application of a recently introduced multi-way distributional clustering method. To evaluate our methods, we collected and hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on 12 personal names appearing together in someones in an email folder. On this dataset our methods outperform traditional agglomerative clustering by more
Computer users are asked to generate, keep secret, and recall an increasing number of passwords for uses including host accounts, email servers, e-commerce sites, and online financial services. Unfortunately, the password entropy that users can comfortably memorize seems insufficient to store unique, secure passwords for all these accounts, and it is likely to remain constant as the number of passwords (and the adversary's computational power) increases into the future. In this paper, we propose a technique that uses a strengthened cryptographic hash function to compute secure passwords for arbitrarily many accounts while requiring the user to memorize only a single short password. This mechanism functions entirely on the client; no server-side changes are needed. Unlike previous approaches, our design is both highly resistant to brute force attacks and nearly stateless, allowing users to retrieve their passwords from any location so long as they can execute our program and remember a short secret. This combination of security and convenience will, we believe, entice users to adopt our scheme. We discuss the construction of our algorithm in detail, compare its strengths and weaknesses to those of related approaches, and present Password Multiplier, an implementation in the form of an extension to the Mozilla Firefox web browser.
Website privacy policies state the ways that a site will use personal identifiable information (PII) that is collected from fields and forms in web-based transactions. Since these policies can be complex, machine-readable versions have been developed that allow automatic comparison of a site's privacy policy with a user's privacy preferences. However, it is still difficult for users to determine the cause and origin of conformance conflicts, because current standards operate at the page level -- they can only say that there is a conflict on the page, not where the conflict occurs or what causes it. In this paper we describe fine-grained policy anchors, an extension to the way a website implements the Platform for Privacy Preferences (P3P), that solves this problem. Fine grained policy anchors enable field-level comparisons of policy and preference, field-specific conformance displays, and faster access to additional conformance information. We built a prototype user agent based on these extensions and tested it with representative users. We found that fine-grained anchors do help users understand how privacy policy relates to their privacy preferences, and where and why conformance conflicts occur.
Existing Web browsers handle security errors in a manner that often confuses users. In particular, when a user visits a secure site whose certificate the browser cannot verify, the browser typically allows the user to view and install the certificate and connect to the site despite the verification failure. However, few users understand the risk of man-in-the-middle attacks and the principles behind certificate-based authentication. We propose context-sensitive certificate verification (CSCV), whereby the browser interrogates the user about the context in which a certificate verification error occurs. Considering the context, the browser then guides the user in handling and possibly overcoming the security error. We also propose specific password warnings (SPW) when users are about to send passwords in a form vulnerable to eavesdropping. We performed user studies to evaluate CSCV and SPW. Our results suggest that CSCV and SPW can greatly improve Web browsing security and are easy to use even without training. Moreover, CSCV had greater impact than did staged security training.
Web performance measurements and availability tests have been carried out using a variety of infrastructures over the last several years. Disruptions in the Internet can lead to Web sites being unavailable or increase user-perceived latency. The unavailability could be due to DNS, failures in segments of the physical network cutting off thousands of users, or attacks. Prompt reactions to network-wide events can be facilitated by local or remote measurement and monitoring. Better yet, a distributed set of intercommunicating measurement and monitoring entities that react to events dynamically could go a long way to handle disruptions. We have designed and built ATMEN, a triggered measurement infrastructure to communicate and coordinate across various administrative entities. ATMEN nodes can trigger new measurements, query ongoing passive measurements or historical measurements stored on remote nodes, and coordinate the responses to make local decisions. ATMEN reduces wasted measurements by judiciously reusing measurements along three axes: spatial, temporal, and application. We describe the use of ATMEN for key Web applications such as performance based ranking of popular Web sites and availability of DNS servers on which most Web transactions are dependent. The evaluation of ATMEN is done using multiple network monitoring entities called Gigascopes installed across the USA, measurement data of a popular network application involving millions of users distributed across the Internet, and scores of clients to aid in gathering measurement information upon demand. Our results show that such a system can be built in a scalable fashion.
We offer the first large-scale analysis of Web traffic based on network flow data. Using data collected on the Internet2 network, we constructed a weighted bipartite client-server host graph containing more than 18 x 10{sup:6} vertices and 68 x 10{sup:6} edges valued by relative traffic flows. When considered as a traffic map of the World-Wide Web, the generated graph provides valuable information on the statistical patterns that characterize the global information flow on the Web. Statistical analysis shows that client-server connections and traffic flows exhibit heavy-tailed probability distributions lacking any typical scale. In particular, the absence of an intrinsic average in some of the distributions implies the absence of a prototypical scale appropriate for server design, Web-centric network design, or traffic modeling. The inspection of the amount of traffic handled by clients and servers and their number of connections highlights non-trivial correlations between information flow and patterns of connectivity as well as the presence of anomalous statistical patterns related to the behavior of users on the Web. The results presented here may impact considerably the modeling, scalability analysis, and behavioral study of Web applications.
In this paper, we study the media workload collected from a large number of commercial Web sites hosted by a major ISP and that collected from a large group of home users connected to the Internet via a well-known cable company. Some of our key findings are: (1) Surprisingly, the majority of media contents are still delivered via downloading from Web servers. (2) A substantial percentage of media downloading connections are aborted before completion due to the long waiting time. (3) A hybrid approach, pseudo streaming, is used by clients to imitate real streaming. (4) The mismatch between the downloading rate and the client playback speed in pseudo streaming is common, which either causes frequent playback delays to the clients, or unnecessary traffic to the Internet. (5) Compared with streaming, downloading and pseudo streaming are neither bandwidth efficient nor performance effective. To address this problem, we propose the design of AutoStream, an innovative system that can provide additional previewing and streaming services automatically for media objects hosted on standard Web sites in server farms at the client's will.
While many works have been devoted to service matchmaking and modeling nonfunctional properties, the problem of matching service requests to offers in an optimal way has not yet been extensively studied. In this paper we formalize three kinds of optimal service selection problems, based on different criteria. Then we study their complexity and implement solutions. We prove that one-time costs make the optimal selection problem computationally hard; in the absence of these costs the problem can be solved in polynomial time. We designed and implemented both exact and heuristic (suboptimal) algorithms for the hard case, and carried out a preliminary experimental evaluation with interesting results.
RDF is increasingly being used to represent metadata. RDF Site Summary (RSS) is an application of RDF on the Web that has considerably grown in popularity. However, the way RSS systems operate today does not scale well. In this paper we introduce G-ToPSS, a scalable publish/subscribe system for selective information dissemination. G-ToPSS is particularly well suited for applications that deal with large-volume content distribution from diverse sources. RSS is an instance of the content distribution problem. G-ToPSS allows use of ontology as a way to provide additional information about the data. Furthermore, in this paper we show how G-ToPSS can support RDFS class taxonomies. We have implemented and experimentally evaluated G-ToPSS and we provide results in the paper demonstrating its scalability compared to alternatives.
Continuous queries are used to monitor changes to time varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some function over distributed data items, for example, to determine when and whether (a) the traffic entering a highway from multiple feed roads will result in congestion in a thoroughfare or (b) the value of a stock portfolio exceeds a threshold. Using the standard Web infrastructure for these applications will increase the reach of the underlying information. But, since these queries involve data from multiple sources, with sources supporting standard HTTP (pull-based) interfaces, special query processing techniques are needed. Also, these applications often have the flexibility to tolerate some incoherency, i.e., some differences between the results reported to the user and that produced from the virtual database made up of the distributed data sources. In this paper, we develop and evaluate client-pull-based techniques for refreshing data so that the results of the queries over distributed data can be correctly reported, conforming to the limited incoherency acceptable to the users. We model as well as estimate the dynamics of the data items using a probabilistic approach based on Markov Chains. Depending on the dynamics of data we adapt the data refresh times to deliver query results with the desired coherency. The commonality of data needs of multiple queries is exploited to further reduce refresh overheads. Effectiveness of our approach is demonstrated using live sources of dynamic data: the number of refreshes it requires is (a) an order of magnitude less than what we would need if every potential update is pulled from the sources, and (b) comparable to the number of messages needed by an ideal algorithm, one that knows how to optimally refresh the data from distributed data sources. Our evaluations also bring out a very practical and attractive tradeoff property of pull based approaches, e.g., a small increase in tolerable incoherency leads to a large decrease in message overheads.
In this paper, we focus on the development of a framework for automatic metadata generation. The first step towards this framework is the definition of an Application Programmer Interface (API), which we call the Simple Indexing Interface (SII). The second step is the definition of a framework for implementation of the SII. Both steps are presented in some detail in this paper. We also report on empirical evaluation of the metadata that the SII and supporting framework generated in a real-life context.
PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with respect to changes in α was discovered to be useful in link-spam detection[21]. Moreover, an analytical justification of the value chosen for α is still missing. In this paper, we give the first mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closed-form formulae for PageRank derivatives of any order, and an extension of the Power Method that approximates them with convergence O (t{sup:k} α{sup:t}) for the k-th derivative. Finally, we show a tight connection between iterated computation and analytical behaviour by proving that the k-th iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank.
In contrast with the current Web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable Web search at the object level. We collect Web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. Traditional PageRank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. This paper introduces PopRank, a domain-independent object-level link analysis model to rank the objects within a specific domain. Specifically we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. Our experiments are done using 1 million CS papers, and the experimental results show that PopRank can achieve significantly better ranking results than naively applying PageRank on the object graph.
In this note we consider a simple reformulation of the traditional power iteration algorithm for computing the stationary distribution of a Markov chain. Rather than communicate their current probability values to their neighbors at each step, nodes instead communicate only changes in probability value. This reformulation enables a large degree of flexibility in the manner in which nodes update their values, leading to an array of optimizations and features, including faster convergence, efficient incremental updating, and a robust distributed implementation. While the spirit of many of these optimizations appear in previous literature, we observe several cases where this unification simplifies previous work, removing technical complications and extending their range of applicability. We implement and measure the performance of several optimizations on a sizable (34M node) web subgraph, seeing significant composite performance gains, especially for the case of incremental recomputation after changes to the web graph.
Experienced web users have strategies for information search and re-access that are not directly supported by web browsers or search engines. We studied how prevalent these strategies are and whether even experienced users have problems with searching and re-accessing information. With this aim, we conducted a survey with 236 experienced web users. The results showed that this group has frequently used key strategies (e.g., using several browser windows in parallel) that they find important, whereas some of the strategies that have been suggested in previous studies are clearly less important for them (e.g., including URLs on a webpage). In some aspects, such as query formulation, this group resembles less experienced web users. For instance, we found that most of the respondents had misconceptions about how their search engine handles queries, as well as other problems with information search and re-access. In addition to presenting the prevalence of the strategies and rationales for their use, we present concrete designs solutions and ideas for making the key strategies also available to less experienced users.
Focused Web browsing activities such as periodically looking up headline news, weather reports, etc., which require only selective fragments of particular Web pages, can be made more efficient for users of limited-display-size handheld mobile devices by delivering only the target fragments. Semantic bookmarks provide a robust conceptual framework for recording and retrieving such targeted content not only from the specific pages used in creating the bookmarks but also from any user-specified page with similar content semantics. This paper describes a technique for realizing semantic bookmarks by coupling machine learning with Web page segmentation to create a statistical model of the bookmarked content. These models are used to identify and retrieve the bookmarked content from Web pages that share a common content domain. In contrast to ontology-based approaches where semantic bookmarks are limited to available concepts in the ontology, the learning-based approach allows users to bookmark ad-hoc personalized semantic concepts to effectively target content that fits the limited display of handhelds. User evaluation measuring the effectiveness of a prototype implementation of learning-based semantic bookmarking at reducing browsing fatigue in handhelds is provided.
We present WebPod, a portable system that enables mobile users to use the same persistent, personalized web browsing session on any Internet-enabled device. No matter what computer is being used, WebPod provides a consistent browsing session, maintaining all of a user's plugins, bookmarks, browser web content, open browser windows, and browser configuration options and preferences. This is achieved by leveraging rapid improvements in capacity, cost, and size of portable storage devices. WebPod provides a virtualization and checkpoint/restart mechanism that decouples the browsing environment from the host, enabling web browsing sessions to be suspended to portable storage, carried around, and resumed from the storage device on another computer. WebPod virtualization also isolates web browsing sessions from the host, protecting the browsing privacy of the user and preventing malicious web content from damaging the host. We have implemented a Linux WebPod prototype and demonstrate its ability to quickly suspend and resume web browsing sessions, enabling a seamless web browsing experience for mobile users as they move among computers.
The Semantic Web consists of many RDF graphs nameable by URIs. This paper extends the syntax and semantics of RDF to cover such Named Graphs. This enables RDF statements that describe graphs, which is beneficial in many Semantic Web application areas. As a case study, we explore the application area of Semantic Web publishing: Named Graphs allow publishers to communicate assertional intent, and to sign their graphs; information consumers can evaluate specific graphs using task-specific trust policies, and act on information from those Named Graphs that they accept. Graphs are trusted depending on: their content; information about the graph; and the task the user is performing. The extension of RDF to Named Graphs provides a formally defined framework to be a foundation for the Semantic Web trust layer.
The Semantic Web languages RDFS and OWL have been around for some time now. However, the presence of these languages has not brought the breakthrough of the Semantic Web the creators of the languages had hoped for. OWL has a number of problems in the area of interoperability and usability in the context of many practical application scenarios which impede the connection to the Software Engineering and Database communities. In this paper we present OWL Flight, which is loosely based on OWL, but the semantics is grounded in Logic Programming rather than Description Logics, and it borrows the constraint-based modeling style common in databases. This results in different types of modeling primitives and enforces a different style of ontology modeling. We analyze the modeling paradigms of OWL DL and OWL Flight, as well as reasoning tasks supported by both languages. We argue that different applications on the Semantic Web require different styles of modeling and thus both types of languages are required for the Semantic Web.
As an increasingly large number of OWL ontologies become available on the Semantic Web and the descriptions in the ontologies become more complicated, finding the cause of errors becomes an extremely hard task even for experts. Existing ontology development environments provide some limited support, in conjunction with a reasoner, for detecting and diagnosing errors in OWL ontologies. Typically these are restricted to the mere detection of, for example, unsatisfiable concepts. We have integrated a number of simple debugging cues generated from our description logic reasoner, Pellet, in our hypertextual ontology development environment, Swoop. These cues, in conjunction with extensive undo/redo and Annotea based collaboration support in Swoop, significantly improve the OWL debugging experience, and point the way to more general improvements in the presentation of an ontology to new users.
To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provide more adequate information for similarity search than single-step neighborhoods.
We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to and/or snippet of a retrieved Web page. In addition, such a result page often also contains information irrelevant to the query, such as information related to the hosting site of the search engine and advertisements. In this paper, we present a technique for automatically producing wrappers that can be used to extract search result records from dynamically generated result pages returned by search engines. Automatic search result record extraction is very important for many applications that need to interact with search engines such as automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features on the result page as displayed on a browser and the HTML tag structures of the HTML source file of the result page. Experimental results indicate that this technique can achieve very high extraction accuracy.
We introduce a stricter Web community definition to overcome boundary ambiguity of a Web community defined by Flake, Lawrence and Giles [2], and consider the problem of finding communities that satisfy our definition. We discuss how to find such communities and hardness of this problem. We also propose Web page partitioning by equivalence relation defined using the class of communities of our definition. Though the problem of efficiently finding all communities of our definition is NP-complete, we propose an efficient method of finding a subclass of communities among the sets partitioned by each of n-1 cuts represented by a Gomory-Hu tree [10], and partitioning a Web graph by equivalence relation defined using the subclass. According to our preliminary experiments, partitioning by our method divided the pages retrieved by keyword search into several different categories to some extent.
This paper proposes an incremental maintenance algorithm that efficiently updates the materialized XPath/XSLT views defined using XPath expressions in XP{sup:([],*,//,vars)}. The algorithm consists of two processes. 1) The dynamic execution flow of an XSLT program is stored as an XT (XML Transformation) tree during the full transformation. 2) In response to a source XML data update, the impacted portions of the XT-tree are identified and maintained by partially re-evaluating the XSLT program. This paper discusses the XPath/XSLT features of incremental view maintenance for subtree insertion/deletion and applies them to the maintenance algorithm. Experiments show that the incremental maintenance algorithm outperforms full XML transformation algorithms by factors of up to 500.
As XQuery is gathering momentum as the standard query language for XML, there is a growing interest in using it as an integral part of the XML application development infrastructure. In that context, one question which is often raised is how well XQuery interoperates with other XML languages, and notably with XSLT. XQuery 1.0 [16] and XSLT 2.0 [7] share a lot in common: they share XPath 2.0 as a common sub-language and have the same expressiveness. However, they are based on fairly different programming paradigms. While XSLT has adopted a highly declarative template based approach, XQuery relies on a simpler, and more operational, functional approach. In this paper, we present an approach to compile XSLT 2.0 into XQuery 1.0, and a working implementation of that approach. The compilation rules explain how XSLT's template-based approach can be implemented using the functional approach of XQuery and underpins the tight connection between the two languages. The resulting compiler can be used to migrate a XSLT code base to XQuery, or to enable the use of X