The World Wide Web Conference is the global event that brings together key researchers, innovators, decision-makers, technologists, businesses, and standards bodies working to shape the Web. Organized by IW3C2 since 1994, the WWW conference is the annual opportunity for the international community to discuss and debate the evolution of the Web.
Topics in education are changing at an ever faster pace. ELearning resources tend to be more and more decentralized. Users increasingly need to be able to use the resources of the web. For this, they should have tools for finding and organizing information in a decentralized way. In this paper, we show how an ontology-based tool suite allows users to make the most of the resources available on the web.
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change. Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
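To make the content-shift measure concrete, the following Python sketch computes a TF-IDF cosine distance between two snapshots of a page. It is only an illustration under stated assumptions: whitespace tokenization, a smoothed idf variant, and the function names (`tfidf_vector`, `cosine_distance`, `content_shift`) are choices made here, not details taken from the paper.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Map each term to tf * idf, using a smoothed idf (an assumed variant)."""
    counts = Counter(tokens)
    return {
        term: count * math.log(1.0 + n_docs / doc_freq[term])
        for term, count in counts.items()
    }


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for empty snapshots
    return 1.0 - dot / (norm_u * norm_v)


def content_shift(old_text, new_text, corpus):
    """Cosine distance between the TF-IDF vectors of two page snapshots.

    `corpus` is a list of other documents used only to estimate
    document frequencies (a hypothetical stand-in for a real crawl).
    """
    docs = [text.lower().split() for text in corpus + [old_text, new_text]]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    old_vec = tfidf_vector(docs[-2], doc_freq, len(docs))
    new_vec = tfidf_vector(docs[-1], doc_freq, len(docs))
    return cosine_distance(old_vec, new_vec)
```

A distance near 0 means the two snapshots use essentially the same weighted vocabulary, while a distance of 1 means they share no terms at all.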
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. The paper describes preliminary experiments in which an instance of KnowItAll, running for four days on a single machine, was able to automatically extract 54,753 facts. KnowItAll associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes KnowItAll's architecture and reports on lessons learned for the design of large-scale information extraction systems.
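The trade-off between precision and recall that per-fact probabilities allow can be sketched as a simple threshold on those probabilities. The Python below is a generic illustration, not KnowItAll's actual mechanism; the function name and the example data are hypothetical.

```python
def precision_recall(facts, threshold):
    """Compute precision and recall when keeping facts at or above a threshold.

    `facts` is a list of (probability, is_correct) pairs, where
    `is_correct` would come from a manually judged sample.
    """
    accepted = [ok for prob, ok in facts if prob >= threshold]
    total_correct = sum(1 for _, ok in facts if ok)
    true_pos = sum(1 for ok in accepted if ok)
    # Conventions chosen here: an empty accepted set has precision 1.0,
    # and a sample with no correct facts has recall 1.0.
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / total_correct if total_correct else 1.0
    return precision, recall


# Hypothetical judged sample: (assigned probability, judged correct?)
sample = [(0.95, True), (0.9, True), (0.7, False), (0.6, True), (0.3, False)]
strict = precision_recall(sample, 0.8)   # fewer facts, all of them correct
lenient = precision_recall(sample, 0.5)  # more facts, one of them wrong
```

Raising the threshold keeps only high-confidence facts and favours precision; lowering it admits more facts and favours recall.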
This paper presents KnowledgeTree, an architecture for adaptive E-Learning based on distributed reusable intelligent learning activities. The goal of KnowledgeTree is to bridge the gap between the currently popular approach to Web-based education, which is centered on learning management systems, and the powerful but underused technologies in intelligent tutoring and adaptive hypermedia. This integrative architecture attempts to address both the component-based assembly of adaptive systems and teacher-level reusability.
EducaNext is an educational mediator created within the UNIVERSAL IST Project which supports both the exchange of reusable educational materials based on open standards and the collaboration of educators over the network in the realization of educational activities. The Isabel CSCW application is a group collaboration tool for the Internet supporting audience interconnection over the network, such as distributed classrooms, conferences or meetings. This paper describes the conclusions and feedback obtained from the integration of Isabel into EducaNext and its use for the realization of collaborative educational activities involving distributed classrooms, lectures or workshops, as well as the general conclusions obtained about the integration of synchronous collaboration applications into educational mediators.
We present a question answering (QA) system which learns how to detect and rank answer passages by analyzing questions and their answers (QA pairs) provided as training data. We built our system in only a few person-months using off-the-shelf components: a part-of-speech tagger, a shallow parser, a lexical network, and a few well-known supervised learning algorithms. In contrast, many of the top TREC QA systems are large group efforts, using customized ontologies, question classifiers, and highly tuned ranking functions. Our ease of deployment arises from using generic, trainable algorithms that exploit simple feature extractors on QA pairs. With TREC QA data, our system achieves mean reciprocal rank (MRR) that compares favorably with the best scores in recent years, and generalizes from one corpus to another. Our key technique is to recover, from the question, fragments of what might have been posed as a structured query, had a suitable schema been available. One fragment comprises selectors: tokens that are likely to appear (almost) unchanged in an answer passage. The other fragment contains question tokens which give clues about the answer type, and are expected to be replaced in the answer passage by tokens which specialize or instantiate the desired answer type. Selectors are like constants in where-clauses in relational queries, and answer types are like column names. We present new algorithms for locating selectors and answer type clues and using them in scoring passages with respect to a question.
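For readers unfamiliar with the evaluation metric, mean reciprocal rank averages, over all questions, the reciprocal of the rank at which the first correct answer passage appears. A minimal Python sketch follows; the function name and the convention of passing `None` for questions with no correct answer are assumptions made for illustration.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank over all questions.

    Each entry is the 1-based rank of the first correct passage for one
    question, or None when no correct passage was returned (scored as 0).
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)


# Hypothetical run over four questions: the correct passage is ranked
# first, second, missing, and fourth respectively.
score = mean_reciprocal_rank([1, 2, None, 4])
```

Here the score is (1 + 1/2 + 0 + 1/4) / 4 = 0.4375.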
Learning styles, as well as the best ways of responding with corresponding instructional strategies, have been intensively studied in the classical educational (classroom) setting. There is much less research on the application of learning styles in the new educational space created by the Web. Moreover, authoring applications are scarce, and they do not provide explicit choices and creation of instructional strategies for specific learning styles. The main objective of the research described in this paper is to provide the authors with a tool which will allow them to incorporate different learning styles in their adaptive educational hypermedia applications. In this way, we are creating a semantically significant interface between classical learning styles and instructional strategies and the modern field of adaptive educational hypermedia.
Recent observations through experiments that we have performed in current third generation wireless networks have revealed that the achieved throughput over wireless links varies widely depending on the application. In particular, the throughputs achieved by file transfer (FTP) and web browsing (HTTP) applications are quite different. The throughput achieved over an HTTP session is much lower than that achieved over an FTP session. The reason for the lower HTTP throughput is that the HTTP protocol is affected by the large Round-Trip Time (RTT) across wireless links. HTTP transfers require multiple TCP connections and DNS lookups before an HTTP page can be displayed. Each TCP connection requires several RTTs to fully open the TCP send window and each DNS lookup requires several RTTs before resolving the domain name to IP mapping. These TCP/DNS RTTs significantly degrade the performance of HTTP over wireless links. To overcome these problems, we have developed session level optimization techniques to enhance HTTP download mechanisms. These techniques (a) minimize the number of DNS lookups over the wireless link and (b) minimize the number of TCP connections opened by the browser. These optimizations bridge the mismatch caused by wireless links between application-level protocols (such as HTTP) and transport-level protocols (such as TCP). Our solutions do not require any client-side software and can be deployed transparently on a service provider network to provide a 30-50% decrease in end-to-end user perceived latency and a 50-100% increase in data throughput across wireless links for HTTP sessions.
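The cost of per-connection and per-lookup round trips can be illustrated with a back-of-the-envelope model. The Python sketch below is a deliberately simplified toy: it assumes all lookups and connection setups happen one after another, and every number in the example (RTT, counts, round trips per operation) is hypothetical rather than a measurement from the paper.

```python
def setup_delay_seconds(rtt, dns_lookups, tcp_connections,
                        rtts_per_dns, rtts_per_tcp):
    """Time spent on DNS and TCP round trips before data flows freely.

    Toy model: operations are strictly sequential, so the delay is the
    RTT multiplied by the total number of round trips.
    """
    round_trips = dns_lookups * rtts_per_dns + tcp_connections * rtts_per_tcp
    return rtt * round_trips


# Hypothetical page over a link with a 0.5 s RTT.
baseline = setup_delay_seconds(0.5, dns_lookups=4, tcp_connections=8,
                               rtts_per_dns=2, rtts_per_tcp=3)
# Same page after reducing lookups and connections crossing the wireless link.
optimized = setup_delay_seconds(0.5, dns_lookups=1, tcp_connections=2,
                                rtts_per_dns=2, rtts_per_tcp=3)
```

Under these made-up numbers the overhead drops from 16 s to 4 s, which shows why reducing the count of lookups and connections matters more as the RTT grows.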
The emerging standards for the publication of Web Services are focused on the specification of the static interfaces of the operations to be invoked, or on the service composition. Few efforts have been made to specify the interaction between a Web Service and the individual consumer, although this aspect is essential to successful service execution. In fact, while "one-shot" services may be invoked in a straightforward way, the invocation of services requiring complex interactions, where multiple messages are needed to complete the service, depends on the fact that the consumer respects the business logic of the Web Service. In this paper, we propose a framework for the server-side management of the interaction between a Web Service and its consumers. In our approach, the Web Service is in charge of assisting the consumer during the service invocation, by managing the interaction context and instructing the consumer about the operations that can be invoked and their actual parameters, at each step of the conversation. Our framework is based on the exchange of SOAP messages specifying the invocation of Java-based operations. Moreover, in order to support the interoperability with other software environments, the conversation flow specification is exported to a WSDL format that enables heterogeneous consumers to invoke the Web Service in a seamless way.
Previous work on understanding user web search behavior has focused on how people search and what they are searching for, but not why they are searching. In this paper, we describe a framework for understanding the underlying goals of user searches, and our experience in using the framework to manually classify queries from a web search engine. Our analysis suggests that so-called "navigational" searches are less prevalent than generally believed, while a previously unexplored "resource-seeking" goal may account for a large fraction of web searches. We also illustrate how this knowledge of user search goals might be used to improve future web search engines.
An increasingly large number of Web applications employ service objects such as Servlets to generate dynamic and personalized content. Existing caching infrastructures are not well suited for caching such content in mobile environments because of disconnection and weak connection. One possible approach to this problem is to replicate Web-related application logic to client devices. The challenges to this approach are to deal with client devices that exhibit huge divergence in resource availabilities, to support applications that have different data sharing and coherency requirements, and to accommodate the same application under different deployment environments. The Replet system targets these challenges. It uses client, server and application capability and preference information (CPI) to direct the replication of service objects to client devices: from the selection of a device for replication and populating the device with client-specific data, to choosing an appropriate replica to serve a given request and maintaining the desired state consistency among replicas. The Replet system exploits on-device replication to enable client-, server- and application-specific cost metrics for replica invocation and synchronization. We have implemented a prototype in the context of Servlet-based Web applications. Our experiment and simulation results demonstrate the viability and significant benefits of CPI-driven on-device service object replication.
Web services make information and software available programmatically via the Internet and may be used as building blocks for applications. A composite web service is one that is built using multiple component web services and is typically specified using a language such as BPEL4WS or WSIPL. Once its specification has been developed, the composite service may be orchestrated either in a centralized or in a decentralized fashion. Decentralized orchestration offers performance improvements in terms of increased throughput and scalability and lower response time. However, decentralized orchestration also brings additional complexity to the system in terms of error recovery and fault handling. Further, incorrect design of a decentralized system can lead to potential deadlock or non-optimal usage of system resources. This paper investigates build time and runtime issues related to decentralized orchestration of composite web services. We support our design decisions with performance results obtained on a decentralized setup using BPEL4WS to describe the composite web services and BPWS4J as the underlying runtime environment to orchestrate them.
Web applications are becoming increasingly popular for mobile wireless PDAs. However, web browsing on these systems can be quite slow. An alternative approach is handheld thin-client computing, in which the web browser and associated application logic run on a server, which then sends simple screen updates to the PDA for display. To assess the viability of this thin-client approach, we compare the web browsing performance of thin clients against fat clients that run the web browser locally on a PDA. Our results show that thin clients can provide better web browsing performance compared to fat clients, both in terms of speed and ability to correctly display web content. Surprisingly, thin clients are faster even when having to send more data over the network. We characterize and analyze different design choices in various thin-client systems and explain why these approaches can yield superior web browsing performance on mobile wireless PDAs.
A requirements analysis in the emerging field of Semantic Web Services (SWS) (see http://daml.org/services/swsl/requirements/) has identified four major areas of research: intelligent service discovery, automated contracting of services, process modeling, and service enactment. This paper deals with the intersection of two of these areas: process modeling as it pertains to automated contracting. Specifically, we propose a logic, called CTR-S, which captures the dynamic aspects of contracting for services. Since CTR-S is an extension of the classical first-order logic, it is well-suited to model the static aspects of contracting as well. A distinctive feature of contracting is that it involves two or more parties in a potentially adversarial situation. CTR-S is designed to model this adversarial situation through its novel model theory, which incorporates certain game-theoretic concepts. In addition to the model theory, we develop a proof theory for CTR-S and demonstrate the use of the logic for modeling and reasoning about Web service contracts.
This paper presents a novel approach of using web log data generated by course management systems (CMS) to help instructors become aware of what is happening in distance learning classes. Specifically, techniques from Information Visualization are used to graphically render complex, multidimensional student tracking data collected by CMS. A system, called CourseVis, illustrates the proposed approach. Graphical representations from the use of CourseVis to visualise data from a Java on-line distance course run with WebCT are presented. Findings from the evaluation of CourseVis are presented, and it is argued that CourseVis can help teachers become aware of some social, behavioural, and cognitive aspects related to distance learners. Using graphical representations of student tracking data, instructors can identify tendencies in their classes, or quickly discover individuals that need special attention.
XML has become one of the core technologies for contemporary business applications, especially web-based applications. To facilitate processing of diverse XML data, we propose an extensible, integrated XML processing architecture, the XML Virtual Machine (XVM), which connects XML data with their behaviors. At the same time, the XVM is also a framework for developing and deploying XML-based applications. Using component-based techniques, the XVM supports arbitrary granularity and provides a high degree of modularity and reusability. XVM components are dynamically loaded and composed during XML data processing. Using the XVM, both client-side and server-side XML applications can be developed and deployed in an integrated way. We also present an XML application container built on top of the XVM along with several sample applications to demonstrate the applicability of the XVM framework.
This paper describes one solution to the problem of how to select, sequence, and link Web resources into a coherent, focused organization for instruction that addresses a user's immediate and focused learning need. A system is described that automatically generates individualized learning paths from a repository of XML Web resources. Each Web resource has an XML Learning Object Metadata (LOM) description consisting of General, Educational, and Classification metadata. Dynamic assembly of these learning objects is based on the relative match of the learning object content and metadata to the learner's needs, preferences, context, and constraints. Learning objects are connected into coherent paths based on their LOM topic classifications and the proximity of these topics in a Resource Description Framework (RDF) graph. An instructional sequencing policy specifies how to arrange the objects on the path into a particular learning sequence. The system has been deployed and evaluated within a corporate setting.
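One plausible reading of "proximity of topics in a graph" is the number of edges on the shortest path between two topic nodes. The Python sketch below computes that hop count with breadth-first search over a plain adjacency dictionary. Both the proximity measure and the tiny topic graph are assumptions for illustration; the abstract does not specify how the system measures proximity.

```python
from collections import deque


def topic_distance(graph, start, goal):
    """Hop count between two topics via breadth-first search.

    `graph` maps each topic to an iterable of neighbouring topics.
    Returns None when the goal cannot be reached from the start.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        topic, dist = queue.popleft()
        for neighbour in graph.get(topic, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical topic graph (undirected, so edges are listed both ways).
topics = {
    "xml": ["xpath", "xslt"],
    "xpath": ["xml", "xquery"],
    "xslt": ["xml"],
    "xquery": ["xpath"],
    "java": [],
}
```

With this graph, `topic_distance(topics, "xml", "xquery")` is 2, so learning objects on those two topics would be considered closer than objects on unconnected topics such as "java".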
In the past few years, a number of constraint languages for XML documents have been proposed. They are cumulatively called schema languages or validation languages and they comprise, among others, DTD, XML Schema, RELAX NG, Schematron, DSD, and xlinkit. One major point of discrimination among schema languages is the support of co-constraints, or co-occurrence constraints, e.g., requiring that attribute A is present if and only if attribute B is (or is not) present in the same element. Although there is no way in XML Schema to express these requirements, they are in fact frequently used in many XML document types, usually only expressed in plain human-readable text, and validated by means of special code modules by the relevant applications. In this paper we propose SchemaPath, a light extension of XML Schema to handle conditional constraints on XML documents. Two new constructs have been added to XML Schema: conditions -- based on XPath patterns -- on type assignments for elements and attributes; and a new simple type, xsd:error, for the direct expression of negative constraints (e.g. it is prohibited for attribute A to be present if attribute B is also present). A proof-of-concept implementation is provided. A Web interface is publicly accessible for experiments and assessments of the real expressiveness of the proposed extension.
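The kind of "special code module" that applications write today to check a co-occurrence constraint can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. This is not SchemaPath syntax; it only illustrates the "A is present if and only if B is present" rule, and the element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET


def violates_iff(element, attr_a, attr_b):
    """True when exactly one of the two attributes is present."""
    return (attr_a in element.attrib) != (attr_b in element.attrib)


def find_violations(xml_text, tag, attr_a, attr_b):
    """Return all elements named `tag` that break the iff constraint."""
    root = ET.fromstring(xml_text)
    return [el for el in root.iter(tag) if violates_iff(el, attr_a, attr_b)]


# Hypothetical document: the second <item> has "a" without "b".
doc = '<items><item a="1" b="2"/><item a="1"/><item/></items>'
bad = find_violations(doc, "item", "a", "b")
```

Only the second `item` is reported; the first carries both attributes and the third carries neither, so both satisfy the constraint.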
Personalized support for learners becomes even more important when e-Learning takes place in open and dynamic learning and information networks. This paper shows how to realize personalized learning support in distributed learning environments based on Semantic Web technologies. Our approach fills the existing gap between current adaptive educational systems with well-established personalization functionality, and open, dynamic learning repository networks. We propose a service-based architecture for establishing personalized e-Learning, where personalization functionality is provided by various web-services. A Personal Learning Assistant integrates personalization services and other supporting services, and provides the personalized access to learning resources in an e-Learning network.
Recently, active behavior has received attention in the XML field as a way to react automatically to events that occur. Aside from proprietary approaches for enriching XML with active behavior, the W3C standardized the Document Object Model (DOM) Event Module for the detection of events in XML documents. When using any of these approaches, however, it is often impossible to decide which event to react upon because not a single event but a combination of multiple events, i.e., a composite event, determines a situation to react upon. The paper presents the first approach for detecting composite events in XML documents by addressing the peculiarities of XML events which are caused by their hierarchical order in addition to their temporal order. It also provides for the detection of satisfied multiplicity constraints defined by XML schemas. Thereby the approach enables applications operating on XML documents to react to composite events which have richer semantics.
Content delivery networks have evolved beyond traditional distributed caching. With services such as Akamai's EdgeComputing it is now possible to deploy and run enterprise business Web applications on a globally distributed computing platform, to provide subsecond response time to end users anywhere in the world. Additionally, this distributed application platform provides high levels of fault-tolerance and scalability on-demand to meet virtually any need. Application resources can be provisioned dynamically in seconds to respond automatically to changes in load on a given application. In some cases, an application can be deployed completely on the global platform without any central enterprise infrastructure. Other applications can require centralizing core business logic and transactional databases at the enterprise data center while the presentation layer and some business logic and database functionality move onto the edge platform. Implementing a distributed application service on the Internet's edge requires overcoming numerous challenges, including sandboxing for security, distributed load-balancing and resource management, accounting and billing, deployment, testing, debugging, and monitoring. Our current implementation of Akamai EdgeComputing supports application programming platforms such as Java 2 Enterprise Edition (J2EE) and Microsoft's .NET Framework, in large part because they make it easier to address some of these challenges. In the near future we will also support environments for other application languages such as C, PHP, and Perl.
Many Web information services utilize techniques of information extraction (IE) to collect important facts from the Web. To create more advanced services, one possible method is to discover thematic information from the collected facts through text classification. However, most conventional text classification techniques rely on manually labelled corpora and are thus ill-suited to cooperate with Web information services with open domains. In this work, we present a system named LiveClassifier that can automatically train classifiers through Web corpora based on user-defined topic hierarchies. Due to its flexibility and convenience, LiveClassifier can be easily adapted for various purposes. New Web information services can be created to fully exploit it; human users can use it to create classifiers for their personal applications. The effectiveness of classifiers created by LiveClassifier is well supported by empirical evidence.
The practical experience of RosettaNet in using Web technologies for B2B integration illustrates the transformative power of Web technologies and also highlights challenges for the future. This paper provides an overview of RosettaNet technical standards and discusses the lessons learned from the standardization efforts, in particular, what works and what doesn't. This paper also describes the effort to increase automation of B2B software integration, and thereby to reduce cost.
Interoperability is one of the main issues in creating a networked system of repositories. The eduSource project, in its holistic approach to building a network of learning object repositories in Canada, is implementing an open network for learning services. Its openness is supported by a communication protocol called the eduSource Communications Layer (ECL) which closely implements the IMS Digital Repository Interoperability (DRI) specification and architecture. The ECL in conjunction with connection middleware enables any service provider to join the network. EduSource is open to external initiatives as it explicitly supports an extensible bridging mechanism between eduSource and other major initiatives. This paper discusses interoperability in general and then focuses on the design of ECL as an implementation of IMS DRI with supporting infrastructure and middleware. The eduSource implementation is at a mature stage of its development and is being deployed in different settings with different partners. Two applications used in evaluating our approach are described: a gateway connecting eduSource and the NSDL initiative, and a federated search connecting eduSource, EdNA and SMETE.
We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of Web pages can be much improved if, instead of looking at their textual content, we consider each link's URL and the visual placement of those links on a referring page. These features are unusual: rather than being scalar measurements like word counts they are tree structured -- describing the position of the item in a tree. We develop a model and algorithm for machine learning using such tree-structured features. We apply our methods in automated tools for recognizing and blocking Web advertisements and for recommending "interesting" news stories to a reader. Experiments show that our algorithms are both faster and more accurate than those based on the text content of Web documents.
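To illustrate what a tree-structured feature might look like for a URL, the Python sketch below turns a URL into a root-to-leaf path of nodes (domain labels from most to least general, then path segments) and lists every prefix of that path as a candidate feature. This encoding is an assumption made for illustration; the abstract does not describe the paper's actual representation or learning algorithm.

```python
from urllib.parse import urlsplit


def url_tree_path(url):
    """Return the URL as a list of tree nodes, most general first."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Reverse the host labels so the top-level domain is nearest the root.
    nodes = list(reversed(host.split("."))) if host else []
    nodes += [segment for segment in parts.path.split("/") if segment]
    return nodes


def prefix_features(url):
    """Every ancestor of the URL's leaf node, written as a joined path."""
    nodes = url_tree_path(url)
    return ["/".join(nodes[:i]) for i in range(1, len(nodes) + 1)]


# Hypothetical advertisement URL.
features = prefix_features("http://ads.example.com/banners/top.gif")
```

Because two URLs under the same host or directory share their leading prefixes, a learner can generalize from one known advertisement URL to its siblings in the tree.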
We present enhancements for UDDI / DAML-S registries allowing cooperative discovery and selection of Web services with a focus on personalization. To find the most useful service in each instance of a request, not only explicit parameters of the request have to be matched against the service offers. User preferences or implicit assumptions of a user with respect to common knowledge in a certain domain also have to be considered to improve the quality of service provisioning. In the area of Web services the notion of service ontologies together with cooperative answering techniques can take a lot of this responsibility. However, without quality assessments for the relaxation of service requests and queries a personalized service discovery and selection is virtually impossible. This paper focuses on assessing the semantic meaning of query relaxation plans over multiple conceptual views of the service ontology, each one representing a soft query constraint of the user request. Our focus is on the question of what constitutes a minimum amount of necessary relaxation to answer each individual request in a cooperative manner. Incorporating such assessments as early as possible, we propose to integrate ontology-based discovery directly into UDDI directories or query facilities in service provisioning portals. Using the quality assessments presented here, this integration promises to propel today's Web services towards an intuitive user-centered service provisioning.
Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines have on the popularity evolution of Web pages. For example, given that search engines return "currently popular" pages at the top of search results, are we somehow penalizing newly created pages that are not very well known yet? Are popular pages getting even more popular and new pages completely ignored? We first show that this unfortunate trend indeed exists on the Web through an experimental study based on real Web data. We then analytically estimate how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results. Our result shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.
Previous work shows that a web page can be partitioned into multiple segments or blocks, and often the importance of those blocks in a page is not equivalent. Also, it has been proven that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. However, no uniform approach and model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view about the importance of blocks in web pages. In this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. We define the block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model to assign importance to different segments in the web page. In our experiments, the best model can achieve the performance with Micro-F1 79% and
Service-oriented architectures (SOA) will provide the basis of the next generation of distributed software systems, and have already gained enormous traction in the industry through an XML-based instantiation, Web services. A central aspect of SOAs is the looser coupling between applications (services) that is achieved when services publish their functional and non-functional behavioral characteristics in a standardized, machine readable format. In this paper we argue that in the basic SOA model access to metadata is too static and results in inflexible interactions between requesters and providers. We propose specific extensions to the SOA model to allow service providers and requestors to dynamically expose and negotiate their public behavior, resulting in the ability to specialize and optimize the middleware supporting an interaction. We introduce a middleware architecture supporting this extended SOA functionality, and describe a conformant implementation based on standard Web services middleware. Finally, we demonstrate the advantages of this approach with a detailed real world scenario.
Multimodal interfaces are becoming increasingly ubiquitous with the advent of mobile devices, accessibility considerations, and novel software technologies that combine diverse interaction media. In addition to improving access and delivery capabilities, such interfaces enable flexible and personalized dialogs with websites, much like a conversation between humans. In this paper, we present a software framework for multimodal web interaction management that supports mixed-initiative dialogs between users and websites. A mixed-initiative dialog is one where the user and the website take turns changing the flow of interaction. The framework supports the functional specification and realization of such dialogs using staging transformations -- a theory for representing and reasoning about dialogs based on partial input. It supports multiple interaction interfaces, and offers sessioning, caching, and co-ordination functions through the use of an interaction manager. Two case studies are presented to illustrate the promise of this approach.
The Visual Software Circuit Board (VSCB) platform supports a component based development methodology towards the development of software systems. The circuit board design techniques and methodologies have evolved for electronic device and component engineering for decades. The circuit board approach, now applied for software systems and applications, makes the component based development process easy to visualize and comprehend. This paper describes the VSCB based design methodology with a specific focus on usage of VSCB for web application engineering.
It is a tedious and cumbersome process to directly update a WML document for the wireless Web because its content is composed of both data and presentation. Thus, XML is used to handle the data while its XSLT stylesheet is used to extract and format the data for presentation. However, different stylesheets have to be used for different devices. An efficient and systematic method based on the idea of generating two separate sets of rules corresponding to the content-extracting and formatting parts of the stylesheet is described in this paper. The data extraction part is constructed from content rules while the formatting part is constructed from presentation rules. They are then combined to form a stylesheet by an XSLT generator. A large number of stylesheets corresponding to different devices and a number of standard DTD documents or XML schemas can be generated in this way and stored in the pool during the application setup stage. They will be individually selected from the pool by an XSLT engine to produce different WML documents for different devices at run time.
The Semantic Web relies on the complex interaction of several technologies involving ontologies. Therefore, sophisticated Semantic Web applications typically comprise more than one software module. Instead of coming up with proprietary solutions, developers should be able to rely on a generic infrastructure for application development in this context. We call such an infrastructure an Application Server for the Semantic Web; its design and development are based on existing Application Servers. However, we apply and augment their underlying concepts for use in the Semantic Web and integrate semantic technology within the server itself. We provide a short overview of requirements and design issues of such a server and present our implementation and ongoing work, KAON SERVER.
In this poster we introduce ProThes, a pilot meta-search engine (MSE) for a specific application domain. ProThes combines three approaches: meta-search, a graphical user interface (GUI) for query specification, and thesaurus-based query techniques. ProThes attempts to employ domain-specific knowledge, which is represented by both a conceptual thesaurus and results ranking heuristics. Since the knowledge representation is separated from the MSE core, adjusting the system to a specific domain is trouble free. The thesaurus allows for manual query building and automatic query techniques. This poster outlines the overall system architecture, thesaurus representation format, and query operations. ProThes is implemented on the J2EE platform as a Web service.
Collaborative Filtering (CF) has proved to be one of the most successful techniques in recommendation systems in recent years. However, traditional centralized CF systems scale poorly, because their computational complexity increases quickly in both time and space as the number of records in the user database grows. In this paper, we propose a decentralized CF algorithm, called PipeCF, based on the distributed hash table (DHT) method. We also propose two novel approaches to improve the scalability and prediction accuracy of the DHT-based CF algorithm. The experimental data show that our DHT-based CF system has better prediction accuracy, efficiency and scalability than traditional CF systems.
The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. This situation is due mostly to a lack of formal definition of separation and fear that enforcing separation emasculates a template's power. I show that not only is strict separation a worthy design principle, but that we can enforce separation while providing a potent template engine. I demonstrate my StringTemplate engine, used to build jGuru.com and other commercial sites, at work solving some nontrivial generational tasks. My goal is to formalize the study of template engines, thus, providing a common nomenclature, a means of classifying template generational power, and a way to leverage interesting results from formal language theory. I classify three types of restricted templates analogous to Chomsky's type 1..3 grammar classes and formally define separation including the rules that embody separation. Because this paper provides a clear definition of model-view separation, template engine designers may no longer blindly claim enforcement of separation. Moreover, given theoretical arguments and empirical evidence, programmers no longer have an excuse to entangle model and view.
The digital generation of a story in which users have influence over the narrative is emerging as an exciting example of computer-based interactive entertainment. Interactive storytelling has existed in non-digital versions for thousands of years, but with the advent of the Web the demand for enabling distributed cyberdrama generation is becoming increasingly common. To govern the complexity stemming from the distributed generation of complex plots, we have devised an event synchronization service that may be exploited to support the distribution of interactive storytelling activities over the Web. The main novelty of our approach is that the semantics of the cyberdrama is exploited to discard obsolete events. This speeds up drama generation, thus enabling greater interactivity among dispersed players.
The current discussion about a future Semantic Web trust architecture is focused on reputational trust mechanisms based on explicit trust ratings. What is often overlooked is the fact that, besides ratings, huge parts of the application-specific data published on the Semantic Web are also trust relevant and therefore can be used for flexible, fine-grained trust evaluations. In this poster we propose the usage of context- and content-based trust mechanisms and outline a trust architecture which allows the formulation of subjective and task-specific trust policies as a combination of reputation-, context- and content-based trust mechanisms.
Interoperability has been one of the major problems of e-commerce since its inception. A number of B2B standards such as ebXML, UDDI, RosettaNet, and xCBL have emerged recently to solve the interoperability problem. Currently, there exist many B2B standards, each providing competing and complementary solutions to B2B interoperability. There is therefore a need to serve implementations of these standards from a single, central store to ease their use and management. This paper presents EIOP, an E-commerce Interoperability Platform. EIOP is designed to provide a central store for implementations of e-commerce specifications so that these implementations can be used and configured from a single, central point. It defines the term EIOP Component, which corresponds to plug&play e-commerce applications that are stored in the EIOP.
Defining dependency models is sometimes an easier, more intuitive way of representing an ontology than defining reactive rules directly, as it provides a higher level of abstraction. We briefly introduce the capabilities of the ADI (Active Dependency Integration) model, emphasizing new developments: 1. Support for automatic instantiation of dependencies from an abstract definition that expresses a general dependency in the ontology, namely a "template". 2. Inference of rules for dynamic dependency models where dependencies and entities may be inserted, deleted, and updated. We use the eTrade example to illustrate these capabilities.
In this paper, we present a localized Shortest Path Tree (SPT) based algorithm for constructing a sub-network with the minimum-energy property for a given wireless ad hoc network. Each mobile node determines its own transmission power based only on its local information. The proposed algorithm constructs local shortest path trees from the unit disk graph. The performance improvements of our algorithm are demonstrated through simulations.
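A minimal sketch of the localized idea in the abstract above, under stated assumptions: the link cost is taken to be distance raised to a path-loss exponent `alpha`, and a node keeps only those neighbours that are the first hop of some path in its local shortest path tree. The function name, parameters, and cost model are illustrative choices, not details taken from the paper.

```python
import heapq
import math

def local_spt_neighbors(node, positions, radio_range=1.0, alpha=2.0):
    """Return the neighbours `node` keeps after building a local SPT.

    positions: dict mapping node id -> (x, y).
    Assumption: the energy cost of a link of length d is d ** alpha.
    """
    def dist(a, b):
        return math.dist(positions[a], positions[b])

    # Local view: the node itself plus every node within radio range.
    local = [n for n in positions if n == node or dist(node, n) <= radio_range]

    # Dijkstra restricted to the local view, remembering each first hop.
    best = {node: 0.0}
    first_hop = {}
    heap = [(0.0, node, None)]
    while heap:
        cost, u, hop = heapq.heappop(heap)
        if cost > best.get(u, math.inf):
            continue  # stale queue entry
        for v in local:
            if v == u or dist(u, v) > radio_range:
                continue
            new_cost = cost + dist(u, v) ** alpha
            if new_cost < best.get(v, math.inf):
                best[v] = new_cost
                first_hop[v] = v if u == node else hop
                heapq.heappush(heap, (new_cost, v, first_hop[v]))

    # Only direct links that lie on the local tree are retained.
    return sorted(set(first_hop.values()))
```

With three collinear nodes spaced 0.5 apart and `alpha=2.0`, relaying through the middle node costs less than the direct link, so the end node keeps only the middle node as a neighbour and can lower its transmission power accordingly.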
There exist many portal servers that support the construction of "My" portals, that is, portals that allow the user to have one or more personal pages composed of a number of personalizable services. The main drawback of current portal servers is their lack of generality and adaptability. This paper presents the design of MyPersonalizer, a J2EE-based framework for engineering My portals. The framework is structured according to the Model-View-Controller and Layers architectural patterns, providing generic, adaptable model and controller layers that implement the typical use cases of a My portal. MyPersonalizer allows for a good separation of roles in the development team: graphical designers (without programming skills) develop the portal view by writing JSP pages, while software engineers implement service plugins and specify framework configuration.
Enterprises today wish to manage their IT resources so as to optimize business objectives, such as income, rather than IT metrics, such as response times. Therefore, we introduce a new paradigm, which focuses on such business objective oriented resource management. Additionally, we define a general simulation-based autonomous process enabling such optimizations, and describe a case study, demonstrating the usefulness of such a process.
Nowadays, the leading e-learning platforms are converging towards standardization. This paper presents an extension to the SCORM, today's most widely acclaimed e-learning standard, enabling the modelling of course related entities that surround learning objects and content aggregations, therefore increasing the standard's modelling scope and allowing for gains in efficiency in knowledge dissemination. A prototype is being implemented and tested on VIANET, an original e-learning platform with extensible support for the SCORM.
Conventionally, Web pages have been recognized as documents described by HTML. Image data, such as photographs, logos, maps, illustrations, and decorated text, have been treated as sub-components of Web documents. However, we can alternatively recognize all Web pages as images on the screen. When a Web page is treated as an image, its HTML data is considered to be metadata which describes the image content. Taking such a viewpoint, we propose a new image-based hypermedia which we call continuous web. In our model, there is no distinction between Web images and other images such as photographs. Regarding everything on the Web as images leads us to consider a new style of browsing and navigating. We use the term scape-oriented browsing. We define a scape as a collection of continuously accumulated images. For example, whenever we walk in the real world, we can perceive and remember various forms of information through a scape process. Here, we describe new methods for scape-oriented browsing, such as see-through anchors, parallel navigation, and peripheral scape presentation. We have designed and implemented a prototype system based on our model. Our system offers continuous browsing and navigation to users. We explain our concepts and discuss the effectiveness and potential of this approach.
Web Information Systems (WIS) support the process of retrieving information from sources on the Web and of presenting them as a hypermedia presentation. Most WIS design methodologies focus on the engineering of the abstract navigation (hyperlinks). The actual presentation generation is less supported. Hera is one of the few WIS methodologies that offer a tool for presentation generation (HPG). The HPG transforms RDF data obtained as the result of a query into a Web presentation suited to the user (in HTML or WML).
Given location information on digital photographs, we can automatically generate an abundance of photo-related metadata using off-the-shelf and web-based data sources. These metadata can serve as additional memory cues and filters when browsing a personal or global collection of photos.
This paper investigates how the vision of the Semantic Web can be carried over to the realm of email. We introduce a general notion of semantic email, in which an email message consists of an RDF query or update coupled with corresponding explanatory text. Semantic email opens the door to a wide range of automated, email-mediated applications with formally guaranteed properties. In particular, this paper introduces a broad class of semantic email processes. For example, consider the process of sending an email to a program committee asking who will attend the PC dinner, automatically collecting the responses, and tallying them up. We define both logical and decision-theoretic models where an email process is modeled as a set of updates to a data set on which we specify goals via certain constraints or utilities. We then describe a set of inference problems that arise while trying to satisfy these goals and analyze their computational tractability. In particular, we show that for the logical model it is possible to automatically infer which email responses are acceptable w.r.t. a set of constraints in polynomial time, and for the decision-theoretic model it is possible to compute the optimal message-handling policy in polynomial time. Finally, we discuss our publicly available implementation of semantic email and outline research challenges in this realm.
An active e-course is a self-representable and self-organizable document mechanism with a flexible structure. The kernel of the active e-course is to organize learning materials into a "concept space" rather than a "page space". Besides a highly interactive service, it supports adaptive learning by dynamically selecting, organizing, and presenting the learning materials for different students. During the learning process, it also provides assessments of students' learning performance and gives suggestions to guide them in further learning. We have implemented an authoring tool and a course prototype to support constructivist learning.
When humans guess the content of a web page, not only the text on the page but also its appearance is an important factor. However, there have been few studies on the relationship between the content and visual appearance of a web page. We investigated this relationship, especially between web content and color use, and found a tendency to use particular colors for some kinds of content pages. We think this result opens the way to estimating web content using color information.
In this paper, we propose a novel multicast streaming protocol for overlay networks, called Gossip Based Streaming (GBS). In GBS, streaming content does not come from a single upstream source but is delivered to a client from several sources. Though GBS is similar to existing gossip protocols, our design addresses the unique requirements of streaming, such as continuous playback. Preliminary results show that GBS performs much better in dynamic user environments.
We propose a system called "Adaptation Anywhere & Anytime (A3)", which is a framework for making web sites/applications adaptable to users' needs or interests, and we describe the implementation of a web site on A3 using XSLT. Web sites/applications built on A3 construct user ontologies for each user automatically and share them between sites/applications. Each site/application uses the user ontology to select an appropriate resource for the user and to present such resources in a suitable form. A3 also offers a method for constructing adaptable web sites using XSLT, so authors of web sites can easily make their sites adaptable.
The web graph follows the power law distribution and has a hierarchical structure. But neither the PageRank algorithm nor any of its improvements leverages these attributes. In this paper, we propose a novel link analysis algorithm, "the PowerRank algorithm", which makes use of the power law distribution attribute and the hierarchical structure of the web graph. The algorithm consists of two parts. In the first part, special treatment is applied to the web pages with low "importance" scores. In the second part, the global "importance" score for each web page is obtained by combining those scores together. Our experimental results show that: 1) The PowerRank algorithm computes 10%-30% faster than the PageRank algorithm. 2) The top web pages under the PowerRank algorithm remain similar to those under the PageRank algorithm.
Two important architectural choices underlie the success of the Web: numerous, independently operated servers speak a common protocol, and a single type of client, the Web browser, provides point-and-click access to the content and services on these decentralized servers. However, because HTML marries content and presentation into a single representation, end users are often stuck with inappropriate choices made by the Web site designer of how to work with and view the content. RDF metadata on the Semantic Web does not have this limitation: users can gain direct access to information and control over how it is presented. This principle forms the basis for our Semantic Web browser, an end-user application that automatically locates metadata and assembles point-and-click interfaces from a combination of relevant information, ontological specifications, and presentation knowledge, all described in RDF and retrieved dynamically from the Semantic Web. Because data and services are accessed directly through a standalone client and not through a central point of access (e.g., a portal), new content and services can be consumed as soon as they become available. In this way we take advantage of an important sociological force that encourages the production of new Semantic Web content while remaining faithful to the decentralized nature of the Web.
In this paper, we propose a Web based information sharing system called the Proxy Agent-based Information Sharing (PAIS). We also developed a writable Web mechanism called Web browser-based Direct Editing (Wedit), that is a major component of PAIS. Wedit enables public users to effectively edit HTML text on an existing Web browser. Since Wedit was developed with conventional technologies, users quickly learn how to use it. PAIS is implemented by using Wedit and a proxy agent. PAIS enables users to share information via Web pages using Wedit. The proxy agent maintains users' editing data. The agent autonomously sends its user's modification data to other agents in the same community. In PAIS, certain confidential information in the community is not publicly shared by using the proxy agent.
This paper describes a query algebra for queries over XML p2p databases that provides explicit mechanisms for modeling data dissemination, replication constraints, and for capturing the transient nature of data and replicas.
The development of information and communication technologies and the expansion of the Internet mean that, nowadays, there are huge amounts of information available via these emergent media. A number of content management systems have appeared which aim to support the management of these large amounts of content. Most of these systems do not support collaboration among several distributed sources of managed content. In this paper we present a proposal for an architecture, Infoflex, for the efficient and flexible management of distributed content using Next Generation Web Technologies: Web Services and Semantic Web facilities.
Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can hint at the category of the resource. This paper explores the use of URLs for webpage categorization via a two-phase pipeline of word segmentation/expansion and classification. We quantify its performance against document-based methods, which require the retrieval of the source document.
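The segmentation phase mentioned in the abstract above can be illustrated with a small sketch. It assumes a dictionary-based splitter that prefers the fewest words; the vocabulary, function names, and tie-breaking rule are hypothetical and not taken from the paper, and the expansion and classification phases are left out.

```python
import re

def segment(token, vocabulary):
    """Split `token` into the fewest vocabulary words, or None if impossible."""
    best = [None] * (len(token) + 1)  # best[i] = best split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]

def url_features(url, vocabulary):
    """Turn a URL into word features without fetching the page itself."""
    features = []
    for token in re.split(r"[^a-z]+", url.lower()):
        if not token:
            continue
        # Fall back to the raw token when no dictionary split exists.
        features.extend(segment(token, vocabulary) or [token])
    return features
```

The resulting word list could then be passed to any standard text classifier, which is the part a document-based method would instead feed with the retrieved page content.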
In this paper we describe a practical framework for studying the navigational behavior of the users of an e-learning environment integrated in a virtual campus. The students navigate through the web based virtual campus, interacting with learning resources which are structured following the SCORM e-learning standard. Our main goal is to design a usage mining tool for analyzing such user navigational behavior and for extracting relevant information that can be used to validate several aspects related to virtual campus design and usability, but also to determine the optimal scheduling for each course depending on user profile. We intend to extend the sequencing capabilities of the SCORM standard to include the concept of recommended itinerary, by combining teachers' expertise with learned experience acquired by system usage analysis.
The goal of this research is the improvement of browsing voluminous InkML data in two areas: ease of rendering continuous ink-flow for replay-browsing, and ease of random access navigation in eLearning domains. The notion of real-time random access navigation in ink documents has not yet been fully exploited. Users of existing eLearning browsers are restricted to viewing static annotated slides that are inferior in quality when compared to actively replaying the same slides with sequenced ink-flow of the annotated freehand writings. We are developing a tool to investigate ways of managing massive InkML data for efficient "active visible scrolling" of recorded freehand writings in ink documents. This work will also develop and evaluate new post-processing techniques that take advantage of the relationship between ink volumes and active-rendering times for real-time random access navigation.
The Web Ontology Language (OWL) defines three classes of documents: Lite, DL, and Full. All RDF/XML documents are OWL Full documents, some OWL Full documents are also OWL DL documents, and some OWL DL documents are also OWL Lite documents. This paper discusses parsing and species recognition -- that is, the process of determining whether a given document falls into the OWL Lite, DL or Full class. We describe two alternative approaches to this task, one based on abstract syntax trees, the other on RDF triples, and compare their key characteristics.
Service matchmaking and composition has recently drawn increasing attention in the research community. Most existing algorithms construct chains of services based on exact matches of input/output types. However, this does not work when the available services only cover a part of the range of the input type. We present an algorithm that also allows partial matches and composes them using switches that decide on the required service at runtime based on the actual data type. We report experiments on randomly generated composition problems that show that using partial matches can decrease the failure rate of the integration algorithm using only complete matches by up to 7 times with no increase in the number of directory accesses required. This shows that composition with partial matches is an essential and useful element of web service composition.
The Semantic Web is challenged by the URI meaning issues that arise from putting ontologies in open and distributed environments. In an attempt to clarify some of these issues, this paper proposes a new approach to interpreting distributed ontologies. It is built on top of local models semantics and extends it to deal with URI sharing by harmonizing the local models via agreement on vocabulary provenance. The commitment relationship is presented to allow URI sharing between ontologies with richer semantics.
This poster presents an overview of the work on an interaction manager of a platform for multimodal applications in 2.5G and 3G mobile phone networks and WLAN environments. The poster describes the requirements for the interaction manager (IM), its tasks and the resulting structure. We examine the W3C's definition of an interaction manager and compare it to our implementation, which accomplishes some additional tasks.
The future Web can be imagined as a life network consisting of resource nodes and semantic relationship links between them. Any node has a life span from birth -- adding it to the network -- to death -- removing it from the network. Through establishing and investigating two types of models for such a network, we obtain the same scale free distribution of semantic links. Simulations and comparisons validate the rationality of the proposed models.
We describe a strategy to support the semantic annotation of contested knowledge, in the context of the Scholarly Ontologies project, which aims at building a network of interpretations enriching a corpus of scholarly papers. To model such knowledge, which does not have 'right' and 'wrong' values, we are building on the notion of active recommendations as a means to spark annotators' interest. We finally argue for a different approach to the evaluation of its impact.
This paper presents a method for admission control and request scheduling for multi-tiered e-commerce Web sites, achieving both stable behavior during overload and improved response times. Our method externally observes execution costs of requests online, distinguishing different request types, and performs overload protection and preferential scheduling using relatively simple measurements and a straightforward control mechanism. Unlike previous proposals, which require extensive changes to the server or operating system, our method requires no modifications to the host O.S., Web server, application server or database. Since our method is external, it can be implemented in a proxy. We present such an implementation, called Gatekeeper, using it with standard software components on the Linux operating system. We evaluate the proxy using the industry standard TPC-W workload generator in a typical three-tiered e-commerce environment. We show consistent performance during overload and throughput increases of up to 10 percent. Response time improves by up to a factor of 14, with only a 15 percent penalty to large jobs.
An important obstacle to the success of the Semantic Web is that establishing semantic relationships is labor-intensive. This paper proposes an approach to automatically discovering semantic relationships for constructing the semantic link network. The basic premise of this work is that the semantics of a web page can be reflected by a set of keywords, and the semantic relationship between two web pages can be determined by the semantic relationship between their keyword sets. The approach adopts data mining algorithms to discover the semantic relationships between keyword sets, and then uses deductive and analogical reasoning to enrich the semantic relationships. The proposed algorithms have been implemented. Experiments show that the approach is feasible.
The number of Web blogs is growing extremely fast, and this phenomenon cannot be ignored. This paper examines it by monitoring a set of blogs for a two-month period in September-October 2003 and characterizing these blogs based on descriptive statistics and content analysis.
On-demand broadcast has been supported in the Internet to enhance system scalability. Unfortunately, most existing on-demand scheduling algorithms do not consider the time constraints associated with web requests. This paper proposes a novel scheduling algorithm, called Slack Inverse Number of requests (SIN), that takes into account the urgency and productivity of serving pending requests. Trace-driven experiments demonstrate that SIN significantly outperforms existing algorithms over a wide range of workloads.
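One way to read the name "Slack Inverse Number of requests" is sketched below. The abstract does not give the formula, so this is an assumption: each item's score is the slack of its most urgent pending request divided by the number of pending requests, and the item with the smallest score is broadcast next. The function and variable names are illustrative only.

```python
def pick_next_item(pending, now):
    """Choose the item to broadcast next.

    pending: dict mapping item -> non-empty list of request deadlines.
    Assumed score: (earliest deadline - now) / number of pending requests.
    """
    def score(item):
        deadlines = pending[item]
        slack = min(deadlines) - now   # urgency of the most pressing request
        return slack / len(deadlines)  # more waiting requests lower the score
    return min(pending, key=score)
```

Under this reading, an item requested by many clients can be served ahead of a slightly more urgent item requested by only one, which is the trade-off between urgency and productivity that the abstract describes.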
This paper explains our research and implementations of manual, automatic and deep annotations of provenance logs for e-Science in silico experiments. Compared to annotating general Web documents, annotations for scientific data require more sophisticated professional knowledge to recognize concepts from documents, and more complex text extraction and mapping mechanisms. A simple automatic annotation approach based on "lexicons" and a deep annotation implemented by semantically populating, translating and annotating provenance logs are introduced in this paper. We used COHSE (Conceptual Open Hypermedia Services Environment) to annotate and browse provenance logs from the myGrid project, which are conceptually linked together as a hypertext Web of provenance logs and experiment resources, based on the associated conceptual metadata and reasoning over these metadata.
To keep an overview of a complex corporate web site, it is crucial to understand the relationship between its content, its structure and its users' behavior. In this paper, we describe an approach that allows us to compare web page content with the information implicitly defined by the structure of the web site. We start by describing each web page with a set of key words. We combine this information with the link structure in an algorithm generating a context based description. By comparing both descriptions, we draw conclusions about the semantic relationship of a web page and its neighbourhood. In this way, we indicate whether a page fits in the content of its neighbourhood. Doing this, we implicitly identify topics which span several connected web pages. With our approach we support redesign processes by comparing the actual structure and content of a web site with the designer's concepts.
The overwhelming success of the Web as a mechanism for facilitating information retrieval and for conducting business transactions has led to an increase in the deployment of complex enterprise applications. These applications typically run on Web Application Servers, which assume the burden of managing many tasks, such as concurrency, memory management, database access, etc., required by these applications. The performance of an Application Server depends heavily on appropriate configuration. Configuration is a difficult and error-prone task due to the large number of configuration parameters and complex interactions between them. We formulate the problem of finding an optimal configuration for a given application as a black-box optimization problem. We propose a smart hill-climbing algorithm using ideas of importance sampling and Latin Hypercube Sampling (LHS). The algorithm is efficient in both searching and random sampling. It consists of estimating a local function, and then, hill-climbing in the steepest descent direction. The algorithm also learns from past searches and restarts in a smart and selective fashion using the idea of importance sampling. We have carried out extensive experiments with an on-line brokerage application running in a WebSphere environment. Empirical results demonstrate that our algorithm is more efficient than and superior to traditional heuristic methods.
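Latin Hypercube Sampling, which the abstract above names as one ingredient, can be sketched on its own. The code below shows only the generic sampling step, not the authors' smart hill-climbing algorithm; the function name and the example parameter ranges are assumptions chosen for illustration.

```python
import random

def latin_hypercube(bounds, n_samples, rng=None):
    """Draw n_samples points so each dimension's range is evenly stratified.

    bounds: list of (low, high) pairs, one per configuration parameter.
    """
    rng = rng or random.Random()
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        # One random point inside each of the n_samples equal-width strata.
        column = [low + (i + rng.random()) * width for i in range(n_samples)]
        # Shuffle so strata are paired randomly across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]
```

For example, `latin_hypercube([(0, 10), (100, 200)], 5)` yields five candidate configurations in which each fifth of both parameter ranges is visited exactly once, which covers the space more evenly than five independent uniform draws tend to.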
In the past few years, web usage mining techniques have grown rapidly together with the explosive growth of the web, both in the research and commercial areas. In this work we present a Web mining strategy for Web personalization based on a novel pattern recognition strategy which analyzes and classifies both static and dynamic features. The results of experiments on the data from a large commercial web site are presented to show the effectiveness of the proposed system.
In this paper, we describe the notion of a semantic information portal. This is a community information portal that exploits the semantic web standards to improve structure, extensibility, customization and sustainability. We are in the process of developing a prototype directory of environmental organizations as a demonstration of the approach and outline the design challenges involved and the current status of the work.
This paper presents an algorithm to bound the bandwidth of a Web crawler. The crawler collects statistics on the transfer rate of each server to predict the expected bandwidth use for future downloads. The prediction allows us to activate the optimal number of fetcher threads in order to exploit the assigned bandwidth. The experimental results show the effectiveness of the proposed technique.
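The abstract does not give the prediction formula, so the following is only a minimal sketch of the general idea under stated assumptions: per-server transfer rates are smoothed with an exponential moving average, and fetchers are activated in queue order while the sum of predicted rates stays within the assigned bandwidth. The names `update_rate` and `threads_for_budget` and the smoothing factor are hypothetical.

```python
def update_rate(previous_rate, observed_rate, alpha=0.2):
    """Blend a new transfer-rate sample into the running estimate (assumed EWMA)."""
    return (1.0 - alpha) * previous_rate + alpha * observed_rate


def threads_for_budget(predicted_rates, budget):
    """Return how many queued downloads can start without exceeding the budget.

    predicted_rates: predicted bytes/s for each queued download, in queue order.
    budget: bandwidth assigned to the crawler, in bytes/s.
    """
    total = 0.0
    count = 0
    for rate in predicted_rates:
        if total + rate > budget:
            break  # starting this fetcher would exceed the assigned bandwidth
        total += rate
        count += 1
    return count


if __name__ == "__main__":
    print(threads_for_budget([100, 200, 300, 400], 650))
```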
For most Web-based applications, contents are created dynamically based on the current state of a business, such as product prices and inventory, stored in database systems. These applications demand personalized content and track user behavior while maintaining application integrity. Many such practices are not compatible with Web acceleration solutions. Consequently, although many Web acceleration solutions have shown promising performance improvement and scalability, architecting and engineering distributed enterprise Web applications to utilize available content delivery networks remains a challenge. In this paper, we examine the challenge of accelerating J2EE-based enterprise Web applications. We list obstacles and recommend some practices to transform typical database-driven J2EE applications into cache-friendly Web applications where Web acceleration solutions can be applied. Furthermore, such transformation should be done without modification to the underlying application business logic and without sacrificing functions that are essential to e-commerce. We take the J2EE reference software, the Java PetStore, as a case study. By using the proposed guidelines, we are able to cache more than 90% of the content in the PetStore and scale up the Web site more than 20 times.
Fragment-based caching has been proposed as a promising technique for dynamic Web content delivery and caching. Most of these approaches either assume the fragment-based content is served by the Web server automatically, or look at server-side caching only. There is no method of extracting fragments from existing dynamic Web content, which is of great importance to the success of fragment-based caching. Also, current technologies for supporting dynamic fragments do not take into account changes in fragment spatiality, which is a popular technique in dynamic and personalized Web site design. This paper describes our effort to address these shortcomings. Our first proposal, DyCA, a Dynamic Content Adapter, is a tool for creating fragment-based content from original dynamic content. Our second proposal is an augmentation to the ESI standard that will allow it to support looking up fragment locations in a mapping table that comes attached with the template. This allows the fragments to move across the document without needing to re-serve the template.
It is increasingly common for users to interact with the web using a number of different aliases. This trend is a double-edged sword. On one hand, it is a fundamental building block in approaches to online privacy. On the other hand, there are economic and social consequences to allowing each user an arbitrary number of free aliases. Thus, there is great interest in understanding the fundamental issues in obscuring the identities behind aliases. However, most work in the area has focused on linking aliases through analysis of lower-level properties of interactions such as network routes. We show that aliases that actively post text on the web can be linked together through analysis of that text. We study a large number of users posting on bulletin boards, and develop algorithms to anti-alias those users: we can with a high degree of success identify when two aliases belong to the same individual. Our results show that such techniques are surprisingly effective, leading us to conclude that guaranteeing privacy among aliases that post actively requires mechanisms that do not yet exist.
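To make the idea of linking aliases through their text concrete, here is a deliberately simple sketch that compares the word-frequency profiles of two aliases with cosine similarity. This is a stand-in chosen for brevity, not the algorithm evaluated in the paper; the function name `cosine_similarity` and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between lowercase word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a if word in counts_b)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one text is empty
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    posts_alias_1 = "imho the server config is borked again"
    posts_alias_2 = "imho that config was borked from day one"
    print(cosine_similarity(posts_alias_1, posts_alias_2))
```

Two aliases whose concatenated posts score above some tuned threshold would be flagged as candidates for belonging to the same individual.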
We present a Context Ultra-Sensitive Approach based on two-step Recommender systems (CUSA-2-step-Rec). Our approach relies on a committee of profile-specific neural networks. This approach provides recommendations that are accurate and fast to train because only the URLs relevant to a specific profile are used to define the architecture of each network. We compare the proposed approach with collaborative filtering showing that our approach achieves higher coverage and precision while being faster, and requiring lower main memory at recommendation time. While most recommenders are inherently context sensitive, our approach is context ultra-sensitive because a different recommendation model is designed for each profile separately.
We present an evaluation of four knowledge base systems with respect to use in large Semantic Web applications. We discuss the performance of each system. In particular, we show that existing systems need to place a greater emphasis on scalability.
The wealth of information available on the web makes it an attractive resource for seeking quick answers. While web-based question answering has become an emerging topic in recent years, the problem of efficiently locating a complete set of distinct answers on the Web is far from being solved. We introduce our system, FADA, which relies on question event analysis, web page clustering, and natural language parsing to find reliable distinct answers with high recall. The method has been found to be effective in strengthening state-of-the-art Web question answering techniques by emphasizing answer completeness and uniqueness.
The meaning of names (URI references) is a contentious issue in the Semantic Web. Numerous proposals have been given for how to provide meaning for names in the Semantic Web, ranging from a strict localized model-theoretic semantics to proposals for a unified single meaning. We argue that a slight expansion of the standard model-theoretic semantics for names is sufficient for the present, and can easily be augmented where necessary to allow communities of interest to strengthen this spartan theory of meaning.
Business processes involve interactions among autonomous partners. We propose that these interactions be specified modularly as protocols. Protocols can be published, enabling implementors to independently develop components that respect published protocols and yet serve diverse interests. A variety of business protocols would be needed to capture subtle business needs. We propose that the same kinds of conceptual abstractions be developed for protocols as for information models. Specifically, we consider (1) refinement: a subprotocol may satisfy the requirements of a superprotocol, but support additional properties and (2) aggregation: a protocol may combine existing protocols. In support of the above, we develop a formal semantics for protocols, an operational characterization of them, and an algebra for protocol composition.
The celebrated PageRank algorithm has proved to be a very effective paradigm for ranking results of web search algorithms. In this paper we refine this basic paradigm to take into account several evolving prominent features of the web, and propose several algorithmic innovations. First, we analyze features of the rapidly growing "frontier" of the web, namely the part of the web that crawlers are unable to cover for one reason or another. We analyze the effect of these pages and find it to be significant. We suggest ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance. Finally we suggest new methods of ranking that are motivated by the hierarchical structure of the web, are more efficient than PageRank, and may be more resistant to direct manipulation.
We believe it is important for web graphic standards such as SVG to support user interaction and diagrams that can adapt their layout and appearance to their viewing context so as to take into account viewing device characteristics and the viewer's requirements. Previously we suggested that adding expression-based attributes to SVG and using one-way constraints to evaluate these dynamically would considerably improve SVG's support for adaptive layout and user interaction. We describe a minimal backward compatible extension to SVG 1.1, called Constraint SVG (CSVG), that provides such expression-based attributes and its implementation on top of Batik. CSVG also provides another significant extension to SVG 1.1: it allows the author to define new custom elements using XSLT.
This paper introduces a methodology to provide the first characterization of public Web Services in terms of their evolution, location, complexity, message size, and response time.
Web database users often experience difficulty in articulating their needs using a precise query. Providing a ranked set of possible answers would benefit such users. We propose to provide ranked answers to user queries by identifying a set of queries from the query log whose answers are relevant to the given user query. The relevance detection is done using a domain and end-user independent content similarity estimation technique.
Existing commercial Web browsers provide various utilities and functions, e.g., Web bookmarks and a browsing history list. Since the bookmark and history functions record only the title and URL of the Web page, users who cannot remember the contents of each Web page have difficulty retracing their steps. In this paper, we propose a bookmark system based on a 3D interface. Additionally, our system offers three main functions: a 3D browsing history function, a marker function, and a look-ahead loading function. These functions enable users to browse Web pages more effectively.
As programmable mobile devices (such as high-end cellular phones and Personal Digital Assistants) become widely adopted, users ask for Internet access on the road. While upcoming technologies like UMTS and Wi-Fi provide broadband wireless communication, Web services and Web browsers do not provide any sort of location-awareness yet. As GPS receivers get cheaper, positioning devices will be embedded into commercial mobile devices. Thus, the position of the user can be used to filter and tailor the information presented to the user, as is already done for language preferences and user-agent. This paper describes early results of an ongoing project called GPSWeb, which aims to provide GPS support for Web browsers and an application model for Location-Based Services. It introduces the Location-Based Browsing concept that enhances the classic Web user-Website interaction.
Web link analysis has proven to be a significant enhancement for quality based web search. Most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). Unfortunately, most link analysis research only considers one type of link. In this paper, we propose a unified link analysis framework, called "link fusion", which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. The PageRank and HITS algorithms are shown to be special cases of our unified link analysis framework. Experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the HITS and DirectHit algorithms by 24.6% and 38.2% respectively.
We present a system that tries to automatically collect and monitor Japanese blog collections that include not only ones made with blog software but also ones written as normal web pages. Our approach is based on extraction of date expressions and analysis of HTML documents. Our system also extracts and mines useful information from the collected blog pages.
We have developed a set of hardware and software components to realize ubiquitous computing environments, based on two keywords, "simple" (easy to implement) and "open" (adopt widely publicized specifications). This set has resulted in UBKit (Ubiquity Building Toolkit). The Micro-Server, an instance of UBKit, enables existing consumer electronics to join computer networks. In this paper we propose a scheme for discovery and control of devices attached to micro-servers.
This paper proposes an Ontology-based Rights Expression Language, called OREL. Based on the OWL Web Ontology Language, OREL allows not only users but also machines to handle digital rights at the semantic level. The ontology-based rights model of OREL is also presented. The usage of OREL and its advantages over existing RELs are discussed.
A fundamental task on the Grid is to decide what jobs to run on what computing resources based on job or application requirements. Our previous work on ontology-based matchmaking discusses a resource matchmaking mechanism using Semantic Web technologies. We extend our previous work to provide dynamic access to such matchmaking capability by building a persistent online matchmaking service. Our implementation uses the Globus Toolkit for the Grid service development, and exploits the monitoring and discovery service in the Grid infrastructure to dynamically discover and update resource information. We describe the architecture of our semantic matchmaker service in the poster.
We present a variant of PageRank, WLRank, that considers different Web page attributes to give more weight to some links. Our evaluation shows that the precision of the answers can improve significantly.
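The abstract does not say how WLRank derives link weights from page attributes, so the sketch below only illustrates the general idea of a link-weighted PageRank: each page distributes its rank to its targets in proportion to a given weight instead of uniformly. The function name `weighted_pagerank`, the damping factor, the iteration count, and the example weights are all assumptions.

```python
def weighted_pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank where out-links carry non-negative weights.

    links: dict mapping each page to a dict {target_page: weight}.
    Returns a dict mapping every page to its rank; ranks sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the uniform "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, {})
            total_weight = sum(targets.values())
            if total_weight == 0:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
            else:
                # Heavier links pass on a larger fraction of the rank.
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[page] * weight / total_weight
        rank = new_rank
    return rank


if __name__ == "__main__":
    example = {"a": {"b": 3.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
    for page, score in sorted(weighted_pagerank(example).items()):
        print(page, round(score, 4))
```

With all weights equal, this reduces to ordinary PageRank over the same graph.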
The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
In this paper, we propose a framework called CC-Buddy for maintaining dynamic data coherency in a peer-to-peer environment. Working on the basis of peer heterogeneity in data coherency requirements, peers in CC-Buddy cooperate with each other to disseminate updates by pushing. Simulation results show that our solution not only improves data fidelity but also reduces the workload of servers, and therefore achieves high scalability.
This work specifically addresses the search issues in unstructured peer-to-peer (P2P) systems that involve the design of an efficient search algorithm, the proposed dynamic search, and the modeling of P2P systems reflecting real measured P2P networks. Through simulations, we show that dynamic search outperforms existing search algorithms.
This paper proposes a novel approach to integrate heterogeneous data in P2P networks. The approach includes a tool for building P2P semantic link networks, mechanisms for peer schema mapping, criteria for peer similarity degree measurement, and algorithms for heterogeneous data integration. The approach has three advantages: First, it uses semantic links to describe semantic relationships between peers' data schemas. Second, it deals with the semantic heterogeneity, the structural heterogeneity and the data value inconsistency. Finally, it considers the semantic similarity and structural similarity to forward queries to relevant peers.
The evaluation and assessment of physicians-in-training (house staff) is a complex task. Residency training programs are under increasing pressure [1] to provide accurate and comprehensive evaluations of performance of resident physicians [2,3]. For many years, the Internal Medicine training program at NYU School of Medicine used a single standardized paper form for all evaluation scenarios. This strategy was inadequate as physicians train in multiple diverse settings: evaluation of physicians in the intensive care unit is quite different from that in the general clinics. The paper system resulted in poor compliance by house staff and faculty in the completion of evaluations. In addition, the data being collected from the paper forms was of poor quality due to the non-specific nature of the questions. A committee was formed in 2001, which created a new strategy for evaluating the core competencies of house staff. Given the ubiquity of web-accessible computers in the clinical and non-clinical areas of hospitals and the flexibility a computerized system would provide, a web-based evaluation system was designed and implemented. This system allows for on-the-spot evaluations tailored to the evaluator, evaluatee and the venue of the evaluation. During the 2002 residency year, data was collected on satisfaction and use of the system and compared with the previous paper evaluation.
Maximizing only the relevance between queries and documents will not satisfy users if they want the top search results to present a wide coverage of topics by a few representative documents. In this paper, we propose two new metrics to evaluate the performance of information retrieval: diversity, which measures the topic coverage of a group of documents, and information richness, which measures the amount of information contained in a document. Then we present a novel ranking scheme, Affinity Rank, which utilizes these two metrics to improve search results. We demonstrate how Affinity Rank works on a toy data set, and verify our method by experiments on real-world data sets.
Delivering web pages to mobile phones or personal digital assistants has become possible with the latest wireless technology. However, mobile devices have very small screen sizes and memory capacities. Converting web pages for delivery to a mobile device is an exciting new problem. In this paper, we propose to use a ranking algorithm similar to Google's PageRank algorithm to rank the content objects within a web page. This allows the extraction of only important parts of web pages for delivery to mobile devices. Experiments show that the new method is effective. In experiments on pages from randomly selected websites, the system needed to extract and deliver only 39% of the objects in a web page in order to provide 85% of a viewer's desired viewing content. This provides significant savings in the wireless traffic and downloading time while providing a satisfactory reading experience on the mobile device.
The increased importance of XML as a universal data representation format has led to several proposals for enabling the development of applications that operate on XML data. These proposals range from runtime API-based interfaces to XML-based programming languages. The subject of this paper is XJ, a research language that proposes novel mechanisms for the integration of XML as a first-class construct into Java. The design goals of XJ distinguish it from past work on integrating XML support into programming languages -- specifically, the XJ design adheres to the XML Schema and XPath standards, and supports in-place updates of XML data, thereby keeping with the imperative nature of Java. We have also built a prototype compiler for XJ, and our preliminary experimental results demonstrate that the performance of XJ programs can approach that of traditional low-level API-based interfaces, while providing a higher level of abstraction.
This paper presents an approach and a toolset for exploiting the benefits of conceptual modeling in the quality evaluation tasks that take place both before the deployment and during the operational life of a Web application. The full version of the paper is available as a technical report at the address: http://www.elet.polimi.it/upload/fraterna/FLMM2004.pdf.
Summarizing web pages has recently gained much attention from researchers. Until now two main types of approaches have been proposed for this task: content- and context-based methods. Both of them assume fixed content and characteristics of web documents without considering their dynamic nature. However, the volatility of information published on the Internet argues for the implementation of more time-aware techniques. This paper proposes a new approach towards automatic web page description, which extends the concept of a web page by the temporal dimension. Our method provides a broader view on web document summarization and can complement the existing techniques.
This paper provides an objective evaluation of the performance impacts of binary XML encodings, using a fast stream-based XQuery processor as our representative application. Instead of proposing one binary format and comparing it against standard XML parsers, we investigate the individual effects of several binary encoding techniques that are shared by many proposals. Our goal is to provide a deeper understanding of the performance impacts of binary XML encodings in order to clarify the ongoing and often contentious debate over their merits, particularly in the domain of high performance XML stream processing.
Search results generated by searchable databases are served dynamically and are far larger than the static documents on the Web. These results pages have been referred to as the Deep Web. To integrate results from different searchable databases, we need to extract the target data from their results pages. We propose a test bed for information extraction from search results. We chose 100 databases randomly from 114,540 pages with search forms, so these databases have a good variety. We selected 51 databases which include URLs in a results page and manually identified the target information to be extracted. We also suggest evaluation measures for comparing extraction methods and methods for extending the target data.
The ontological representation of learning objects is a way to deal with the interoperability and reusability of learning objects (including metadata) through providing a semantic infrastructure that will explicitly declare the semantics and forms of concepts used in labeling learning objects. This paper reports the preliminary result from a learning object ontology construction project, which includes an in-depth study of 14 learning objects and over 500 components in these learning objects. An analysis of the types of components and terms used in these objects reveals that most terms fell into the form and subject categories; few pedagogical terms were used. Drawing findings from literature and case study, the authors use a matrix to show relationships in learning objects and relevant knowledge and technologies. Strategies and methods in ontology development and implementation are also discussed.
Participation in the web of communities requires a common language, a common technological structure and development of content that is relevant and captivating. This paper reports on a project that both conserves a rich regional cultural heritage and has structured the content developed during this conservation to be fluidly shared with both the domain and the broader communities. It also examines the varied degrees of acceptance within these communities.
To improve the process of user information retrieval, we propose the concept of a latent semantic map (LSM), along with a method of generating this map. The novel aspect of the LSM is that it can archive user models and latent semantic analysis on one map to support instantaneous information retrieval. With this characteristic, the LSM can improve search engines in terms of not only user support but also search results.
Video is one of the most popular types of data shared on the Web, and the protection of video copyright is of vast interest. In this paper, we present a comprehensive approach for protecting and managing video copyrights on the Internet with watermarking techniques. We propose a novel hybrid digital video watermarking scheme with scrambled watermarks and error correction codes. The effectiveness of this scheme is verified through a series of experiments, and the robustness of our approach is demonstrated using the criteria of the latest StirMark test.
Automatically generated HTML, as produced by WYSIWYG programs, typically contains much repetitive and unnecessary markup. This paper identifies aspects of such HTML that may be altered while leaving a semantically equivalent document, and proposes techniques to achieve optimizing modifications. These techniques include attribute re-arrangement via dynamic programming, the use of style classes, and dead-code removal. These techniques produce documents as small as 33% of original size. The size decreases obtained are still significant when the techniques are used in combination with conventional text-based compression.
Query flooding is a problem in Peer-to-Peer networks like Gnutella. The Firework Query Model (FQM) addresses this problem through peer clustering and routes the query message more intelligently. However, it still has drawbacks such as query flooding inside clusters. The situation can be improved if the query message can be sent directly to the query destination, as the message then does not need to be sent hop by hop. This can be achieved by ranking. By ranking, the network can know the destination and the information quality shared by each peer. We introduce distributed ranking in this paper. We give the background of FQM, outline the proposed method, and conduct a series of experiments that demonstrate a significant reduction of query flooding in a P2P network.
We propose a decentralized privacy-preserving approach to spam filtering. Our solution exploits robust digests to identify messages that are a slight variation of one another and a peer-to-peer architecture between mail servers to collaboratively share knowledge about spam.
Peer-To-Peer (P2P) networks like Gnutella improve some shortcomings of Conventional Search Engines (CSE) such as centralized and outdated indexing by distributing the search engines over the peers, which maintain their updated local contents. But they are designed for sharing and searching the contents in personal computers instead of websites. In this work, we propose a novel web information retrieval method called Site-To-Site (S2S) searching, which uses the P2P framework with CGI as protocol. It helps the site owners to turn their websites into autonomous search engines without extra hardware and software costs. In this paper, we introduce S2S searching with some related work. We also describe the system architecture and communication protocol. Finally, we summarize the experimental results, and show that S2S searching works well in one thousand sites.
The massive distribution of the crawling task can lead to inefficient exploration of the same portion of the Web. We propose a technique to guide crawlers' exploration based on the notion of Web communities. The stability properties of the method can be used as an implicit coordination mechanism to increase the efficiency of the crawling task.
Web data integration is an important preprocessing step for web mining. It is highly likely that several records on the web whose textual representations differ may represent the same real world entity. These records are called approximate duplicates. Data integration seeks to identify such approximate duplicates and merge them into integrated records. Many existing data integration algorithms make use of approximate string join, which seeks to (approximately) find all pairs of strings whose distances are less than a certain threshold. In this paper, we propose a new mapping method to detect pairs of strings with similarity above a certain threshold. In our method, each string is first mapped to a point in a high dimensional grid space, then pairs of points whose distances are 1 are identified. We implement it using Oracle SQL and PL/SQL. Finally, we evaluate this method using real data sets. Experimental results suggest that our method is both accurate and efficient.
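For reference, the approximate string join that the abstract describes as the common building block can be written as a naive all-pairs comparison using edit distance. The sketch below is only this baseline, not the grid-space mapping proposed in the paper, and it runs in plain Python rather than Oracle SQL and PL/SQL; the names `edit_distance` and `approximate_join` are assumptions.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def approximate_join(strings, threshold):
    """Return all pairs of strings whose edit distance is below the threshold."""
    return [
        (s, t)
        for s, t in combinations(strings, 2)
        if edit_distance(s, t) < threshold
    ]


if __name__ == "__main__":
    records = ["IBM Corp", "IBM Corp.", "Oracle"]
    print(approximate_join(records, threshold=2))
```

The quadratic number of comparisons in this baseline is the cost that mapping-based methods aim to avoid.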
A problem facing many textbook authors (including one of the authors of this paper) is the inevitable delay between new advances in the subject area and their incorporation in a new (paper) edition of the textbook. This means that some textbooks are quickly considered out of date, particularly in active technological areas such as the Web, even though the ideas presented in the textbook are still valid and important to the community. This paper describes our approach to building a companion website for the textbook Hypermedia and the Web: An Engineering Approach. We use Bloom's taxonomy of educational objectives to critically evaluate a number of authoring and presentation techniques used in existing companion websites, and adapt these techniques to create our own companion website using Semantic Web technologies in order to overcome the identified weaknesses. Finally, we discuss a potential model of future companion websites, in the context of an e-publishing, e-commerce Semantic Web services scenario.
In this paper we analyze a very large junk e-mail corpus which was generated by a hundred thousand volunteer users of the Hotmail e-mail service. We describe how the corpus is being collected, and analyze the geographic origins of the e-mail, who the e-mail is targeting, and what the e-mail is selling.
This study examines the effects of different types of site maps on users' performance in an information-searching task for three web sites. Forty-two participants (22 males and 20 females) participated in the study. The results showed significant effects of the type of site map used. It was found that participants found the correct answers more often, required less time, visited significantly fewer web pages, and required fewer clicks to complete the task when the site map was visible. However, it was found that the participants had a lower success rate in finding the correct answers when the site map had hyperlinks. In addition, the results showed significant performance differences among the three web sites, and the effects of a site map were found to be more prominent for a larger web site.
Theoretical analysis of the Web graph is often used to improve the efficiency of search engines. The PageRank algorithm, proposed by Brin and Page, is used by the Google search engine to improve the results of the queries. The purpose of this article is to describe an enhanced version of the PageRank algorithm using a realistic model for the back button. We introduce a limited history stack model (you cannot click more than m times in a row), and show that when m=1, the computation of this Back PageRank can be as fast as that of a standard PageRank.
In this paper, we study the distribution of relevant documents in aggregates, formed by grouping the retrieved documents according to their domain. For each aggregate, we take into account its size, and a measure of the correlation between its incoming and outgoing hyperlinks. We report on a preliminary experiment with two TREC topic distillation tasks, where we find that larger aggregates, or those aggregates with correlated hyperlinks, are more likely to contain relevant documents. This result shows that the distribution of domain-level aggregates is potentially useful for finding relevant documents.
We developed a diagrammatic inference system for the World Wide Web. Our system enables the creation of diagrams such that the information contained in them can be searched and inference can be performed on it. We developed an XML Schema for bar, line, and pie charts. Based on it, we developed software that transforms a corresponding XML file into an SVG image, which in turn is rendered by the client as an image. Additionally, we developed a search engine which enables a user to find information explicitly contained in the XML file, and as such in the image. Furthermore, we developed an inference engine which enables a user to locate information that is implicitly contained in the image.
This paper presents a search architecture that combines classical search techniques with spread activation techniques applied to a semantic model of a given domain. Given an ontology, weights are assigned to links based on certain properties of the ontology, so that they measure the strength of the relation. Spread activation techniques are used to find related concepts in the ontology given an initial set of concepts and corresponding initial activation values. These initial values are obtained from the results of classical search applied to the data associated with the concepts in the ontology. Two test cases were implemented, with very positive results. It was also observed that the proposed hybrid spread activation, combining the symbolic and the sub-symbolic approaches, achieved better results when compared to each of the approaches alone.
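The spread activation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `decay` and `threshold` parameters, and the toy ontology are all assumptions made for illustration. Link weights stand for the strength of a relation, and the initial activation values stand for scores obtained from a classical search.

```python
def spread_activation(links, initial, decay=0.5, threshold=0.01, max_iterations=10):
    """Propagate activation over weighted links.

    links:   {concept: {neighbor: weight}} (hypothetical ontology links)
    initial: {concept: activation} (e.g. scores from a classical search)
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_iterations):
        next_frontier = {}
        for concept, value in frontier.items():
            for neighbor, weight in links.get(concept, {}).items():
                pulse = value * weight * decay
                if pulse < threshold:
                    continue  # too weak to propagate further
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + pulse
        if not next_frontier:
            break
        # Accumulate what each concept received in this round.
        for concept, value in next_frontier.items():
            activation[concept] = activation.get(concept, 0.0) + value
        frontier = next_frontier
    return activation


# Toy example: one concept found by keyword search activates related concepts.
toy_links = {"Professor": {"Course": 0.8}, "Course": {"Student": 0.5}}
print(spread_activation(toy_links, {"Professor": 1.0}))
```

The decay factor and the iteration cap bound the propagation even when the ontology contains cycles; other stopping rules are equally plausible.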
This paper describes the design and implementation of the eduSource Communication Layer (ECL) protocol. ECL is one outcome of a pan-Canadian project called eduSource Canada to build an open network of interoperable digital repositories. The design goal was to achieve a highly flexible, easy-to-use, and platform independent communication layer protocol that allows new and existing repositories to communicate and share resources across a network. ECL conforms to IMS Digital Repository Interoperability (DRI) specifications and supports four main functions: search/expose, submit/store, gather/expose and request/deliver. The ECL protocol builds on the latest standards and is flexible with respect to metadata schemas and repository contents. To support easy adoption of the protocol we provide middleware components for connecting existing systems. The ECL is currently used in the eduSource network, and we have begun work bridging with other interoperable initiatives such as Open Knowledge Initiative (OKI). Based on our experience, ECL is truly flexible and easy to use.
MetaCrystal enables users to visualize and control the degree of overlap between the results returned by different search engines. Several linked overview tools support rapid exploration, facilitate complex filtering operations and guide users toward relevant information. MetaCrystal addresses the problem of the effective fusion of different search results by helping users to visually combine and filter the top results returned by the different engines. Users can apply weights to the search engines to create their own ranking functions. They can control the degree of overlap by modifying the URL directory depth used to match documents or by changing the number of top documents being compared.
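How the URL directory depth and the number of top documents affect the measured overlap can be sketched as follows. This is an assumed reading of the matching rule, not MetaCrystal's actual code: the helper names `url_key` and `overlap` are hypothetical, and the sketch treats every path segment (including a file name) as one level of depth.

```python
from urllib.parse import urlsplit


def url_key(url, depth):
    """Reduce a URL to its host plus the first `depth` path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return (parts.netloc.lower(), tuple(segments[:depth]))


def overlap(results_a, results_b, depth, top_n):
    """Keys shared by the top_n results of two engines at the given depth."""
    keys_a = {url_key(u, depth) for u in results_a[:top_n]}
    keys_b = {url_key(u, depth) for u in results_b[:top_n]}
    return keys_a & keys_b


engine_a = ["http://example.org/docs/a.html", "http://example.org/blog/x"]
engine_b = ["http://example.org/docs/b.html", "http://other.org/"]
for depth in (0, 1, 2):
    print(depth, len(overlap(engine_a, engine_b, depth, top_n=10)))
```

A smaller depth matches documents more loosely (same host or same directory), so the overlap grows; a larger depth requires closer agreement between the URLs.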
Grid computing -- the assemblage of heterogeneous distributed clusters of computers viewed as a single virtual machine -- promises to serve as the next major paradigm in distributed computing. Since Grids are assemblages of (usually) autonomous systems (autonomous clusters, supercomputers, or even single workstations), scheduling can become a complex affair which must take into consideration not just the requirements (and scheduling decisions) made at the point of the job's origin, but also the scheduling requirements (and decisions) made at remote points on the fabric, and in particular scheduling decisions made by a remote autonomous system onto which the local job has been scheduled. Existing scheduling models range from static, where each program is assigned once to a processor before execution commences, to dynamic, where a program may be reassigned to different processors, or a hybrid approach, which combines characteristics of both techniques [1,4,5]. To address this issue, we have developed a Java-based discrete event Grid simulator toolkit called HuskySim. The HuskySim toolkit provides core functionalities (e.g., compute objects, network objects, and scheduling objects) that can be used to simulate a distributed computing environment. Furthermore, it can be used to predict the performance of various classes of Grid scheduling algorithms, including static, dynamic, and adaptive scheduling algorithms. We adopted an object-oriented design, which allows an easy mapping and integration of simulation objects into the simulation program. This approach simplifies the simulation of multitasking and distributed data processing models. Our model of multitasking processing is based on an interrupt-driven mechanism. As shown in Figure 1, the simulator works by relaying messages between the core engine and the simulation modules through the message handling sub-system.
Once the architecture, the load distribution, and the scheduling algorithms are defined, the object registration subsystem sends a NEW OBJECT REQUEST MESSAGE to the object class libraries and builds a skeleton for the requested simulation experiment. Workload traces can be generated using probabilistic models. The currently supported distributions are: Uniform, Poisson, Exponential, Normal, Erlang, and Power Tailed. It is also possible to use real-world load traces. Moreover, we augmented the simulator with a statistical module. Using the statistical module provided with HuskySim, the core simulation engine can send messages to perform various types of analysis on the performance data, including variance reduction, regression, time series analysis, clustering, and data mining. In order to quantify the system performance, the simulator provides various performance metrics, including CPU utilization, disk utilization, application turnaround time, latency, makespan, host-to-host bandwidth, jammed bandwidth, and TCP/IP traffic data. These measurements are handled through the measurement sub-system. Furthermore, HuskySim can be used to simulate the classes of algorithmic and parametric adaptive Grid schedulers, in which the scheduling algorithm need not be fixed in advance. Instead, the scheduling algorithm is selected at run time based on the current workload on the Grid fabric in order to operate at a near-optimal level.
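As an illustration of generating a workload trace from a probabilistic model, the sketch below draws job arrival times with exponentially distributed inter-arrival gaps, which yields a Poisson arrival process. It is written in Python rather than in the toolkit's own Java, and it does not use the HuskySim API; the function name and parameters are assumptions.

```python
import random


def poisson_arrivals(rate, count, seed=0):
    """Return `count` job arrival times for a Poisson process.

    rate: expected number of arrivals per time unit (hypothetical parameter)
    seed: fixed seed so that a simulation run is repeatable
    """
    rng = random.Random(seed)
    now = 0.0
    arrivals = []
    for _ in range(count):
        now += rng.expovariate(rate)  # exponential gap between jobs
        arrivals.append(now)
    return arrivals


trace = poisson_arrivals(rate=2.0, count=5)
print(trace)
```

Seeding the generator keeps experiments reproducible, which matters when the same trace is replayed against several scheduling algorithms.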
A recently published approach to adaptive page rank, using the solution of quadratic optimization methods with a set of simple constraints, is modified to permit classification of web pages according to their page contents and URLs. This modification makes the approach better suited to the needs of focussed crawlers or personalized search engines.
In this paper, we present an algorithm for merging results from different data sources in a meta-search engine. We extend an algorithm that was developed for ranking players of a round-robin tournament to the more general case in which the ranking input is given by multiple sources. The problem in a meta-search engine can be represented by a complete directed graph, which can be used by the Majority Spanning Tree (MST) algorithm. It is especially useful when the system must integrate and merge the query results that are returned from various search engines in a consistent manner.
We present a Semantic Web application that we call CS AKTive Space. The application exploits a wide range of semantically heterogeneous and distributed content relating to Computer Science research in the UK. This content is gathered on a continuous basis using a variety of methods including harvesting and scraping as well as adopting a range of models for content acquisition. The content currently comprises around ten million RDF triples and we have developed storage, retrieval and maintenance methods to support its management. The content is mediated through an ontology constructed for the application domain and incorporates components from other published ontologies. CS AKTive Space supports the exploration of patterns and implications inherent in the content and exploits a variety of visualisations and multidimensional representations. Knowledge services supported in the application include investigating communities of practice: who is working, researching or publishing with whom. This work illustrates a number of substantial challenges for the Semantic Web. These include problems of referential integrity, tractable inference and interaction support. We review our approaches to these issues and discuss relevant related work.
In this paper, we describe our work in progress on the reasoning module of ec(h)o, an augmented audio-reality interface for museum visitors utilizing spatialized soundscapes and a semantic web approach to information. We used ontologies to describe the semantics of sound objects and to represent the user model. A rule-based system for selecting sound objects uses semantic descriptions of objects, the visitor's interaction history, and heuristics for continuity of the dialogue between the user and the system.
We present a modularized storage and indexing framework that cleanly separates the functional components of a P2P system, enabling us to tailor the P2P infrastructure to the specific needs of various Internet applications, without having to devise completely new storage management and index structures for each application.
We propose a new distributed, fault-tolerant Peer-to-Peer index structure for resource discovery applications called the P-tree. P-trees efficiently support range queries in addition to equality queries.
We present an algorithm for updating the PageRank vector [1]. Due to the scale of the web, Google only updates its famous PageRank vector on a monthly basis. However, the Web changes much more frequently. Drastically speeding up the PageRank computation can lead to fresher, more accurate rankings of the webpages retrieved by search engines. It can also bring the goal of real-time personalized rankings within reach. On two small subsets of the web, our
Recommender systems have emerged in the past several years as an effective way to help people cope with the problem of information overload. One application in which they have become particularly common is e-commerce, where recommendation of items can often help a customer find what she is interested in and can therefore help drive sales. Unscrupulous producers in the never-ending quest for market penetration may find it profitable to shill recommender systems by lying to the systems in order to have their products recommended more often than those of their competitors. This paper explores four open questions that may affect the effectiveness of such shilling attacks: which recommender algorithm is being used, whether the application is producing recommendations or predictions, how detectable the attacks are by the operator of the system, and what the properties are of the items being attacked. The questions are explored experimentally on a large data set of movie ratings. Taken together, the results of the paper suggest that new ways must be used to evaluate and detect shilling attacks on recommender systems.
Analysis of web site usage data involves two significant challenges: firstly, the volume of data, arising from the growth of the web, and secondly, the structural complexity of web sites. In this paper we apply Data Mining and Information Visualization techniques to the web domain in order to benefit from the power of both human visual perception and computing; we term this Visual Web Mining. In response to the two challenges, we propose a generic framework, where we apply Data Mining techniques to large web data sets and use Information Visualization methods on the results. The goal is to correlate the outcomes of mining Web Usage Logs and the extracted Web Structure by visually superimposing the results. We design several new information visualization diagrams.
In ongoing research, a collaborative peer network application is being proposed to address the scalability limitations of centralized search engines. Here we introduce a local adaptive routing algorithm used to dynamically change the topology of the peer network based on a simple learning scheme driven by query response interactions among neighbors. We test the algorithm via simulations with 70 model users based on actual Web crawls. We find that the network topology rapidly converges from a random network to a small world network, with emerging clusters that match the user communities with shared interests.
We propose a method of automatically constructing Web content from video streams with metadata that we call TV2Web. The Web content includes thumbnails of video units and caption data generated from metadata. Users can watch TV on a normal Web browser. They can also manipulate Web content with zooming metaphors to seamlessly alter the level of detail (LOD) of the content being viewed. They can search for favorite scenes faster than with analog video equipment, and experience a new cross-media environment. We also developed a prototype of the TV2Web system and discuss its implementation.
Security remains a major roadblock to universal acceptance of the Web for many kinds of transactions, especially since the recent sharp increase in remotely exploitable vulnerabilities has been attributed to Web application bugs. Many verification tools are discovering previously unknown vulnerabilities in legacy C programs, raising hopes that the same success can be achieved with Web applications. In this paper, we describe a sound and holistic approach to ensuring Web application security. Viewing Web application vulnerabilities as a secure information flow problem, we created a lattice-based static analysis algorithm derived from type systems and typestate, and addressed its soundness. During the analysis, sections of code considered vulnerable are instrumented with runtime guards, thus securing Web applications in the absence of user intervention. With sufficient annotations, runtime overhead can be reduced to zero. We also created a tool named WebSSARI (Web application Security by Static Analysis and Runtime Inspection) to test our algorithm, and used it to verify 230 open-source Web application projects on SourceForge.net, which were selected to represent projects of different maturity, popularity, and scale. 69 contained vulnerabilities. After notifying the developers, 38 acknowledged our findings and stated their plans to provide patches. Our statistics also show that static analysis reduced potential runtime overhead by 98.4%.
While being quite successful in providing keyword-based access to web pages, commercial search portals, such as Google, Yahoo, AltaVista, and AOL, still lack the ability to answer questions expressed in a natural language. In this paper, we present a probabilistic approach to automated question answering on the Web. Our approach is based on pattern matching and answer triangulation. By taking advantage of the redundancy inherent in the Web, each answer found by the system is triangulated (confirmed or disconfirmed) against other possible answers. Our approach is entirely self-learning: it does not involve any linguistic resources, nor does it require any manual tuning. Thus, the proposed approach can easily be replicated in other information systems with large redundancy.
RSA is the most popular public-key cryptosystem on the Web today, but long-term trends such as the proliferation of smaller, simpler devices and increasing security needs will make continued reliance on RSA more challenging over time. We offer Elliptic Curve Cryptography (ECC) as a suitable alternative and describe our integration of this technology into several key components of the Web's security infrastructure. We also present experimental results quantifying the benefits of using ECC for secure web transactions.
A (directed) network of people connected by ratings or trust scores, and a model for propagating those trust scores, is a fundamental building block in many of today's most successful e-commerce and recommendation systems. We develop a framework of trust propagation schemes, each of which may be appropriate in certain circumstances, and evaluate the schemes on a large trust network consisting of 800K trust scores expressed among 130K people. We show that a small number of expressed trusts/distrust per individual allows us to predict trust between any two people in the system with high accuracy. Our work appears to be the first to incorporate distrust in a computational trust propagation setting.
Mining user access patterns from a continuous stream of Web-clicks presents new challenges over traditional Web usage mining in a large static Web-click database. Modeling user access patterns as maximal forward references, we present a single-pass algorithm, StreamPath, for the online discovery of frequent path traversal patterns from an extended prefix tree-based data structure which stores the compressed and essential information about users' moving histories in the stream. Theoretical analysis and performance evaluation show that the space requirement of StreamPath is limited to a logarithmic boundary, and that the execution time is fast compared with previous multiple-pass algorithms [2].
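The notion of a maximal forward reference can be made concrete with a short sketch. It shows only how one user's click sequence is cut into maximal forward references whenever the user moves backward; it does not reproduce StreamPath or its prefix tree. The function name is an assumption.

```python
def maximal_forward_references(clicks):
    """Split one user's click sequence into maximal forward references.

    A revisit of a page already on the current path is treated as a
    backward move: the path built so far is emitted (if the previous move
    was forward) and then truncated back to the revisited page.
    """
    path = []
    references = []
    moving_forward = False
    for page in clicks:
        if page in path:
            if moving_forward:
                references.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        references.append(list(path))
    return references


session = list("ABCDCBEGHGWAOUOV")
print(["".join(ref) for ref in maximal_forward_references(session)])
# prints ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```

Frequent path traversal patterns would then be mined from these references, for example by counting how often each prefix occurs across many sessions.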
Without textual descriptions or label information of images, searching semantic concepts in image databases is still a very challenging task. While automatic annotation techniques are still a long way off, we can seek alternative techniques to solve this difficult issue. In this paper, we propose to learn from Web images for searching semantic concepts in large image databases. To formulate effective algorithms, we suggest using support vector machines to attack the problem. We evaluate our algorithm on a large image database and demonstrate preliminary yet promising results.
This paper describes an XPath-based discourse analysis module for Spoken Dialogue Systems that allows the dialogue author to easily manipulate and query both the user input's semantic representation and the dialogue context using a simple and compact formalism. We show that, in managing the human-machine interaction, the discourse context and the dialogue history are effectively represented as Document Object Model (DOM) structures. DOM defines interfaces that dialogue scripts can use to dynamically access and update the content, the structure and the style of the documents. In general, this approach also applies to richer multimedia and multimodal interactions where the interpretation of the user input depends on a combination of input modalities.
In this article, we explore a new role for the computer in art as a reflector of popular culture. Moving away from the static audio-visual installations of other artistic endeavors and from the traditional role of the machine as a computational tool, we fuse art and the Internet to expose cultural connections people draw implicitly but rarely consider directly. We describe several art installations that use the World Wide Web as a reflection of cultural reality to highlight and explore the relations between ideas that compose the fabric of our everyday lives.
Metadata development can be challenging because the vocabulary should be flexible and extensible, widely applicable, interoperable, and both machine and human readable. We describe how we engaged members of organizations in the field of technical assistance to educators in a process of metadata development, and the challenges we faced. The result was an ontology for the communities of practice that is interoperable and can evolve; it was then used to catalogue resources for dissemination via the Semantic Web.
RDF/XML does not layer RDF on top of XML in a useful way. We use a simple direct representation of the RDF abstract syntax in XML. We add the ability to name graphs, noting that in practice this is already widely used. We use XSLT as a general syntactic extensibility mechanism to provide human friendly macros for our syntax. This provides a simple serialization solving a persistent problem in the Semantic Web.
Current search technologies work in a "one size fits all" fashion. Therefore, the answer to a query is independent of the specific user's information need. In this paper we describe a novel ranking technique for personalized search services that combines content-based and community-based evidence. The community-based information is used in order to provide context for queries and is influenced by the current interaction of the user with the service. Our algorithm is evaluated using data derived from an actual service available on the Web, an online bookstore. We show that the quality of content-based ranking strategies can be improved by the use of community information as another evidential source of relevance. In our experiments the improvements reach up to 48% in terms of average precision.
This paper provides an overview of a technique for extracting information from the Web search interfaces of e-commerce search engines that is useful for supporting automatic search interface integration. In particular, we discuss how to group elements and labels on a search interface into attributes and how to derive certain meta-information for each attribute.
In this paper, we sketch a method for clustering e-commerce search engines (ESEs) by the type of products/services they sell. This method utilizes the special features of interface pages of such search engines. We also provide an analysis of different types of ESE interface pages.
Museum collections contain large amounts of data and semantically rich, mutually interrelated metadata in heterogeneous databases. The publication of museum collections on the web is therefore a very promising application domain for semantic web techniques. We present a semantic web portal called "MuseumFinland -- Finnish Museums on the Semantic Web" [3] that contains some 4,000 cultural artifacts from the collections of three museums using three different database schemas and database systems. The system is based on seven RDF(S) ontologies consisting of some 10,000 classes and individuals.
In this paper, we present our development of a document management and retrieval tool named Ontalk. Our system provides a semi-automatic metadata generator and an ontology-based search engine for electronic documents. Ontalk can create or import various ontologies in RDFS or OWL for describing the metadata. Our system, which is built upon .NET technology, can easily communicate with or be flexibly plugged into many different programs.
A number of applications require selecting targets for specific content on the basis of criteria defined by the content providers rather than selecting documents in response to user queries, as in ordinary information retrieval. We present a class of retrieval systems, called Best Bets, that generalize Information Filtering and encompass a variety of applications including editorial suggestions, promotional campaigns and targeted advertising, such as Google AdWords. We developed techniques for implementing Best Bets systems that address performance issues for large-scale deployment, such as efficient query search, incremental updates and dynamic ranking.
This paper presents a transaction-time HTTP server, called TTApache, that supports document versioning. A document often consists of a main file formatted in HTML or XML and several included files such as images and stylesheets. A change to any of the files associated with a document creates a new version of that document. To construct a document version history, snapshots of the document's files are obtained over time. Transaction times are associated with each file version to record the version's lifetime. The transaction time is the system time of the edit that created the version. Accounting for transaction time is essential to supporting audit queries that delve into past document versions and differential queries that pinpoint differences between two versions. TTApache performs automatic versioning when a document is read, thereby removing the burden of versioning from document authors. Since some versions may be created but never read, TTApache distinguishes between known and assumed versions of a document. TTApache has a simple query language to retrieve desired versions. A browser can request a specific version, or the entire history of a document. Queries can also rewrite links and references to point to current or past versions. Over time, the version history of a document continually grows. To free space, some versions can be vacuumed. Vacuuming a version, however, changes the semantics of requests for that version. This paper presents several policies for vacuuming versions and strategies for accounting for vacuumed versions in queries.
This paper presents a system for efficient data transformations between XML and relational databases, called XML Data Mediator (XDM). XDM enables the transformation by externalizing the specification of the mapping in a script and using an efficient run-time engine that automates the conversion task. The runtime engine is independent from the mapping script. A parser converts a mapping script into an internal conversion object. For the mapping from relational to XML, we use a tagging tree as a conversion object inside the runtime engine, and use an SQL outer-join scheme to combine multiple SQL queries in order to reduce the number of backend relational database accesses. For the mapping from XML to relational, the conversion object is a shredding tree, and we use an innovative algorithm to process the XML as a stream in order to achieve linear complexity with respect to the size of the XML document.
Applications that run on top of web browsers dominate the Internet today. Given the many similarities among these applications' features, positive transference from one to another is often seen as an important source of ease-of-use for such applications. This paper examines the many differences in the way similar features are implemented in different browser-based applications, analyzing the way these inconsistencies can lead to negative transference (interference) that degrades rather than enhances usability.
Semantic Web technology is intended for the retrieval, collection, and analysis of meaningful data with significant automation afforded by machine understandability of data [1]. As one illustration of semantic web technology in action, we present SEMPL, a semantic web portal for the Large Scale Distributed Information Systems lab (LSDIS) at the University of Georgia. SEMPL, which is powered by a state of the art commercial system, Semagix Freedom [7], uses an ontology-driven approach to provide semantic browsing, linking, and contextual querying of content within the portal. By using the ontology based information integration technique, SEMPL can specify the context of a particular piece of research information, annotate web pages, and provide links to semantically related areas enabling a rich contextual retrieval of information.
In recent years, search engine research has grown rapidly in areas such as algorithms, strategies and architecture, increasing both effectiveness and quality of results. However, a very important aspect that is often neglected is the user interface. In this work we analyzed the interfaces of several popular search tools from the user's point of view, and collected individual feedback in order to determine whether it is possible to improve interface design.
The evaluation of eLearning success is an indispensable business requirement of education programs: the easy registration of 'visits' to eLearning websites is, however, not sufficient in most cases. Additional metrics from authenticated logins and reports of learning activity and success -- as obtained from specific online tests -- are required. The aim is to document the acceptance, progress and return on investment (ROI) of eLearning programs, and set up additional training well tailored to the needs of a specific learning community. An example from a corporate certification program demonstrates the applicability of the proposed processes.
Researchers in Web engineering have regularly noted that existing Web application development environments provide little support for managing the evolution of Web applications. Key limitations of Web development environments include line-oriented change models that inadequately represent Web document semantics and an inability to model changes to link structure or the set of objects making up the Web application. Developers may find it difficult to grasp how the overall structure of the Web application has changed over time and may respond by using ad hoc solutions that lead to problems of maintainability, quality and reliability. Web applications are software artifacts, and as such, can benefit from advanced version control and software configuration management (SCM) technologies from software engineering. We have modified an integrated development environment to manage the evolution and maintenance of Web applications. The resulting environment is distinguished by its fine-grained version control framework, fine-grained Web content change management, and product versioning configuration management, in which a Web project can be organized at the logical level and its structure and components are versioned in a fine-grained manner as well. This paper describes the motivation for this environment as well as its user interfaces, features, and implementation.
We present in this poster our work on a User Interface Markup Language (UIML) vocabulary for the specification of device- and modality-independent user interfaces. The work presented here is part of an application-oriented project. One of the results of the project is a prototype implementation of a generic platform for device-independent multimodal mobile applications. The poster presents the requirements for a generic user interface description format and explains our approach to an integrated description of user interfaces for both graphical and voice modality. A basic overview of the vocabulary structure, its language elements and main features is presented.
In this paper, we address the problem of matching I/O descriptions of services to enable their automatic service composition. Specifically, we develop a method of semantic schema matching and apply it to the API schemas constituting the I/O descriptions of services. The algorithm assures an optimal match of corresponding entities by obtaining a maximum matching in a bipartite graph formed from the attributes.
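To illustrate what obtaining a maximum matching in a bipartite graph of attributes involves, here is a minimal sketch using the standard augmenting-path method. It is an unweighted simplification and an assumption on our part: the abstract does not say which matching algorithm is used or how semantic similarity enters the graph, and the attribute names are invented.

```python
def maximum_matching(compatible):
    """Maximum bipartite matching by augmenting paths.

    compatible: {source_attribute: [target_attribute, ...]} lists, for each
    attribute of one schema, the attributes of the other schema it may match
    (hypothetical input; how candidates are found is out of scope here).
    Returns {source_attribute: target_attribute}.
    """
    matched_source = {}  # target attribute -> source attribute

    def try_assign(source, visited):
        for target in compatible.get(source, []):
            if target in visited:
                continue
            visited.add(target)
            # Take a free target, or re-route its current owner elsewhere.
            if target not in matched_source or try_assign(matched_source[target], visited):
                matched_source[target] = source
                return True
        return False

    for source in compatible:
        try_assign(source, set())
    return {source: target for target, source in matched_source.items()}


candidates = {"zip": ["code", "postalCode"], "id": ["code"]}
print(maximum_matching(candidates))
```

In the example, "zip" first takes "code" but is re-routed to "postalCode" so that "id" can also be matched, which is the behaviour a greedy one-pass assignment would miss.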
As web service technology matures, there is growing interest in exploiting workflow techniques to coordinate web services. Bioinformaticians are a user community who combine web resources to perform in silico experiments. These users are scientists and not information technology experts; they require workflow solutions that have a low cost of entry for service users and providers. Problems satisfying these requirements with current techniques led to the development of the Simple conceptual unified flow language (Scufl). Scufl is supported by the Freefluo enactment engine [1], and the Taverna editing workbench [3]. The extensibility of Scufl, supported by these tools, means that workflows coordinating web services can be matched to how users view their problems. The Taverna workbench exploits the web to keep Scufl simple by retrieving detail from URIs when required, and by scavenging the web for services. Scufl and its tools are not bioinformatics specific. They can be exploited by other communities who require user-driven composition and execution of workflows coordinating web resources.