FOXY
User's Guide
Vienna, February 2006
This ''User's Guide'' is an extract (i.e. a selection of chapters) of the Master's
thesis ([1]) written in the course of this project.
This thesis mainly discusses the problem of mobile web access.
Consequently, this guide focuses on transformation tasks for
mobile clients.
Years ago, when the Internet was a network used almost exclusively
by academics and the military on their desktop PCs, its
main purpose was the fast (and hopefully secure) exchange of information.
Nowadays, this medium has entered the households
of millions of people, and new (wireless)
non-PC devices with web access have become very popular.
The focus, however, has remained on the exchange of information.
One of the most important remaining challenges is to present
the information in the World Wide Web in such a way
that anyone can use it anytime, anywhere.
The language of the web, HTML, the HyperText Markup Language[2],
has revealed some disadvantages, mostly because it has been
developed from a view-oriented rather than a content-oriented point of view,
and because of its ''loose'' standard (which allows malformed markup).
This, and the lack of semantic information (exception: meta-tags),
makes HTML-pages difficult to process further
for non-human users.
Additionally, these documents were meant to be interpreted by browsers
running on desktop-PCs, usually equipped with large screens and
''mighty'' input devices such as mice or keyboards.
Therefore, HTML is not a device-independent language:
if one wishes to display web pages (written in HTML) properly on a wireless
device, she will either have to enhance the mobile browser (which
is often difficult, due to the possible lack of resources),
invent a completely new language for the Wireless World Wide Web
(such as the Wireless Markup Language, WML[3]) and redefine
the existing pages in this language (so that essential functions may still
be available to a mobile user), or (and this is the core of our work)
convert these documents on their way from the web server to the
client's user-agent (a browser, in most cases).
Today, the Internet no longer has only one language. Much
effort has been made to improve HTML; e.g., mostly to overcome
problems caused by malformed markup, XHTML[4], the
eXtensible HyperText Markup Language,
a redefinition of HTML in XML, has been introduced.
XHTML was a first step, but it was still insufficient.
There was general consensus that only a clear separation of
content and style will meet the needs of all the new
stakeholders and techniques involved - be it automated
extraction of information or delivering this information
to new devices or users.
The World Wide Web Consortium's (W3C) eXtensible Markup Language
(XML[5] - a strict subset of SGML[6],
the Standard Generalized Markup Language) is a
meta-language for defining new markup languages.
In recent years it has proved to be the right language not only for
the definition of standard device languages and
formats (e.g., XHTML, WML...), but also for creating all
kinds of ''self-describing'' data.
Furthermore, in combination with XSL ([7]), a language for
transforming (XSLT) and formatting (XSL-FO) objects, we have now
theoretically achieved the desired separation of meaning
and view.
So, with the help of XML and XSLT it is now possible to create
different views of the same underlying content:
After applying different stylesheets (e.g., one for XHTML-output,
and a WML-version for WAP-browsers) to one and the same
XML-data-file, we obtain several ''viewable'' versions of the (XML-)data-document -
e.g. an XHTML-version for the web and a WML-page for (older) mobile browsers.
This ensures homogeneity, although documents will still be
published under different public (HTTP[8]) addresses (e.g. ''www.mydoc.net''
and ''wap.mydoc.net'').
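The idea of several views on one XML source can be sketched with the standard javax.xml.transform API. The following is only a minimal illustration: the sample document, the two stylesheets and the class name TwoViews are assumptions made for this example and are not part of FOXY.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: one XML data document, two stylesheets, two views.
final class TwoViews {
    static final String DATA =
        "<note><title>Hello</title><body>World</body></note>";

    // Stylesheet producing an XHTML-like view (assumed for illustration).
    static final String TO_XHTML =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/note'>"
      + "<html><body><h1><xsl:value-of select='title'/></h1>"
      + "<p><xsl:value-of select='body'/></p></body></html>"
      + "</xsl:template></xsl:stylesheet>";

    // Stylesheet producing a WML-like view of the same data (assumed for illustration).
    static final String TO_WML =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/note'>"
      + "<wml><card title='{title}'><p><xsl:value-of select='body'/></p></card></wml>"
      + "</xsl:template></xsl:stylesheet>";

    // Applies the given stylesheet to the given XML text and returns the result.
    static String apply(String stylesheet, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(apply(TO_XHTML, DATA));
        System.out.println(apply(TO_WML, DATA));
    }
}
```

Both outputs are derived from the same data document; only the stylesheet differs.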
For many legacy web pages
which have not been created from an XML-source-file,
however, a redesign from scratch (either in the mentioned way or
in a ''mobile'' language, such as WML) did not seem feasible.
Therefore, the task of converting these documents
''on demand'' became more and more important.
The main idea behind this project was to dynamically transform
HTML-pages to whatever format the requesting client (browser) is
capable of understanding.
While we were originally looking for a new way of delivering web content
to wireless devices, i.e., a transparent, flexible and
extensible way to transform documents written in
HTML to various output-formats (e.g. XHTML or WML), we found a solution
that satisfied more than just our main aim.
It has also proved to be suitable for (at least some of)
the requirements of many other popular questions in the emerging
field of web engineering.
In the course of our work, we defined (in XML Schema[9])
two new (XML-)languages:
The first language's purpose is the inspection of HTTP-requests (what kind
of device is used, which page is requested, ...) and the definition
of what should be done (i.e., whether and which transformation-rules should be applied
to the corresponding HTTP-response from the web server) if a
web request meets certain criteria.
The other language is used for specifying how to do this transformation.
It not only offers possibilities to import or implement XSL- and XQuery-scripts;
simpler search-replace- and more complex page-splitting-functionalities are
also implemented in this language.
For ''common'' transformations (e.g. converting an HTML-form to a WML-form, extracting
links, and many more) we created several predefined stylesheets, so one
may not have to deal with XSL or XQuery on her own.
This layer of abstraction - and the fact that we kept those (configuration-)languages
self-explanatory and simple but quite ''mighty'' - not only reduces
the effort of developing graphical configuration-user-interfaces (either online or realized as a browser-plug-in),
it also provides a ''self-explanatory''
way to convert web content on demand.
FOXY - A Prototype Implementation
FOXY (Filtering prOXY) works as an intermediary between (not necessarily)
wireless clients and HTTP servers.
It converts existing web content dynamically on demand
and presents it in the most appropriate form to the (mobile) user.
Aims
This section gives a brief overview of our aims regarding
technique and functionality of the prototype.
One of the main ideas behind this project was to create a
solution that summarizes the most promising approaches to device-independent
web engineering.
We focused on the conversion of text-based content,
although media-data-transformation should also be possible
in future versions.
For the prototype implementation itself we intended to use only
Open-Source software solutions.
Many existing web-transcoders use ''intelligent''
algorithms to decide how a requested web page should
be transformed, i.e. how it should be displayed on a
small-screened device.
In contrast, we wished to leave this decision to
the user or the transformation-language of her
choice, respectively.
So one may create a special XSL-stylesheet for the transformation
of a particular web-page (this would be a knowledge-rich
approach), but can also implement a (heuristic)
HTML-to-WML transformation algorithm in an XQuery-stylesheet
(which may then be applied to e.g. all pages or a group of websites).
These stylesheets can then be imported/included in the server's
configuration.
In our prototype, all transformation-actions
are referred to as (transformation-)rules.
A rule, in FOXY, may consist of simple text search-and-replace actions, XSL- or
XQuery-transformations, more complex page- and form-splitting
actions, and the modification of HTTP-request- and -response-header fields.
Assuming the user wishes to follow the knowledge-rich approach
(i.e., she does not want the system to decide which pages should
be converted), we had to find an efficient way
for her to ''tell'' the server which pages should be
transformed, i.e., what criteria an incoming HTTP-request has
to meet so that the corresponding HTTP-response data (or the original request itself)
will be modified by one or more transformation-rules.
Therefore, we compare incoming HTTP-requests with so-called
(HTTP-request-)patterns that are defined by the user (in the patterns configuration file).
When an HTTP-request matches a pattern (e.g. host=''www.xyz.com'', path equals ''/''),
a set of transformations (or rules) is applied to the
corresponding response of the web server (or even earlier, to the client's request-header).
For this purpose, we compare (also using regular expressions)
characteristic parts of the HTTP-request, such as server, port, path, parameters and header fields, to
the corresponding values of stored request-patterns.
This makes it possible, for example, to
apply particular transformation-rules (or a group of them) to
all requests with header field ''User-Agent'' containing the word ''WML'',
to all pages on server ''www.example.org'' in URL-path ''/news'',
or to a certain page with URL-parameter ''user'' set to ''john'',
to name just a few.
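The comparison of request parts against a stored pattern can be sketched as follows. This is a simplified illustration and not FOXY's PatternMatcher: the class and method names are hypothetical, the three criteria from the examples above are combined into one pattern for brevity, and the exact semantics assumed for ''matches'' (whole-string regular expression) and ''containsRegExp'' (regular expression found anywhere) are our reading of the comparison types named in the patterns language.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of request-pattern matching; not FOXY's actual classes.
final class RequestPatternSketch {

    // Compares an actual request value with an expected value, using one of
    // the comparison types named in the patterns language.
    static boolean compare(String type, String actual, String expected) {
        if (actual == null) {
            return false; // a missing header field or parameter never matches
        }
        switch (type) {
            case "equals":         return actual.equals(expected);
            case "startsWith":     return actual.startsWith(expected);
            case "endsWith":       return actual.endsWith(expected);
            case "contains":       return actual.contains(expected);
            case "matches":        return Pattern.matches(expected, actual);
            case "containsRegExp": return Pattern.compile(expected).matcher(actual).find();
            default:
                throw new IllegalArgumentException("unknown comparison: " + type);
        }
    }

    // Example pattern: host www.example.org, path below /news,
    // and a User-Agent header field containing "WML".
    // Header names are assumed to be stored in lowercase.
    static boolean matchesExample(String host, String path, Map<String, String> headers) {
        return compare("equals", host, "www.example.org")
            && compare("startsWith", path, "/news")
            && compare("contains", headers.get("user-agent"), "WML");
    }
}
```

A request for ''/news/today'' on ''www.example.org'' sent by a user-agent whose name contains ''WML'' satisfies this example pattern; the same request for ''/sports'' does not.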
Some approaches require special tags embedded
in the original (X)HTML-page to fulfill some transformation-tasks
(e.g. the splitting of a large web page into smaller parts, which may
then be displayed separately - naturally with a simple
navigation-menu - on the mobile device's screen).
This works fine, but usually only if
the user has created the page by herself - for web pages
that have not been ''equipped'' with those special-tags
this static approach is not sufficient.
That is why we decided to implement a dynamic solution for
this issue, although we also offer the possibility to
support more complex transformations with the help of embedded tags.
Most of the discussed approaches
use their own proprietary transformation-language
for defining the transformation-process itself.
As our aim is to convert one Markup-Language into
another, we decided to make use of existing standards
(such as XSL and XQuery) for converting documents defined in
Markup-Languages.
Implementation
In this section, we take a look at FOXY's design concepts
and discuss the realization of essential parts of the server.
General Design Concepts
One of our aims regarding the prototype's implementation was to
keep it as flexible as possible.
Flexibility, in this case, does not only refer to the usage of
the system - we also wanted the implementation to follow a modular approach,
so that essential functional parts may be changed (either by the user herself or by one
of the developers) or replaced (if e.g. a newer or better solution for a certain
task exists) over time, without the need to adapt other parts of the system to these
modifications.
So, our first task was to split our problem (mobile web access) into pieces and
to design an architecture that combines these pieces and mediates between them.
Our prototype implementation basically consists of three main parts:
- A pattern-matching-system that examines incoming HTTP-requests (e.g. host, path, port,
URL-parameters, header fields) and returns the appropriate actions (called ''rules'')
needed to transform the requested content.
- A database of (transformation-)rules (and rule-groups),
i.e. a repository for the definitions of transformation-actions.
As mentioned before, a rule can consist of a simple text-search-replace-action, an
XSL-script, an XQuery-script or it may perform more complex
tasks (e.g., page- or form-splitting).
- The proxy server itself, which listens for incoming requests, forwards
them to the web server and - with the help of the pattern-matcher and the
set of transformation-rules - returns the transformed HTTP-response from the server to the
(mobile) client.
To ensure a high degree of usage-flexibility, request-patterns and transformation-rules
are stored in external files (and can be modified at server's runtime).
The essential functionality of the server's ''filter''-(or transformation-, respectively)
process is realized in these configuration-files.
They are loaded at startup and can be changed and reloaded (separately) at runtime.
Our first candidate regarding the file-format was naturally XML -
it has proven to be the right language-standard for describing data and there are many
free parsers and tools available that make the handling of XML-based data-structures
quite convenient.
For the implementation of the server, we decided to use the Java ([10])
programming language - in recent years it has clearly proven to be a stable
solution for creating object-oriented (and often distributed) network- and
Internet-applications.
Additionally - if we keep the format of our configuration-files in mind -
there is a large variety of HTML- and XML-parsers, XSL- or XQuery-engines, HTML-to-XHTML-converters
and data-binding-utilities available in Java.
In the next section, we describe some other tools and technologies we made use of
to develop a modular and extendable prototype.
Used Technologies and Tools
The (XML-)languages of the configuration files have been
designed with the help of W3C's XML Schema([9]).
XML Schema not only allows the definition of custom data types and subtypes
(including inheritance), it also eases the automated generation of
Java-classes corresponding to these types (and/or elements, respectively).
As our server is implemented in Java and our configuration-files are
stored in XML-format, we needed an efficient way to import and export
the markup data into and out of Java.
The automated creation of representations of XML-data types (and elements, respectively)
in other (programming) languages is referred to as data binding.
In our case, we use Sun's Java Architecture for XML Binding ([11]), JAXB, to
generate Java-classes out of an underlying schema.
JAXB also ensures that XML-files comply with a given schema; hence, there is
no need for an explicit (often time- and performance-consuming) verification of
the configuration-files' format.
The usage of these JAXB-created classes is further explained in the
description of the configuration-package (see section 2.3.2).
For the implementation of the prototype, we used the following
Open-Source ([12]) markup processors:
- The Xalan XSLT-engine for the processing of XSL-stylesheets.
Xalan internally uses the Xerces ([13]) XML-parser.
- The Saxon ([14]) XQuery-processor and XSL-parser for the application of XQuery-stylesheets.
- The conversion from HTML to well-formed X(HT)ML (a process often referred to as ''tidying'')
is done by the JTidy ([15]) HTML-checker and -pretty-printer.
- For the parsing of HTML-data (used in the page-splitting process), we made use of an Open Source
HTML parser ([16]) (with the fancy name ''HTMLParser'').
Most of them are implemented (as singletons) using the popular factory-pattern
and can be set via special parameters given to Java's Virtual Machine
at runtime (see section 2.3.2).
When writing a server in Java, it is in most cases not appropriate to implement
logging-facilities oneself.
There are many free packages available that are intended to support developers
in this task.
Although newer versions of Java (since 1.4.1) include built-in logging-features, we decided
- mostly due to performance reasons - to use an external solution, called log4j.
Log4j, a project maintained by the Apache Software Foundation, is designed
for highly customizable runtime logging without incurring a heavy performance loss.
We chose this logging-package because it has been used in many popular software-projects over the last years and has proved to be very stable and efficient.
This section demonstrates the server's functionality on the basis of a common
connection scenario.
We give a short description of some functionally important classes and
how they interact during the connection- and filtering-process.
More detailed information about the mentioned Java-classes and packages can be found in section
2.3.2.
Figure 2.1 illustrates the stakeholders involved and the
actions that have to be done, whenever a client user-agent requests a web page through
the proxy server.
Figure 2.1: Overview of FOXY
In the first step (1), the client sends an HTTP-request to the proxy server.
An HTTP-request generally consists of one (text-)line containing the request's
method (e.g. GET, HEAD, POST) and its URI, as well as an HTTP-header, which
contains one or more key-value pairs, called HTTP-header fields (see figure 2.2).
Figure 2.2: HTTP-request message format
A simple HTTP/1.1-request (to ''http://www.xyz.com/'')
is shown in figure 2.3 (Note: Header field ''Host'' is mandatory in HTTP/1.1):
Figure 2.3: Common HTTP (1.1) request
When a client connects through a proxy server, however, it
sends the full URI to the proxy (because the proxy needs the server's
address), which is then responsible for
correctly forwarding a ''proper'' HTTP-request (see figure 2.4).
Figure 2.4: HTTP (1.1) request through a proxy server
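The difference between the two request forms can be illustrated with a short sketch that rewrites a proxy-style request line (full URI) into the form a web server expects (path plus ''Host'' header field). The class and method names are hypothetical; this is not FOXY's connection code, and all other header fields are ignored here.

```java
import java.net.URI;

// Hypothetical sketch: rewrite "GET http://host/path HTTP/1.1" for the web server.
final class ProxyRequestLine {

    // Returns the request line in origin form, followed by the Host header field.
    static String toOriginForm(String proxyRequestLine) {
        String[] parts = proxyRequestLine.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed request line");
        }
        URI uri = URI.create(parts[1]);

        // Path (and query) that is sent to the web server.
        String target = uri.getRawPath();
        if (target == null || target.isEmpty()) {
            target = "/";
        }
        if (uri.getRawQuery() != null) {
            target += "?" + uri.getRawQuery();
        }

        // Host header field; the port is only added if it was given explicitly.
        String host = uri.getHost();
        if (uri.getPort() != -1) {
            host += ":" + uri.getPort();
        }
        return parts[0] + " " + target + " " + parts[2] + "\r\n"
             + "Host: " + host + "\r\n";
    }
}
```

For the request line ''GET http://www.xyz.com/ HTTP/1.1'' the sketch yields the request line ''GET / HTTP/1.1'' followed by the header field ''Host: www.xyz.com''.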
In the next step (2), the server looks for a pattern
that may match the client's request.
A pattern is unique for every host (and port) and
can include several ''sub-patterns'' for different paths
on this host.
To increase granularity and flexibility, pattern-definitions may also contain
simple conditional expressions (see section 2.4.2).
When the incoming request has matched a particular pattern, the
PatternMatcher-component delivers back a set of
transformation-rules (3).
Additionally, the pattern may also include instructions
for modifying the HTTP-request forwarded to the original server.
This is mostly needed if the client asks for content the web server
cannot deliver. In this case, we have to change the ''Content-Type''-header field
to an appropriate value before delegating the request to the web server (4).
All other transformations (except the redirection of requests, naturally)
are done after the HTTP-server has sent back the requested content to
the proxy server (5).
This task involves the application of one or more XSL- or XQuery-stylesheets,
simple search-and-replace operations or the splitting (or re-layouting)
of pages. These transformation-rules may also be combined into so-called
groups.
In the last step (6), the proxy server returns the transformed response-data -
which may include modified header fields - to the client.
The different types of transformations (or -rules, respectively) (I-IV)
are explained in detail in the configuration-section
(see sections 2.4.2 and 2.4.3).
This section focuses on the software-architecture of the FOXY-system.
Although an exhaustive JavaDoc-documentation is included in the
(source-package of the) prototype implementation,
we briefly describe some important packages
and classes in the following.
The source-package of FOXY, including a detailed
JavaDoc-documentation, can be downloaded from
the following link:
http://infosys.tuwien.ac.at/foxy
This site provides various distributions (binary, source)
as well as a short description of the project.
Important Packages
In this section, we take a further look at the Java-packages
of the implementation.
The explanation of the meaning and usage of each
package is followed by a short description of its important classes.
foxy.config
The classes in the foxy.config package have all been automatically
created by Sun's Java Architecture for XML Binding (JAXB).
JAXB creates Java-source-files out of an XML Schema description.
The generated Java classes represent the structure of the Schema definition.
This not only eases the reading and writing of XML-data, it also
ensures that the data is compliant to a certain Schema.
The foxy.config package consists of three sub-packages:
rules, patterns and server - each one responsible for the configuration-data
of the same name.
The classes in the root of these sub-packages are more or less a
''Java-representation'' of the elements described in the corresponding Schema-file.
From outside the package, we only have to deal with these objects (named after
their underlying XML-elements) and do not need to parse the configuration-files
on our own.
Figure 2.5 shows an example for reading in the server's (HTTP-request-)patterns configuration file:
Figure 2.5: Reading an XML-file compliant to the W3C XML Schema of the patterns-configuration file-format
After ''importing'' the patterns-file into Java (a process called unmarshalling),
all data from the XML-file is accessible through the Patterns-structure returned
by the Unmarshaller.
The Patterns-object has only one publicly accessible method, getPattern(),
which returns an instance of the java.util.List-class that contains a list
of Pattern-objects.
This procedure is similar to the synchronization of the rules-configuration data.
Although we could use these objects directly, we decided to wrap the data into
more appropriate data-structures to improve the proxy server's runtime-performance.
In figure 2.6 we create a Hashtable of PatternMatcher-objects (see section 2.3.2) at runtime after the unmarshalling of the Patterns-element.
Figure 2.6: Example: Wrapping JAXB-created Pattern-objects into FOXY's PatternMatcher class
A more detailed description of structure and usage of the configuration-files can
be found in section 2.4.
foxy.proxy
This is the core package of the application.
It contains the server classes
as well as sub-packages for special functionalities (filter, transform, connection)
which are further explained in the following sections.
ProxyServer
This class represents the server-daemon.
It listens at the server's TCP-port and starts
a request-handling thread (called ProxyHandler)
for any incoming connection.
The server is usually started via the foxy.Shell-class
(the only class in the root foxy package), which
provides a text-based interface to view or change server-options
as well as a clean way to shutdown the server.
ProxyHandler
After a connection has been delegated to the ProxyHandler,
this thread takes responsibility for the incoming request.
It first tries to find a pattern-definition according
to the web server's host-name defined in the request.
If a pattern-definition has been found, the PatternMatcher (see below) compares the
request-data to it and - if a match occurred - delivers back a set of transformation-rules.
The handler-thread then contacts the web server (note: the forwarded request to this server can also
be modified by the proxy) and reads in the HTTP-response.
Afterwards, the transformation-process itself is done (mostly) with the help of the transform- and
filter- (sub-)packages.
At the end, the ProxyHandler delivers the ''filtered'' (i.e. modified) response-data back
to the client and stops.
Figure 2.8 shows a simplified diagram of this process.
It is simplified because, for reasons of readability, it contains neither
representations of the filter- and transformer-classes involved in the
transformation-process nor the abstractions of the HTTP-connections (interface ProxyConnection)
provided by the foxy.proxy.connection-package (see section 2.3.2).
PatternMatcher
This class is an enhanced representation of the pattern-element in
the patterns-configuration-file.
A hash table of PatternMatcher-objects (unique for every host)
is stored in the server's environment (class ProxyEnvironment, see section 2.3.2)
to ensure a quick and efficient way to match incoming requests against stored patterns.
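A host-keyed lookup of this kind can be sketched as follows. The sketch is an assumption-laden illustration, not FOXY's implementation: the class MatcherRegistry is hypothetical, a plain String stands in for a PatternMatcher-object, and the order in which exact entries and the wildcards (''*'' for any host, ''-1'' for any port, as used in the patterns configuration file) are tried is our own choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a hash table of matchers keyed by host and port.
final class MatcherRegistry {
    // A String stands in for a PatternMatcher-object in this sketch.
    private final Map<String, String> matchers = new HashMap<>();

    private static String key(String host, int port) {
        return host.toLowerCase() + ":" + port;
    }

    void register(String host, int port, String matcher) {
        matchers.put(key(host, port), matcher);
    }

    // Tries the most specific entry first, then falls back to the wildcards.
    // The precedence order is an assumption made for this sketch.
    String lookup(String host, int port) {
        String[] candidates = {
            key(host, port), key(host, -1), key("*", port), key("*", -1)
        };
        for (String candidate : candidates) {
            String matcher = matchers.get(candidate);
            if (matcher != null) {
                return matcher;
            }
        }
        return null; // no pattern defined for this host: nothing to transform
    }
}
```

Each lookup costs a constant number of hash-table accesses, independent of the number of configured hosts.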
ProxyEnvironment
This class holds important information about the running server.
This involves the definitions of rules and rule-groups (stored in hash tables),
a map of PatternMatcher-objects (see above), information about the
server's status (port, ...) and logger-functionality.
foxy.proxy.connection
This package is responsible for the abstraction of all network-connections of
the server.
It provides a public interface called ProxyConnection and two
implementations of this interface, one dealing with client-side connections (class
ProxyRequestConnection) and one for handling the connection
to the web server (ProxyResponseConnection).
Figure 2.7 shows the relation between these classes/interfaces.
Once the request-/response-data is wrapped in the ProxyConnection-interface
(or its implementing classes, respectively), the content is available as
array of bytes and the header fields are accessible either as hash table
or as pure text.
A HeaderField, in FOXY, is not just a name-value-pair.
To overcome problems caused by name-inconsistencies (e.g. ''User-agent'' or ''User-Agent''),
we additionally store a lowercase version of the header field's name, which acts as
a unique key for the field.
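The idea of a lowercase key can be sketched as follows. The class below is a simplified, hypothetical stand-in and does not reproduce FOXY's HeaderField- or connection-classes.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: header fields stored under a lowercase key.
final class HeaderFields {

    static final class HeaderField {
        final String key;   // lowercase name, used as unique key
        final String name;  // name as received, e.g. "User-agent"
        final String value;

        HeaderField(String name, String value) {
            this.key = name.toLowerCase(Locale.ROOT);
            this.name = name;
            this.value = value;
        }
    }

    private final Map<String, HeaderField> fields = new LinkedHashMap<>();

    // Adds a field or replaces an existing field with the same (case-insensitive) name.
    void set(String name, String value) {
        HeaderField field = new HeaderField(name, value);
        fields.put(field.key, field);
    }

    // Returns the value of the field, regardless of the spelling of its name.
    String get(String name) {
        HeaderField field = fields.get(name.toLowerCase(Locale.ROOT));
        return field == null ? null : field.value;
    }

    // Returns all fields as plain text, one "Name: value" line per field.
    String toText() {
        StringBuilder text = new StringBuilder();
        for (HeaderField field : fields.values()) {
            text.append(field.name).append(": ").append(field.value).append("\r\n");
        }
        return text.toString();
    }
}
```

With this structure, a field stored as ''User-agent'' is also found when it is requested as ''User-Agent''.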
Figure 2.9 shows the usage of the connection-classes during a common
HTTP-connection through FOXY.
Figure 2.7: Main classes/interfaces of package foxy.proxy.connection
Figure 2.8: Establishing a connection through FOXY (simplified)
Figure 2.9: Usage of the connection-classes
foxy.proxy.transform
Although its name may suggest that this package handles
the application of transformation-rules, this process
is done by FOXY's filter-package (see section 2.3.2).
The transformation-package is only responsible for converting
HTML- to well-formed XML-data.
For the implementation of the transformer, we used
the well-known factory design-pattern.
The user may implement and use her own ProxyTransformerFactory
(and her own transformer-implementation, respectively) by
passing the name of the personalized TransformerFactory-implementation to
Java's virtual machine during startup.
This is done by setting the parameter
foxy.proxy.transform.ProxyTransformerFactory
to the name of her own implementation (default value:
foxy.proxy.transform.DefaultTransformerFactory).
The factory class is implemented as a singleton, which ensures that
only one instance of it can exist in the system.
The factory-class contains one public function - newTransformer() - for
the creation of a new ProxyTransformer.
The ProxyTransformer-interface contains only one public method, called html2Xml(..),
which is responsible for converting HTML-data to well-formed XML (or XHTML, respectively).
Its default implementation - DefaultProxyTransformer - uses the JTidy-pretty-printer
to achieve this aim (see section 2.2.2).
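The combination of a singleton factory and a virtual machine parameter can be sketched as follows. Only the parameter name is taken from the text above; the class names, the method signatures and the placeholder transformer (which returns its input unchanged instead of calling JTidy) are assumptions made for this illustration.

```java
// Hypothetical stand-in for the ProxyTransformer-interface.
interface SketchTransformer {
    String html2Xml(String html);
}

// Hypothetical sketch of a singleton factory selected via a system property.
abstract class TransformerFactorySketch {
    static final String PROPERTY = "foxy.proxy.transform.ProxyTransformerFactory";

    private static TransformerFactorySketch instance;

    // Returns the single factory instance; the implementing class is read
    // from the system property once, on first use.
    static synchronized TransformerFactorySketch getInstance() {
        if (instance == null) {
            String className =
                System.getProperty(PROPERTY, DefaultFactorySketch.class.getName());
            try {
                instance = (TransformerFactorySketch)
                    Class.forName(className).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot load factory " + className, e);
            }
        }
        return instance;
    }

    abstract SketchTransformer newTransformer();
}

// Default implementation used when the property is not set.
class DefaultFactorySketch extends TransformerFactorySketch {
    @Override
    SketchTransformer newTransformer() {
        // Placeholder: a real implementation would tidy the HTML here.
        return html -> html;
    }
}
```

A different implementation would be selected at startup with a command line option of the form -Dfoxy.proxy.transform.ProxyTransformerFactory=... followed by the name of the user's own factory class.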
Figure 2.10 shows the relation between the
classes and interfaces of this package.
Figure 2.10: Important classes/interfaces of package foxy.proxy.transform
foxy.proxy.filter
This package contains the implementation of the various
transformation-(or filter-, respectively)actions of
the system.
The meaning and usage of the factory-class defined in this package
(ProxyFilterFactory) is similar to the one in the transformer-package
(see section 2.3.2).
It is also implemented as a singleton-class.
The virtual machine parameter that defines
the class's implementation is named foxy.proxy.filter.ProxyFilterFactory;
its default value is the name of the default filter-factory-class
(foxy.proxy.filter.DefaultFilterFactory).
The ProxyFilter-interface (and its default implementation - DefaultProxyFilter) provides several
functions for converting data, such as filterXSL(..), filterXQuery(..) or filterSearchAndReplace(..),
which return a byte-array of the transformed content.
By overriding the default factory- and filter-class (called DefaultProxyFilter),
one can implement these functionalities on her own.
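What a byte-array-based filter function may look like is sketched below for the search-and-replace case. The parameter list is an assumption (the guide only gives the name filterSearchAndReplace(..)), and the class is a hypothetical stand-in, not the DefaultProxyFilter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a search-and-replace filter working on byte arrays.
final class SearchReplaceSketch {

    // Decodes the content, replaces every literal occurrence of 'search'
    // and returns the re-encoded result. The signature is assumed.
    static byte[] filterSearchAndReplace(byte[] content, Charset charset,
                                         String search, String replacement) {
        String text = new String(content, charset);
        return text.replace(search, replacement).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] in = "<a href=\"http://www.mydoc.net/\">home</a>"
            .getBytes(StandardCharsets.UTF_8);
        byte[] out = filterSearchAndReplace(
            in, StandardCharsets.UTF_8, "http://www.", "http://wap.");
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

Because every filter function takes and returns the content in the same form, several such functions can be applied one after another to the same response.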
The structure of these classes is presented in figure 2.11.
Figure 2.11: Important classes/interfaces of package foxy.proxy.filter
In addition to the common filter-classes, the package contains one class that
supports the completion of more complex transformations, called HTMLNodeVisitor:
HTMLNodeVisitor
This class performs page-splitting and process-partitioning
tasks, respectively. It extends the NodeVisitor-class of the HTMLParser-package (see section 2.2.2),
which can be loosely compared to
a common SAX XML-parser's ContentHandler.
When parsing an HTML-page, the parser decides (by given parameters and
special group- and subgroup-tags embedded in the HTML-code) which
parts of a page are left out.
The remaining HTML-data is then delivered back to the client. When splitting
large documents into smaller pieces, it is also possible to add different headers and
footers to these pieces.
The transformation-rule responsible for this process is called layoutPage
and is described further in the configuration-section (see section 2.4).
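The principle of splitting a document into pieces that each receive a header and a footer can be sketched as follows. This is a strong simplification and not the HTMLNodeVisitor: it assumes that the page has already been divided into blocks of markup, it uses a simple character limit as splitting criterion, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of page-splitting on already separated blocks of markup.
final class PageSplitSketch {

    // Packs consecutive blocks into pieces of at most maxChars characters
    // (a single oversized block forms its own piece) and wraps each piece
    // in the given header and footer.
    static List<String> split(List<String> blocks, int maxChars,
                              String header, String footer) {
        List<String> pieces = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String block : blocks) {
            boolean full = current.length() > 0
                && current.length() + block.length() > maxChars;
            if (full) {
                pieces.add(header + current + footer);
                current.setLength(0);
            }
            current.append(block);
        }
        if (current.length() > 0) {
            pieces.add(header + current + footer);
        }
        return pieces;
    }
}
```

In a real setting the footer would typically carry the navigation links to the previous and next piece.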
Configuration
In this section, we describe how the server's transformation and pattern-matching behavior
can be customized.
FOXY is configured by three XML-files.
In the following, we describe these configuration-files and their usage in detail.
The Server Configuration File
The server configuration file stores general server parameters,
such as listening port or the naming-conventions and the path of the log-files.
Additionally, it contains information about the location
of the two remaining configuration files.
Figure 2.12 shows the general structure of
the server configuration file.
The whole W3C XML Schema definition is shown in figure 2.13.
Figure 2.12: Structure of the server configuration file
Figure 2.13: W3C XML Schema of the server configuration file
Figure 2.14: Example for a common server configuration file
The server configuration file consists of the following elements:
- server: This element is the root element of the server configuration file.
All other elements are children of the server-element.
It provides one attribute (port) describing the server's listening port.
- logFiles: This element stores the location of the log-files-directory
(in our example the log-files are stored in a directory named ''log'',
whose path is relative to the server's home-path).
The attributes prepend and append define text that is
prepended or appended to the name of the log file.
This file is automatically created by the server at startup (its initial name is a time
stamp in the following format: yyyy-MM-dd-HH-mm-ss-SSS).
- patterns and rules: These elements contain information about
the location of the rules- and patterns-configuration files explained in the
following sections (attribute fileName).
An example server configuration file is shown in figure 2.14.
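How a log file name may be composed from the time stamp and the prepend- and append-attributes is sketched below. The class and method names, as well as the sample values ''foxy-'' and ''.log'', are assumptions for this illustration; only the time stamp format is taken from the description above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: composing a log file name from prepend, time stamp and append.
final class LogFileNameSketch {
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss-SSS");

    static String logFileName(String prepend, LocalDateTime startup, String append) {
        return prepend + STAMP.format(startup) + append;
    }

    public static void main(String[] args) {
        // Sample values for prepend and append are assumed.
        System.out.println(logFileName("foxy-", LocalDateTime.now(), ".log"));
    }
}
```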
The language of the patterns configuration file has been designed
for specifying so-called HTTP-request-patterns, against which
incoming requests are matched.
This involves the examination and comparison of characteristic
parts of the request. The patterns-language also allows simple selection constructs (if-then-else).
Figure 2.16 shows a structural representation of this
file's format.
A complete definition of its underlying W3C XML Schema can
be found in the appendix.
The main elements allowed by the W3C Schema definition of the patterns-language are defined as follows:
- patterns: This element is the root element of the patterns configuration file.
It has no attribute and contains a sequence of pattern-elements
(see figure 2.15).
Figure 2.15: W3C XML Schema of the patterns-element
- pattern: The attributes of this element contain information
about the host name and the port of the web server, which are compared to the
corresponding values of the client's HTTP-request in the pattern-matching process.
Wildcards for these attributes that match every host name and every port are ''*'' and ''-1'',
respectively.
The pattern-element consists of a sequence of path-elements,
which provide further examination (and comparison) of the HTTP-request (see
figure 2.17).
Figure 2.16: Structure of the patterns configuration file
Figure 2.17: W3C XML Schema of the pattern-element
- path: The meaning and usage of the attributes of the path-element are
similar to those of the pattern-element, except that this time the path of
the requested URI (instead of the host name) is compared to the match-attribute of the path-element.
Additionally, it provides a second (optional) attribute, named compare, which defines the type of string-comparison that is used when matching the
path of the requested URI against the element's match-attribute.
The path-element has either exactly one child, the action-element (see
figure 2.19), or it contains a sequence
of if-elements followed by a closing else (see figure 2.21).
- action: This element (of type applySequence, see figure 2.19)
contains a sequence of transformation-rules and -rule-groups (or
their names, respectively) that should be applied to an HTTP-response after
host name, port and URI-path have successfully been matched against the
corresponding values of the pattern. This sequence is implemented as a set of text-elements,
named applyGroup and applyRuleGroup, respectively (see figure
2.19).
Possible modifications of the header fields (of the forwarded HTTP-request)
may be specified (child-element setRequestHeaderField) before the definition of
this so-called applySequence.
Alternatively, instead of a sequence of rule-references, the
action-element may consist of only a redirect-element (see
figure 2.20).
This is used for delegating an HTTP-request to a different location. Note that this new location
is then also compared to the available request-patterns, i.e., it is handled in the
same way as a common client-request.
The maximum number of consecutive redirections is limited to three.
Figure 2.18: W3C XML Schema of the path-element
Figure 2.19: W3C XML Schema of the action-element (type applySequence)
Figure 2.20: W3C XML Schema of the redirect-element
- if, then and else: The if-element
allows one to specify particular actions for request-URIs with equal host- and pathnames, depending
on the values of certain header fields and/or URI-parameters.
These fields are compared with the help of the elements headerField and urlParameter,
each of which has a name- and a value-attribute plus an optional compare-attribute that
selects the type of string-comparison (allowed types of comparison are:
equals, startsWith, endsWith, contains, matches and
containsRegExp; see figure 2.21).
The then-element comes into action when the values in the if-clause (i.e., particular
header fields or parameters of the request-URI) have successfully matched.
This element, as well as the else-element and the body of the
if-element, is of the same type as the action-element
(applySequence, see figure 2.19).
Figure 2.21: W3C XML Schema of the if-element
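The matching steps described for the pattern-, path- and if-elements can be sketched as follows. This is a minimal illustration in Python and not FOXY's Java implementation: the dictionary keys (host, port, match, compare), the function names and the default comparison type equals are assumptions made for this example. We also assume that matches requires the whole string to match the regular expression, whereas containsRegExp accepts a match anywhere in the string. Python's re module serves as a stand-in for Java-style regular expressions.

```python
import re


def compare(actual, expected, mode="equals"):
    """Compare two strings using one of the six comparison types."""
    if mode == "equals":
        return actual == expected
    if mode == "startsWith":
        return actual.startswith(expected)
    if mode == "endsWith":
        return actual.endswith(expected)
    if mode == "contains":
        return expected in actual
    if mode == "matches":
        # Assumption: the regular expression must cover the whole string.
        return re.fullmatch(expected, actual) is not None
    if mode == "containsRegExp":
        # Assumption: a match anywhere in the string is sufficient.
        return re.search(expected, actual) is not None
    raise ValueError("unknown comparison type: " + mode)


def pattern_matches(pattern, host, port):
    """Check host and port; '*' and -1 act as wildcards."""
    host_ok = pattern["host"] == "*" or pattern["host"] == host
    port_ok = pattern["port"] == -1 or pattern["port"] == port
    return host_ok and port_ok


def request_matches(pattern, path_rule, host, port, path):
    """A request matches if host, port and URI-path all match."""
    if not pattern_matches(pattern, host, port):
        return False
    # "equals" as the default for a missing compare-attribute is an assumption.
    return compare(path, path_rule["match"], path_rule.get("compare", "equals"))
```

For instance, request_matches({"host": "*", "port": -1}, {"match": "/mail", "compare": "startsWith"}, "www.gmx.com", 80, "/mail/inbox") would accept the request regardless of host and port, because both wildcards are set.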
The rules-Language
Since the rules configuration file is responsible for the definition
of transformation-rules, this language offers possibilities to
import or define stylesheets implemented in transformation-languages
such as XSL or XQuery.
In addition, a rule can contain instructions for modifying the
HTTP-header of the web server's response before delivering it to
the client.
Figure 2.22 shows the general structure of the
rules-language.
Figure 2.22: Structure of the rules configuration file
In the following, we outline the general structure of the rules configuration file.
Then follows a detailed description of important language-elements.
- rules: This element is the root element of the rules configuration file.
It has no attribute and contains a sequence of rule-elements and
ruleGroup-elements (see figure 2.23).
Figure 2.23: W3C XML Schema of the rules-element
- rule:
The rule-element is responsible for defining transformation-actions.
A transformation may be a simple text search-and-replace action, the application
of XSL- or XQuery-stylesheets or the splitting of a large web page or -form.
Therefore it can consist of a searchFor-replaceWith-element,
elements for defining or importing stylesheet-data (named xsl and xquery),
or an element that deals with the splitting and re-layouting of web content, layoutPage.
It may also contain information regarding the response header, i.e. if or how the
header of the HTTP-response should be modified. This is done with the help
of one or more setResponseHeaderField-elements at the beginning of the rule-definition
(see figure 2.24).
Figure 2.24: W3C XML Schema of the rule-element
- ruleGroup:
A ruleGroup is a sequence of references (via the rule's id-attribute)
to (previously defined) rule-elements (see figure 2.25).
With the help of this element, several common rules can be grouped and subsequently be
identified and accessed by one unique id.
Figure 2.25: W3C XML Schema of the ruleGroup-element
- searchFor and replaceWith:
This rule-type deals with simple text search-and-replace transformations.
It consists of two text-elements, called searchFor and replaceWith.
The first one defines the text to be searched for, while the latter holds the text that replaces
every match of the search-string. Common regular expressions (Java-style, similar to Perl-regular-expressions)
are accepted.
- xquery and xsl:
These elements are responsible for the definition and the import of transformation-stylesheets.
Both are of type xscript (see figure 2.26).
Because HTML-content has to be converted into well-formed XML before any stylesheet can be applied,
the xscript-data type provides one attribute (preTransform) to tell the system whether
the underlying data has to be ''pretty-printed'' (or ''tidied'') before the transformation process.
The body of the xsl- and xquery-elements consists of one text-element: either import, which holds the name of an external script-file, or script, which allows
a direct in-place definition of the stylesheet in the text-content of this element (as CDATA).
Figure 2.26 shows the W3C XML Schema of the xscript-element.
Figure 2.26: W3C XML Schema of the xscript-data type (used by elements xsl, xquery and layoutPage)
- layoutPage:
This element is responsible for splitting web pages and -forms and/or adding header- or footer-data to
particular pages.
A large page is split into smaller pieces by collecting its ''interesting'' parts in groups and subgroups.
This grouping is done by adding special group- and subgroup-tags (named foxy:group and
foxy:subgroup) to the (existing) web page. This may be done dynamically (by adding the tags via a preceding
xsl- or xquery-rule) or statically (e.g., when you are the owner of the page). A unique ID is assigned
to each group in order of appearance in the original (HTML-)content. Each subgroup has an ID that is unique within its group
(i.e., the first subgroup in each group has ID=1). All documents transformed via layoutPage accept the following
URI/URL-parameters:
- ui - allowed values: -1(all), 0(none), 1, 2..
- sg - allowed values: same as ui
- reset - any value allowed, same as ui=-1 & sg=-1
The layoutPage-element has five (integer-)attributes, holding the values for
the first and last group-IDs and subgroup-IDs (default for all: -1) and one
for defining the default group that should be displayed, when no group-selecting parameter
has been provided:
- firstGroup
- lastGroup
- firstSubgroup
- lastSubgroup
- defaultGroup
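How the URI-parameters and the defaultGroup-attribute interact can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not FOXY's implementation: we assume that ui selects the group and sg the subgroup, and that a missing sg means ''all subgroups''; the function name resolve_selection and the example URLs are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

ALL = -1   # parameter value meaning "all"
NONE = 0   # parameter value meaning "none"


def resolve_selection(url, default_group=-1):
    """Return the (group, subgroup) pair selected by a request-URL."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if "reset" in query:
        # Any value is allowed; reset behaves like ui=-1 & sg=-1.
        return ALL, ALL
    # Without a group-selecting parameter, the default group is displayed.
    group = int(query["ui"][0]) if "ui" in query else default_group
    # Assumption: a missing sg-parameter selects all subgroups.
    subgroup = int(query["sg"][0]) if "sg" in query else ALL
    return group, subgroup
```

With defaultGroup set to 1, a request without parameters would thus display the first group, while a request carrying reset would display everything.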
Possible (optional) child-nodes of this element are defineTag (multiple times), header and
footer.
The header- and footer-elements are prepended and appended, respectively, to every displayed page-fragment (either a
group or a subgroup) and may help with adding dynamic context to the transformation.
The header- and footer-scripts are either defined directly (via a script-element) or imported
from a file (similar to the xsl- and xquery-rules, see figure 2.26).
Furthermore, the user can define her own tags, e.g., to give headers of different groups a different look.
This is done with the help of special-tags, which are replaced at runtime with their appropriate value.
These special-tags are:
- @currentPage
- @currentGroup
- @currentSubgroup
- @previousGroup
- @nextGroup
- @previousSubgroup
- @nextSubgroup
- @firstGroup
- @lastGroup
- @firstSubgroup
- @lastSubgroup
Figure 2.27: W3C XML Schema of the layoutPage-element
The defineTag-element, whose purpose is the definition of special tags,
has the following attributes:
- name - The name of the tag
- value - The value of the tag
- source - The group, in which the original tag-definition (in the
HTML code of the header- or footer-element)
should be replaced with the specified value (of the value-attribute).
A self-defined tag has the following structure:
We describe the usage and meaning of this tag with the help of a short example:
Assuming one has split a web page into groups, it may be convenient to have
a different header on the first page of the group (e.g., a title- or welcome-text).
We define our ''dynamic'' tag as follows:
The above definition tells the system to replace every @welcome-tag found in the
HTML-content with the text ''Welcome to the first page'', but only if the displayed
source-page/group is the first one in the sequence of groups (attribute source=''1'').
On all other pages, the default text is used. This text is defined directly in the HTML-code.
The @welcome-tag may be implemented as follows (e.g., in the HTML-code of the
header- or footer-element):
As a result, whenever one views the first page/group, the customized
welcome-text (i.e., ''Welcome to the first page'') is shown.
In all other cases the default-text (e.g., ''Group 4 of 8'') will be displayed.
Figure 2.27 shows the W3C XML Schema of the layoutPage-element.
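The replacement logic of the welcome-example can be sketched as follows. This is a minimal illustration in Python, not FOXY's implementation and not its tag syntax: the function names, the representation of a defineTag-element as a dictionary and the way the default text is passed in are assumptions made for this example. Composing the default text ''Group 4 of 8'' from the special-tags @currentGroup and @lastGroup is likewise only one conceivable use.

```python
def expand_special_tags(template, values):
    """Replace special-tags such as @currentGroup by their runtime values."""
    # Longer names are replaced first, so a tag is never cut short by
    # another tag that happens to be a prefix of its name.
    for name in sorted(values, key=len, reverse=True):
        template = template.replace("@" + name, str(values[name]))
    return template


def resolve_tag(name, current_group, define_tags, default_text):
    """Return the defined value if the tag applies to the displayed group."""
    for tag in define_tags:
        if tag["name"] == name and tag["source"] == current_group:
            return tag["value"]
    # On all other groups the default text from the HTML-code is used.
    return default_text


# Hypothetical in-memory form of one defineTag-element.
define_tags = [
    {"name": "welcome", "value": "Welcome to the first page", "source": 1},
]

default_text = expand_special_tags(
    "Group @currentGroup of @lastGroup",
    {"currentGroup": 4, "lastGroup": 8},
)
```

Calling resolve_tag("welcome", 1, define_tags, default_text) yields the customized welcome-text, whereas any other group number falls back to the default text.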
Evaluation
This chapter analyzes the prototype implementation of the
FOXY system and the concepts of several transformation-techniques
in practice.
Case Study
In this section, we describe how FOXY can be used to create a simple mobile version
of the popular GMX free e-mail service web site (i.e., http://www.gmx.com). The
aim was to extract some content from the site that was of interest for mobile
access.
Figure 3.1 shows the original web page.
In this figure, the parts that are interesting are marked with enclosing rectangles.
These include three menus (products, themes and shopping - on the left side of the page)
and the login form of the mail service.
Figure 3.1: The original version of the GMX web site (http://www.gmx.com)
When this page is requested with a common desktop-browser, only the chosen menus and
the login-form should be displayed to the user (presented in an HTML-table),
so one does not have to deal with all the news- and advertising-content of
the original page.
When accessed by a WAP-enabled mobile phone, however, only the site's login form should be delivered
to the mobile user.
Prerequisites
The transformation process starts by first manually inspecting the
HTML code of the desired web page.
Figure 3.2 depicts part of the HTML source
code of the GMX web page that is of interest. Note that the menu content (i.e.,
''Produkte'' - the German word for ''products'') is enclosed by HTML list elements (i.e., <li>), each with a unique class
attribute that specifies the style of the menu. The login form is implemented
using HTML form elements (i.e., <form>).
After the elements and patterns in the HTML source code have been identified,
HTTP-request patterns can be created and fed into FOXY to initiate the
transformation and adaptation action whenever the web page is requested by a
specific type of user-agent.
Figure 3.2: The HTML source code of the GMX page to be adapted
Figure 3.3 shows the HTTP-request pattern specification for the main GMX
web page. Note that the web server and the port of interest are specified on line
2 and that the URL is specified on line 3 (in our example, the pattern is
only valid for the start-page of the web site). If desired, wildcards can be used for the
specification of URL-patterns.
The pattern specification in Figure 3.3 tells FOXY to apply the rule group
(i.e., a collection of transformation rules) gmx.mobile whenever a WAP-enabled
client accesses the main GMX page (see lines 4-11). To detect such a client, the User-Agent HTTP-header
field is checked (by a regular expression) to see if it contains either wap, WAP, CLDC, MIDP
or MMP. If, however, the User-Agent field does not match, FOXY will
apply the transformation rule group named gmx.browser (line 13).
The header field's name is compared case-insensitively (i.e., ''user-agent'' and ''User-Agent''
are handled equally).
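The selection between the two rule groups can be sketched as follows. This is a minimal illustration in Python, not the pattern specification of Figure 3.3: the regular expression is assembled from the five tokens named in the text and need not equal the one used in the actual configuration, and the function names as well as the User-Agent strings in the usage note are hypothetical.

```python
import re

# Assumption: one alternation over the tokens named in the text.
MOBILE_USER_AGENT = re.compile(r"wap|WAP|CLDC|MIDP|MMP")


def get_header(headers, name):
    """Look up a header field; the field name is compared case-insensitively."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value
    return None


def select_rule_group(headers):
    """Return the rule group to apply for the given request headers."""
    user_agent = get_header(headers, "User-Agent") or ""
    if MOBILE_USER_AGENT.search(user_agent):
        return "gmx.mobile"
    return "gmx.browser"
```

A request carrying the header user-agent: Nokia6230/2.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 would be mapped to gmx.mobile, while a desktop browser (or a request without the header) would receive gmx.browser.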
Figure 3.3: The HTTP-request pattern for the URL http://www.gmx.com/
Figure 3.6 shows the transformation and adaptation rules that
are applied to the GMX web page whenever a WAP-enabled client accesses it.
A rule group called gmx.mobile has been defined (see line 32) and it
specifies two rules: removeEntities and extractLoginFormAndMenu.
Note that the removeEntities transformation rule (lines 2-4) contains a
searchFor and a replaceWith element. These elements provide text-based
search-and-replace functionality that allows the usage of regular expressions.
In our example, we need it to cut out XML-entities (&..;) that cannot
be processed by the Saxon XQuery processor.
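The effect of such a searchFor-replaceWith rule can be sketched as follows. This is a minimal illustration in Python, not the rule of Figure 3.6: the regular expression, which covers named entities such as &nbsp; only, is a hypothetical stand-in for the searchFor-text, the empty string plays the role of the replaceWith-text, and Python's re module replaces the Java-style regular expressions used by FOXY.

```python
import re

# Hypothetical searchFor-expression: an ampersand, a name, a semicolon.
ENTITY = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def remove_entities(html):
    """Cut out every named entity from the given text."""
    return ENTITY.sub("", html)
```

Note that a pattern this simple also removes predefined XML entities such as &amp;, so a production rule may have to be more selective.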
The transformation rule called extractLoginFormAndMenus (line 5) is
defined by an xquery element. This element has one optional attribute named
preTransform (line 6). By setting it to true (which is the default value), the input
data is then considered to be not well-formed XML (i.e., HTML) and will be converted to (well-formed) XHTML content (i.e., a process often called tidying) before the XQuery stylesheet is applied.
As we did not want to implement the XQuery script directly in the rule-body, we use the
import element, which allows the specification of an external
XQuery stylesheet file (i.e., gmx2html-table.xql, line 7) to be imported.
The complete listing of this script can be found in the appendix (see figure A.3).
It is similar to the script used for content adaptation for mobile clients,
but it does not only deliver the login form.
In addition, it shows some ''important'' menus and produces HTML-code instead of WML-output.
The extractLoginForm-transformation-rule is also implemented as an XQuery-script.
This time, the xquery element has one child named script
(lines 13-28). This element indicates that the XQuery script
is implemented directly in the rule database.
The XQuery-code (lines 15-26) is quite straightforward: First, an HTML-form with the name ''login''
is extracted. Then it is wrapped into a WML card element and finally presented
as a WML-document (<!DOCTYPE wml ...>).
Because WAP browsers do check the Content-Type header field and will produce
an error message whenever HTML-content is detected, it is required to change the value
of this field to indicate that WML-content is delivered (i.e., text/vnd.wap.wml, line 11).
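The effect of a setResponseHeaderField-element on the response can be sketched as follows. This is a minimal illustration in Python, not FOXY's implementation: the function name and the representation of the header as a dictionary are assumptions made for this example.

```python
def set_response_header(headers, name, value):
    """Return a copy of the headers in which the given field is set."""
    # Drop an existing field of the same name, whatever its letter case,
    # so the response never carries two contradicting Content-Type values.
    updated = {k: v for k, v in headers.items() if k.lower() != name.lower()}
    updated[name] = value
    return updated


response_headers = set_response_header(
    {"content-type": "text/html", "Server": "example"},
    "Content-Type",
    "text/vnd.wap.wml",
)
```

After this step the WAP browser sees a single Content-Type field announcing WML-content, which matches the output of the transformation.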
Figures 3.4 and 3.5 show screenshots of the transformations
as seen on a traditional browser and a WAP phone, respectively.
Suppose that the information being extracted from the web page is large
and needs to be split over a number of smaller pages. In this case, the splitting
elements foxy:group and foxy:subgroup are ''inserted'' into the extracted
content by means of XQuery or XSL instructions. The layoutPage-element can then be
used within the rule implementations to browse between the resulting page splits.
Note that in web sites which use a common layout (i.e., corporate identity),
FOXY is especially effective because the same HTTP-request pattern and transformation
rules can be applied to a large number of web pages.
Figure 3.4: The result of the transformation as seen on a traditional PC browser (i.e., transformation gmx.browser)
We demonstrated how FOXY can be used to create different
versions of the same web page, depending on the connecting user-agent.
For the extraction of predefined pieces of the page and the creation
of different output-formats, we used the XQuery transformation language.
Figure 3.5: The result of the transformation as seen on a WAP phone (i.e., transformation gmx.mobile)
Figure 3.6: Snippet of the transformation and adaptation rules to apply to the GMX web page
Conclusion and Future Work
In the example presented in the previous section,
we demonstrated how FOXY can be used to dynamically create different
versions of the same web page, depending on, for example,
the connecting client user-agent.
In this chapter, we discuss the assets and drawbacks of our system,
focusing on flexibility, extensibility and maintenance costs
(including learning efforts).
Flexibility
One of the main aims of our prototype implementation was to
combine the advantages of generally accepted approaches.
Some of the solutions we investigated offered possibilities to
create static profiles for each device and content, some
concentrated on only one particular (mobile) output-format (e.g., CHTML or WML),
while others (the earlier ones) provided only HTML-to-HTML conversion.
We decided to make our system ''content-type independent'', i.e., we
handle all text data in the same way and only differentiate between
well-formed XML (e.g., XHTML) and content that is not well-formed XML (HTML, in most cases).
Conversions from one content-type to another are done by stylesheets defined
in standard transformation-languages, notably XSL and XQuery.
To avoid inconsistencies between the content-type and the corresponding
field of the HTTP-header, we provide facilities to change
the header fields of the client's request and the server's response,
respectively.
This ensures a high degree of flexibility, although the efforts
of maintaining the system include the knowledge of at least
one transformation-language and the basics of the HTTP-protocol.
These issues are further discussed in the next section (see section 4.2).
Furthermore, FOXY offers the possibility to change or
customize certain parts of the system, e.g., to use one's own implementations of
transformation-, filter-, or stylesheet-processing engines.
For instance, one may use a different XSL-processor than Xalan or replace the default
XQuery engine (Saxon) with one that is more suitable for her needs.
Complexity
The complexity of FOXY mostly lies in the required knowledge
of XSL or XQuery, respectively; both languages are quite difficult to
learn for someone who is not familiar with transformation- or
programming-languages at all.
In contrast, FOXY's rules- and patterns-languages are kept
very simple and therefore relatively easy to understand.
One possibility to reduce complexity is to create a new layer
of abstraction by implementing generalized stylesheets for common transformation-tasks,
e.g., the extraction of all forms of a web-page, the conversion from HTML- to WML-forms,
or even the (assumed) correct representation of a web-page
in WML (this requires the implementation of an ''intelligent''
algorithm in a stylesheet-language).
So, a non-expert user may just import and make use of these stylesheet-''templates''
whenever creating new transformation-rules.
An additional way to reduce configuration efforts is the implementation
of a graphical user interface (GUI) for the configuration files, an
enhancement which may be realized in the future (see section 4.4).
This GUI can also include functionalities for selecting important parts
of web pages for later transformation (see section
3.1.2), hence, not even basic knowledge of HTML may
be required to perform transformation tasks in the future.
In our prototype implementation, we did not primarily focus on the runtime performance
of the proxy server itself (although it was fully sufficient). Once the solution
has turned out to be stable, we plan to implement caching functionality,
which should noticeably reduce performance costs.
The HTML-tidying process takes a noticeable amount of time (only tested with JTidy),
while XSL- and XQuery-processing is usually quite fast.
A detailed analysis of the proxy server's performance, including
a comparison between different HTML-pretty-printers and stylesheet-processors,
will be presented in the future, together with the implementation of the caching functionality
(see section 4.4.2).
Future Work
Graphical User Interface
In the coming years, we plan to implement several graphical user interfaces (GUIs)
for FOXY, be it to ease configuration (and therefore usability)
or to support content extraction tasks.
The structure of the XML configuration files and their
consistent representation in Java (see section 2.2.2)
strongly supports the development of configuration-GUIs.
This graphical configuration utility should especially
be designed to meet the needs of non-expert users.
Additionally, to assist the user in the task of selecting important parts of
a web-page, a browser-like GUI that offers facilities
for selecting HTML-content in real time can be implemented.
(This GUI could be combined with the mentioned configuration user-interface.)
It has to provide the user with ways to graphically ''pick''
the parts (of a web page) of her liking directly in the browser window.
For example, Firefox ([17]), Mozilla's ([18]) popular web browser,
can be extended with add-ons.
These add-ons - which add new functionality to the browser - are
realized with the help of XUL ([19]), the XML User Interface Language.
This seems an efficient way to combine the browser's graphical capabilities
with FOXY's extraction functionality.
Another possible user interface may be created together
with the implementation of specialized pre- or post-processing
modules, needed for automated content (or information, respectively)
extraction tasks (see section 4.4.5).
Caching
We decided to leave out caching functionality in the
prototype version.
Although the caching of recently requested pages (or also
pre-caching of popular ones) should largely increase
the server's runtime performance, we found that even without
it our system produces quite acceptable request- and
response-times.
In the future, we plan to implement the caching functionality as a separate
module that may be ''plugged in'' the server when needed.
Security
When speaking about security, we do not primarily think about the implementation
of authorization-facilities or the detection and prevention of (remote) attacks;
rather, we focus on the realization of HTTPS-functionality.
With the prototype it is not possible to access secure web pages/sites,
which reduces the application domain and therefore the flexibility
of the proxy server.
Content Extraction
Combined with an appropriate user interface, FOXY can easily be
used by home users to customize their view of the
World Wide Web. A user interface that supports this functionality
is further described in section 4.4.1.
FOXY can be used for the extraction of interesting
parts of web pages. In the future, it could be combined with a (specialized) crawler
and/or a post-processor to fulfill automated information extraction tasks.
The crawler, for example, may request all web pages through
the FOXY proxy server.
FOXY could then be configured to extract all (and only) forms of
all web pages requested through it to reduce the parsing and extraction
efforts of a possible post-processor, whose purpose is the interpretation
of the gained information.
The configuration of the crawler, FOXY, and the post-processor may be combined in
a graphical user interface (see section 4.4.1).
Figure A.1: W3C XML Schema of the patterns configuration file
Figure A.2: W3C XML Schema of the rules configuration file
Figure A.3: The XQuery-stylesheet needed to extract menus and login-form of http://www.gmx.com/
(file gmx2html-table.xql)
[1] Viktor Moser. FOXY - A Proxy for Mobile Web Access. Master's thesis, Technical University of Vienna, February 2006.
[2] Dave Raggett, Arnaud Le Hors, Ian Jacobs. HTML 4.01 Reference Specification, W3C Recommendation. http://www.w3.org/TR/REC-html401/, December 1999.
[3] WAP Forum. Wireless Markup Language Specification. http://www.wapforum.org/what/technical.htm, February 1998.
[4] W3C, World-Wide Web Consortium. XHTML 1.1 The Extensible HyperText Markup Language, W3C Recommendation. http://www.w3.org/TR/xhtml11/, May 2001.
[5] W3C, World-Wide Web Consortium. Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/1998/REC-xml-19980210, February 2000.
[6] International Organization for Standardization. ISO 8879: Standard Generalized Markup Language (SGML). http://www.w3.org/MarkUp/SGML/, 1986.
[7] W3C. eXtensible Stylesheet Language 1.0. http://www.w3.org/TR/xsl/, January 2000.
[8] W3C, World-Wide Web Consortium. RFC2616, Hypertext Transfer Protocol - HTTP/1.1. http://www.faqs.org/rfcs/rfc2616.html, June 1999.
[9] W3C, World-Wide Web Consortium. XML Schema. http://www.w3.org/XML/Schema, May 2001.
[10] Sun Microsystems. The Java programming language. http://java.sun.com.
[11] Sun Microsystems. Java Architecture for XML Binding (JAXB). http://java.sun.com/webservices/jaxb/.
[12] OSI - The Open Source Initiative. The Open Source Definition. http://www.opensource.org/docs/definition.php, 1997.
[13] The Apache Software Foundation. Xerces-J, Java XML Parser. http://xerces.apache.org/xerces-j/, 2005.
[14] Michael H. Kay. SAXON, XSLT and XQuery Processor. http://saxon.sourceforge.net/, 2005.
[15] Fabrizio Giustina et al. JTidy, HTML syntax checker and pretty printer. http://jtidy.sourceforge.net/team-list.html, 2006.
[16] Somik Raha, Derrick Oswald et al. http://htmlparser.sourceforge.net/, 2005.
[17] The Mozilla Project. The Firefox Web Browser. http://www.mozilla.com/firefox/.
[18] The Mozilla Project. The Mozilla Project. http://www.mozilla.org, 2005.
[19] The Mozilla Project. XML User Interface Language (XUL). http://www.mozilla.org/projects/xul/.
Footnotes
- ... data-binding-utilities2.1
- The term data-binding in this case refers to
the binding of an XML Schema to a representation in Java code.
root
2006-05-22