Programming Concepts & Challenges
Introduction
Earlier chapters have focused on the concepts and components that form the basis of grid environments. This chapter will
provide additional depth on how these can be leveraged in grid-enabled applications, and the challenges inherent in the
programming of such applications.
The advocates of grid computing promise a world where large, shared, scientific research instruments, experimental data,
numerical simulations, analysis tools, research and development platforms, as well as people, are closely coordinated
and integrated in 'virtual organizations'. Still, relatively few grid-enabled applications exist that exploit the full
potential of grid environments. This may be largely attributed to the difficulties faced by application developers in
trying to master the complex interplay of the various components, including resource reservation, security, accounting,
and communication. Moreover, typical grid middleware systems (e.g., Globus [1], Condor [2], and Unicore [3]) provide
relatively complex programming interfaces and are still under active development, so substantially revised software
releases appear frequently.
Dealing with complex and changing programming interfaces is difficult in and of itself and partially
responsible for the fact that few applications have been grid-enabled. An additional aspect of the problem is that
we are still learning how applications in general can benefit from running on a grid and the best ways to optimize
individual applications to take maximum advantage of the grid environment. Unlike homogeneous parallel machines or
clusters, grid environments are heterogeneous and dynamic in nature, and subject to change at various levels:
-
on the hardware level, where the application programmer has to deal with different computer architectures, chipsets,
execution speeds and models,
-
on the software level, including different operating systems (and versions), different compilers, inhomogeneous software
environments, etc., and
-
on the administrative level, where the programmer faces various and incompatible administrative policies between
different grid resources.
The current grid scenario consists of:
- services (and interfaces) that are upgraded on a regular basis
- institutions (i.e. resources, services, applications) that join and leave a grid without much notice
- changes in the application environment at run time, including services that go down without warning, resources that
get busy or become available without notice, and fluctuations in the capacity of available storage.
In grid environments, conditions change constantly, and at a far greater rate than in situations where activities are
controlled under a single administrative domain. Today's grid middleware allows you to cope with these changes, but
addressing them in the most effective way can be a major programming effort. In the end, a grid-enabled application
requires additional code for handling transient problems, and portions of the application code can require very
frequent maintenance.
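As an illustration of the "additional code for handling transient problems" mentioned above, the following minimal Python sketch retries an operation with a growing delay. The names TransientGridError, call_with_retry, and flaky_submit are hypothetical and do not belong to any grid toolkit; a real application would catch whatever exceptions its middleware actually raises.

```python
import time


class TransientGridError(Exception):
    """Hypothetical error type for failures that may succeed on retry."""


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a transient failure, wait and try again.

    The delay doubles after each failed attempt (exponential backoff).
    The last failure is re-raised so that the caller can react to it.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientGridError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Usage: a stand-in service call that fails twice before succeeding.
calls = {"count": 0}


def flaky_submit():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientGridError("service temporarily unavailable")
    return "job-accepted"


# The sleep function is replaced here so the example runs without waiting.
result = call_with_retry(flaky_submit, sleep=lambda seconds: None)
```

Even this small helper shows why such code needs maintenance: the set of errors that count as transient tends to change whenever the underlying service changes.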
Since environmental components can change unexpectedly, such changes can easily break applications that rely on a
concrete configuration, or invalidate the results of such applications. Furthermore, to run efficiently, grid applications
need to be scheduled and then executed in such a manner that the differences of the resources that are actually
used are properly taken into account.
The bottom line is that application programmers have to incorporate completely new and complex paradigms into their
applications, which requires significant experience and effort due to the steep learning curve. Additionally, we have to
take into account that the average application programmer is not a grid expert, but typically a domain expert
wishing to solve domain-related problems.
Most applications share these problems, but code reuse is very difficult, if not impossible, because of the fundamental
differences in the way applications are written and the need to make use of different grid features. Reusing
Globus
[1]
or other middleware-oriented libraries is surely an option, but in the end nearly all grid application programmers
gravitate towards creating their own abstraction layers on top of these libraries.
Ideally grid applications would adapt to the changing environment, discover required grid services at run time, and use
them as needed, independent of the particular interfaces used by the application programmer. Unfortunately this is not
possible in grid environments today, mainly because of the lack of standardized, widely adopted programming interfaces
that can hide most of these complexities from the programmer.
Application interfaces today
All the grid services and middleware systems described earlier offer some form of programming interface, encompassing a
large variety of technologies.
SOAP [4] (Simple Object Access Protocol) services provide a WSDL [5] (Web Services Description Language) description.
Other services (for instance GridFTP [6]) can be accessed via a well-defined protocol or a client-side C API.
Yet others feature a rather complex API but come with a set of easy-to-use user-level tools (for instance GRAM [7]).
In general, the diversity of the technologies is very broad but, for each service or concept, there exists an API or
programming framework designed to support that particular approach.
The overall picture has improved slightly with the emerging web services technologies. The World Wide Web Consortium
(W3C [8]) has defined several standards that provide at least a unified syntactic description of the particular API (via
WSDL [9] and WSRF [10]), and the standardization of SOAP [5] provides a unified transport layer for these.
But even if web services solve some of the diverse problems and are helping to establish a common service infrastructure,
they do not solve the problem of having many different service APIs for similar purposes. Thus, learning (and teaching!)
programming concepts for grid technology involves understanding a number of different frameworks and APIs.
The dominant APIs and technologies today are Globus [1], Condor [12], Unicore [13], and WSDL/WSRF-based services.
Working with specific grid services
Though the grid service landscape may appear diverse at first glance, many concepts and patterns are repeated or
heavily complement or overlap one another. For instance, job submission is almost always rendered in some form of
(1) describing the application, (2) describing the resources to use, and (3) submitting these descriptions to an
execution service that has the required executable running on a target machine. Some of the common concepts are:
-
Security: All grid-related activities need to comply with certain security needs of the application users. Making sure
confidential information stays accessible only to correctly identified and authorized persons is crucial to modern
computation, and even more so for grid-related applications.
-
File handling: This includes (1) handling of files as a whole (copying, moving, deleting of files), (2) accessing the
file content (reading and writing) and (3) querying for different file characteristics such as name, size, and details
on last access. The concepts of file handling are very well understood in current operating systems and are generally
reused in the grid context.
-
Replica handling: This pertains to the handling of file replicas, which are created to provide additional reliability
or scalability. The service maintains and provides access to mapping information from a logical (arbitrary) file name
to a target (actual) file name. The target file name typically represents the physical location of the data. This allows
for the abstraction and separation of an application's execution from concrete names in a local file system. Additionally
it provides a means of managing data replication in grids.
-
Information services: This provides for both the discovery and monitoring of grid resources (compute, network, storage, etc.)
and services, including information on what services may be available from the different resources, and the state of
resources or services at a point in time.
-
Inter-process communication: This concept covers the information exchange between separate jobs generally running on
different resources (remote procedure calls (RPC), monitoring and notification, data transfer, etc.). Most of this is
well understood in established operating systems and programming languages, and generally reused in the grid context.
-
Workflow management: This concept defines how an application flow or business process may be automated, in whole or part.
A workflow includes the documents, information or tasks that are passed from one participant (or component) to another
for action, according to a set of procedural rules and dependencies.
Additional detail and examples are provided later in this section.
As often seen in computer technologies, a level of indirection, or an
abstraction layer
[14],
can be used to resolve some problems. The definition of a general grid API can expose common paradigms and can help
shield the application programmer from unwanted dynamics and technological details that arise from having a multitude of
implementations.
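The idea of such an abstraction layer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: JobDescription, FakeAdaptor, and JobService are hypothetical names and are not taken from GAT, SAGA, or any other toolkit. The application describes a job once, and the layer picks whichever back end is currently reachable.

```python
from dataclasses import dataclass, field


@dataclass
class JobDescription:
    """Middleware-neutral description of what to run and what it needs."""
    executable: str
    arguments: list = field(default_factory=list)
    cpus: int = 1


class FakeAdaptor:
    """Stand-in for a middleware-specific back end (hypothetical)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.submitted = []

    def available(self):
        return self.up

    def submit(self, job):
        self.submitted.append(job)
        return f"{self.name}:{len(self.submitted)}"


class JobService:
    """The only class the application programmer would call directly."""

    def __init__(self, adaptors):
        self.adaptors = list(adaptors)

    def submit(self, job):
        # Try each back end in order and use the first one that responds.
        for adaptor in self.adaptors:
            if adaptor.available():
                return adaptor.submit(job)
        raise RuntimeError("no middleware adaptor is currently available")


primary = FakeAdaptor("primary", up=False)    # simulates a service that is down
fallback = FakeAdaptor("fallback")
service = JobService([primary, fallback])
job_id = service.submit(JobDescription("transform", cpus=128))
```

The application code never names a particular middleware, so a back end can be replaced or upgraded without touching the caller.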
Many of the grid toolkits used today try to present a generic API that can provide a useful level of functionality to
support a variety of applications and use cases. Others (e.g., Globus, Condor) are left open for extensions, which
provides for flexibility over time but can introduce interoperability problems if compatibility is not maintained
across version changes.
In addition to the described programming API, many toolkits provide stand-alone tools that are usable for task
execution. The tool-based approach seems to be more stable, since it can manage API interoperability issues rather
than leave these to the programmer. For instance, the globus-job-submit tool has not changed much through progressive
Globus versions since implementation changes were handled by the toolkit itself.
While APIs and toolkits are easing the development of applications to interact with specific grid services, there is
still a high degree of incompatibility in the running of applications and commands across different grid environments
(e.g., Globus and Unicore). A uniform approach to grid APIs, or the development of a single grid API, could
reduce the variation experienced across current technologies and bring the focus instead to programming techniques
and advantages for running on a grid. Advantages of this approach are obvious:
-
users and programmers will have a lower learning curve because they can focus on concepts only, not on concrete
implementation techniques
-
a great amount of uniformity will be provided with regard to different grid middleware toolkits that have similar
functionality and concepts.
The uniform API approach does have a limitation in that it is a generalization over existing APIs and so would
hide details and specifics. Nevertheless, experience with existing grid-enabling toolkits such as the Grid
Application Toolkit (GAT [15]) or the Simple API for Grid Applications (SAGA [17]) standardization effort at the
Open Grid Forum (OGF [18])
shows that high level, application oriented programming interfaces provide a sensible way of tackling the above
mentioned problems in today's grid application landscape. This approach promises easier development and adaptation
of applications to run in grid environments, and the possibility of building a common and widely available grid API.
Most importantly, the key objective for a grid application interface is simplicity for the application programmer.
It should be easy to use and also easy to install, administer and maintain. Remember: an application programmer
is most often a "typical" domain scientist, such as a physicist, chemist, biologist, or linguist.
Access to information about resources - Information services
Whether you are a grid administrator or a grid user, having access to up-to-date information about the status of the
grid is critical since network connections may be unreliable, resources within a vast and distributed collection may come
and go, and virtual organizations can be dynamic. The grid users need to be able to determine
which resources on the grid are relevant to their application
requirements and available at any given time. The
grid administrators must be able to monitor the "health" of the grid under
their watch and make certain details of the grid available to the grid users.
Standardized grid information services collect a great deal of (even customized)
information
about the grid, provide ways to query against that data, and then present
the results in associated tables.
The Globus Toolkit's Monitoring and Discovery System [22] (MDS) is probably one of the best-known
information services. The MDS system provides the capability to monitor and discover what resources are on the grid,
and report status about resources as they are being used. For example, you may want to discover what computers are available,
what the processor architectures are in each computer, what schedulers are in use, and what sort of load (compute, memory,
disk, and so forth) is on each computer. Likewise you may need to monitor the resources on the grid to observe your job
running and make sure it isn't experiencing any problems. Resource properties appropriate to specific monitoring and
discovery needs can be defined via services such as GRAM, RFT, GridFTP, and RLS.
MDS collects information across multiple, distributed resources on a grid
via aggregator services that collect real-time
(or fairly recent) state information from registered information sources
into an index. Collections of information can be queried through various interfaces
(browser, command line, and Web services). The most recent version of MDS, MDS v4, uses XML and Web services
interfaces to register the sources and locate and access information. This
framework includes 1) explicit registration of the information source with
the aggregator service, 2) expiration (automatic cleaning out) of registrations
not renewed periodically, 3) collection by aggregator of up-to-date information
from all registered information sources, and
4) support for query and publication of results.
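The four elements of this framework can be sketched as a small in-memory index. This is an illustrative model only and does not use the MDS API; the class AggregatorIndex, its methods, and the node names are hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
class AggregatorIndex:
    """Toy model of a soft-state registry with expiring registrations."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._sources = {}   # name -> (provider callable, expiry time)

    def register(self, name, provider, now):
        """1) Register or renew a source; it expires ttl seconds after now."""
        self._sources[name] = (provider, now + self.ttl)

    def expire(self, now):
        """2) Drop registrations that were not renewed in time."""
        stale = [n for n, (_, expiry) in self._sources.items() if expiry <= now]
        for name in stale:
            del self._sources[name]
        return stale

    def collect(self, now):
        """3) Pull current information from every live source."""
        self.expire(now)
        return {n: provider() for n, (provider, _) in self._sources.items()}

    def query(self, predicate, now):
        """4) Return the names of sources whose information matches."""
        return sorted(n for n, info in self.collect(now).items() if predicate(info))


index = AggregatorIndex(ttl_seconds=60)
index.register("nodeA", lambda: {"arch": "x86_64", "load": 0.2}, now=0)
index.register("nodeB", lambda: {"arch": "ppc64", "load": 0.9}, now=0)

lightly_loaded = index.query(lambda info: info["load"] < 0.5, now=30)

# nodeA renews its registration; nodeB does not and is cleaned out.
index.register("nodeA", lambda: {"arch": "x86_64", "load": 0.4}, now=50)
still_registered = sorted(index.collect(now=70))
```

Because stale entries disappear on their own, a resource that leaves the grid without notice stops appearing in query results once its registration lapses.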
As mentioned before, the "MDS-Index" service collects information on various
Globus services and other protocol specific sources and then makes the data
available in XML-based properties that can then be queried and published
with standardized access methods. The data is published according to a schema
that has been defined by the author/administrator or, in the case of a multi-institutional
distributed grid, the collaborators. An example of the latter is a schema called Grid Laboratory Uniform
Environment [23] (GLUE). GLUE was developed by DataTAG [24] for interoperability between European and US grids and is
now maintained by the GLUE Working Group in the Open Grid Forum at GridForge [25].
There are three ways of viewing MDS data: 1) Write
your own application (in C, Java, Python, or .NET) using the standard
Web services interface. 2) Use the command line tool wsrf-get-property.
3) Use the WebMDS tool, which is highly configurable, to view data using a standard web browser.
Figure PC-1. A view of grid information à la MDS.
MDS also includes a trigger service that allows definition of rules for
actions on the data (such as to whom email should be sent and what should
be sent to them) and core Web services security to handle issues like who can
access the indexes and data.
See A Globus Primer [26], the MDS web pages [27], and Globus Monitoring and Discovery (2005 Globus World) [28]
for more details about MDS and its features.
The European GridLab
[29]
project, in collaboration with US researchers at LSU,
has developed an information service called iGrid
[30].
The iGrid distributed
architecture is based on two kinds of information services, iServe and
iStore, both GSI-enabled web services. An iServe service supplies information about
a specific resource, while an iStore service aggregates information coming from
registered iServe services. iGrid is based on a relational DBMS
and utilizes an efficient information caching
policy. It can handle information extracted directly from the computational
resource, where the server is running, and also user-supplied information.
Thus iGrid has both system information providers and user information providers.
The system provides information in XML
format,
while the user provides information via a web service registration method.
The web service itself is based on the
gSOAP
toolkit, the GSI plug-in for gSOAP and the GrelC library. A push model is
used to supply information to iStore from iServe services.
Job submission and management
Two services are typically included in a computing system for processing
jobs -- a job manager and a job scheduler. Sometimes these functions are handled
by separate tools. In other cases one tool may have components that serve
both functions. In this section we will give you a description of each service
and some examples
of software that performs these functions on a grid.
A job manager enables the site or grid administrator to define and enforce procedures and policies
for running jobs on a resource based on a wide range of properties such as computing system or type, user
groups, priorities, run time, queue types and lengths, and so forth. The job manager also provides
the end user with methods for submitting, monitoring, and controlling jobs. In some cases the end user can
define policies within his or her own collection of jobs. On a grid, local resource/job managers communicate
with a global resource manager in order to provide status information to
all administrators and users across the grid.
The job scheduler matches the job with the appropriate resources according to the requirements specified
by the user. The requirements can include items like CPU type and number,
run and/or wall time, memory and/or disk, restarts, checkpoints, and so forth.
(And, in some cases, the job manager or scheduler can remove a job if the job requirements have been incorrectly
specified.)
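The matching step can be illustrated with a minimal Python sketch. The resource names, the attribute keys, and the "most free CPUs wins" tie-break are assumptions made for this example; production schedulers apply far richer policies.

```python
def matches(job, resource):
    """Return True if the resource satisfies every stated job requirement."""
    return (resource["arch"] == job["arch"]
            and resource["free_cpus"] >= job["cpus"]
            and resource["memory_gb"] >= job["memory_gb"]
            and resource["max_walltime_h"] >= job["walltime_h"])


def schedule(job, resources):
    """Pick the matching resource with the most free CPUs, or None."""
    candidates = [r for r in resources if matches(job, r)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["free_cpus"])["name"]


resources = [
    {"name": "clusterA", "arch": "x86_64", "free_cpus": 64,
     "memory_gb": 256, "max_walltime_h": 48},
    {"name": "clusterB", "arch": "x86_64", "free_cpus": 512,
     "memory_gb": 1024, "max_walltime_h": 24},
    {"name": "clusterC", "arch": "ppc64", "free_cpus": 1024,
     "memory_gb": 2048, "max_walltime_h": 72},
]

job = {"arch": "x86_64", "cpus": 128, "memory_gb": 128, "walltime_h": 12}
chosen = schedule(job, resources)

# A request that no resource can satisfy is reported rather than queued forever.
too_long = dict(job, walltime_h=100)
rejected = schedule(too_long, resources)
```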
Job schedulers include products like PBSPro [50], OpenPBS [35], LSF [36], LoadLeveler [37], Maui [38], Moab [39],
and the Globus Resource Allocation Manager [40] (GRAM). Job management is also included in PBSPro, OpenPBS, GRAM,
LSF, and Torque [41].
For example, PBSPro (from Altair Engineering) includes user commands such as qsub (submit
job), qstat (check status of machine, queues, jobs), and qdel (delete
job) for user management of jobs. A simple PBSPro job submission file would
look something like:
#PBS -N Strato-ozone
#PBS -l ncpus=128
#PBS -q flicker
#PBS -k oe
#PBS -m abe
cd ~/ozone
mpirun -np 128 transform
In this case:
- The job name is Strato-ozone (-N).
- The job requires 128 processors (ncpus) and is looking for a queue named flicker (-q).
- Standard output (o) and standard error (e) files should be kept (-k).
- The job owner wants email (-m) to be sent when the job begins (b), ends (e), or aborts (a).
- The job runs from the directory named ozone in the owner's home directory (cd ~/ozone).
- The executable is named transform and has been developed as an MPI application (mpirun).
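To make the structure of the directive lines concrete, here is a small Python sketch that extracts the #PBS options from the script shown above. The function parse_pbs_directives is a hypothetical helper written for this example, not part of PBSPro, and it assumes that every flag is followed by exactly one value, as is the case in this script.

```python
import shlex

SCRIPT = """\
#PBS -N Strato-ozone
#PBS -l ncpus=128
#PBS -q flicker
#PBS -k oe
#PBS -m abe
cd ~/ozone
mpirun -np 128 transform
"""


def parse_pbs_directives(script_text):
    """Return a dict mapping each #PBS flag to its value."""
    options = {}
    for line in script_text.splitlines():
        if line.startswith("#PBS"):
            tokens = shlex.split(line[len("#PBS"):])
            # Tokens come in flag/value pairs such as ["-N", "Strato-ozone"].
            for flag, value in zip(tokens[0::2], tokens[1::2]):
                options[flag] = value
    return options


options = parse_pbs_directives(SCRIPT)
job_name = options["-N"]
queue = options["-q"]
```

Lines that do not start with #PBS, such as the cd and mpirun commands, are ordinary shell commands and are ignored by the parser.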
PBS also includes an X-Windowed interface, called xPBS. A job submission dialog interface can be used along with an interface
where you can monitor hosts, queues, and jobs.
Figure PC-3. XPBS job submission interface.
Figure PC-4. XPBS server, queue, and job information interface.
Altair Engineering also offers a browser portal
called e-Compute
[42]
that works with PBSPro. Likewise, PBSPro provides for command
line and windowed interfaces for the administrator to define queues and policies
and to monitor the environment and health of the resources
being managed.
The Condor project at the University of Wisconsin-Madison provides the ability to join collections
of workstations and clusters together into a distributed high-throughput
computing facility. Condor is also a resource scheduling system and management system for the collected resources.
Condor has mechanisms for matchmaking to select an appropriate computer for a job, checkpointing and
migration of jobs for reliability, running parallel jobs, and for running large workflows.
Condor can handle large numbers of jobs plus
inter-job dependencies and both user and administrator defined job priorities.
Condor jobs run in a number of pre-defined batch "universes", which specify how jobs are to be run
(regular job, job with checkpointing, parallel job, etc.). Jobs are described
in a scripting fashion similar to PBSPro and then submitted in a batch
or background mode. A simple job description file would be:
# Example condor_submit input file
Universe = vanilla
Executable = /home/ozone/condor/transform.condor
Input = transform.stdin
Output = transform.stdout
Error = transform.stderr
Arguments = -arg1 -arg2
InitialDir = /home/ozone/condor/run
Queue
This file is then submitted to Condor via the condor_submit command. The condor_submit command parses the file and
creates a "ClassAd" that describes the job in terms of hardware architecture, operating
system, memory, disk, and so forth. This ClassAd is then sent to the scheduler, which stores the job in its queue.
Queues can be viewed with the condor_q command.
Condor submit files can describe multiple jobs which then become a "cluster"
of jobs when submitted. Each job within a cluster is called a "process".
This sort of feature is particularly useful in applications that require
simple processing across hundreds of data files.
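A submit description for such a cluster can be generated programmatically. The Python sketch below builds one as text; the helper name make_submit_description and the data.N.in/out/err file naming are assumptions made for this example. It relies on the Condor submit-file conventions that "Queue N" creates N processes and that the $(Process) macro expands to each process number.

```python
def make_submit_description(executable, job_count):
    """Build the text of a submit file that queues job_count processes."""
    lines = [
        "# Generated submit description (illustrative sketch)",
        "Universe   = vanilla",
        f"Executable = {executable}",
        # $(Process) is replaced by 0, 1, 2, ... for each job in the cluster.
        "Input      = data.$(Process).in",
        "Output     = data.$(Process).out",
        "Error      = data.$(Process).err",
        f"Queue {job_count}",
    ]
    return "\n".join(lines) + "\n"


description = make_submit_description("/home/ozone/condor/transform.condor", 300)
last_line = description.splitlines()[-1]
```

The resulting text would be saved to a file and passed to condor_submit in the usual way.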
Condor includes additional commands to remove jobs (condor_rm),
temporarily halt (condor_hold) and release (condor_release)
a job, see the history of past jobs (condor_history), and specify
priority order of your jobs (condor_prio). The Condor JobMonitor
provides a viewer for job progress. Scripting options are available to enable email
notification, log files, and more.
Condor can also schedule non-Condor resources through the grid-enabled version, Condor-G.
In a typical scenario, Condor is layered over Globus to provide a "personal batch system" for the grid.
Figure PC-5. Condor/Condor-G scheduling system.
Condor-G maintains information to provide fault tolerance in case of local
or remote crashes or network problems. It also provides a service called
"GlideIn" that makes a wide-area grid appear to be a single Condor pool, and allows all of the Condor
features, such as matchmaking, checkpointing and remote I/O, to work naturally in a grid environment.
Condor-G can also submit jobs to, and manage jobs on, NorduGrid [43],
Oracle Database, Unicore, PBS, LSF, and remote Condor pools.
An excellent tutorial on Condor can be found at
Condor User Tutorial, UK Condor Week, NeSC, October, 2004
[44].
The Condor manual
[45]
is also located at the Condor home page
[12].
Advance Reservation
While schedulers and job managers continue to develop and improve, the advent of their use on distributed systems
such as grids has caused interest in the concept of "advance reservation". As developing applications require more
complex computational capabilities and significantly longer run-times, the ability to assure resources to successfully
complete a job is becoming increasingly important.
Notable approaches to advance reservation include:
- The AIST
Grid Scheduling System
[46]
(GRS) for co-allocation of computing and networking resources. This approach consists of three components: a computing
resource manager, a network resource manager, and a grid resource
scheduler that handles requests from users via the other two.
- The NAREGI
GridVM
[47]
which provides a virtual execution environment and advance reservation of compute nodes.
- Keahey's Virtual Workspace
[48]
which describes an execution environment in terms
of the hardware and software components required. These workspaces can
be implemented in a number of ways, with advance reservation being explored.
- Globus
Toolkit GRAM
[49]
which allows users to create and manage advance
reservations by leveraging the control provided by local resource managers.
For example, under GRAM, the reservation is a separate entity with a reservation ID. A grid user can request the
reservation of specific resources for a period of time. The reservation has a specified lifetime and multiple jobs can be
bound to the reservation throughout this lifetime by the reservation owner, via the reservation ID. A simple image
depicting this advance reservation process is provided by the Globus Alliance. The figure shows a client (user or
administrator) creating and managing reservations through an Advance Reservation System (ARS) and Master Job Scheduler
(MJS) that communicate through an adapter with the Local Resource Manager (such as PBSPro, LSF, or Maui):
Figure PC-6. Globus advance reservation system.
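The reservation model described above (a separate entity with an ID, an owner, a lifetime, and jobs bound to it) can be sketched as follows. This is not the GRAM interface; ReservationService and its methods are hypothetical names, and time is passed in as plain numbers of seconds to keep the example deterministic.

```python
import itertools


class ReservationError(Exception):
    """Raised when a job cannot be bound to a reservation."""


class ReservationService:
    def __init__(self):
        self._ids = itertools.count(1)
        self._reservations = {}

    def create(self, owner, cpus, now, lifetime):
        """Reserve cpus for lifetime seconds and return the reservation ID."""
        reservation_id = f"res-{next(self._ids)}"
        self._reservations[reservation_id] = {
            "owner": owner, "cpus": cpus, "expires": now + lifetime, "jobs": []}
        return reservation_id

    def bind_job(self, reservation_id, user, job_name, now):
        """Attach a job to a live reservation on behalf of its owner."""
        reservation = self._reservations.get(reservation_id)
        if reservation is None:
            raise ReservationError("unknown reservation ID")
        if user != reservation["owner"]:
            raise ReservationError("only the reservation owner may bind jobs")
        if now >= reservation["expires"]:
            raise ReservationError("reservation lifetime has ended")
        reservation["jobs"].append(job_name)

    def jobs(self, reservation_id):
        return list(self._reservations[reservation_id]["jobs"])


service = ReservationService()
rid = service.create("alice", cpus=128, now=0, lifetime=3600)
service.bind_job(rid, "alice", "transform-1", now=10)
service.bind_job(rid, "alice", "transform-2", now=1800)

try:
    service.bind_job(rid, "alice", "transform-3", now=4000)
    late_error = None
except ReservationError as error:
    late_error = str(error)
```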
Data access, movement, and storage
Those interested in grid computing may be looking for increased computational capabilities but very
frequently also need to process large amounts of data. To ensure that data moves where and when it is
needed in a grid environment, bandwidth between disk, cache, memory, and CPU must be considered.
A number of services are available to manage data in a grid environment, but they vary quite a bit within the context
of different grid projects.
The Globus Toolkit GT4 divides the concept of data management into two categories:
data movement and data replication. Data movement is handled by two services.
GridFTP
[51]:
GridFTP is a protocol defined by the Global Grid Forum. The toolkit
provides a server implementation (with Data Storage Interface options for
POSIX, SRB, HPSS, and Condor NeST systems), a command line client, and a
set of development libraries for custom clients.
The command line client is called globus-url-copy and uses the
familiar get and put approach of standard ftp. For example,
globus-url-copy -vp -tcp-bs 5551234 -p 4 file:///mydir/mydata gsiftp://faraway.site.org/tmp/mydata
will put the file "mydata" on my local machine to the file "mydata" in the /tmp
directory on a machine named "faraway" at site.org. Note that globus-url-copy
does not run interactively and should be part of a job script. Alternatively,
to get a file back, the "file" and "gsiftp" parameters are simply switched
in order on the command line. Third-party transfers are also supported. In
this case, both files are given as "gsiftp" URLs:
globus-url-copy -vp -tcp-bs 5551234 -p 4 gsiftp://faraway.right.org/mydata gsiftp://faraway.left.org/tmp/mydata
While GridFTP maintains a familiar concept in file transfer, it is not a
web service protocol. GridFTP also requires an open socket throughout the
transfer, meaning that a failure on either end cannot be recovered, which can be particularly problematic for
large file transfers.
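The direction of a transfer follows entirely from the schemes of the source and destination URLs. The following Python sketch makes that rule explicit; classify_transfer is a hypothetical helper written for this example and is not part of the Globus Toolkit.

```python
from urllib.parse import urlparse

# Mapping from (source scheme, destination scheme) to the kind of transfer.
TRANSFER_KINDS = {
    ("file", "gsiftp"): "put",
    ("gsiftp", "file"): "get",
    ("gsiftp", "gsiftp"): "third-party",
}


def classify_transfer(source_url, destination_url):
    """Name the kind of transfer implied by a source/destination URL pair."""
    pair = (urlparse(source_url).scheme, urlparse(destination_url).scheme)
    if pair not in TRANSFER_KINDS:
        raise ValueError(f"unsupported scheme pair: {pair[0]!r} -> {pair[1]!r}")
    return TRANSFER_KINDS[pair]


put = classify_transfer("file:///mydir/mydata",
                        "gsiftp://faraway.site.org/tmp/mydata")
get = classify_transfer("gsiftp://faraway.site.org/tmp/mydata",
                        "file:///mydir/mydata")
third_party = classify_transfer("gsiftp://faraway.right.org/mydata",
                                "gsiftp://faraway.left.org/tmp/mydata")
```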
Reliable File Transfer
[52]
(RFT): RFT is a part of the web services
framework and therefore provides more functionality in data movement. RFT
uses standard SOAP messages over HTTP
to submit and manage a set of 3rd party GridFTP transfers and to delete
files using GridFTP. By submitting a list
of URL pairs, the user can specify which files are to be transferred or
deleted. Using this approach, the files are created after the user is properly
authorized and authenticated. And since RFT keeps transfer state in a PostgreSQL
database, the file transfer is recoverable in case of any failures.
There is currently no GUI for RFT; command line examples
can be found on the GT 4.0 RFT Command Reference [53] page.
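The reason persistent state makes a transfer recoverable can be shown with a short Python sketch. It is a simplified model, not RFT itself: it uses SQLite from the Python standard library in place of PostgreSQL, and the function names, the example.org URLs, and the unreliable_copy stand-in are all assumptions made for illustration.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open the state database; a file path would survive a restart."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS transfers ("
               "source TEXT, destination TEXT, status TEXT)")
    return db


def submit(db, url_pairs):
    """Record a list of (source, destination) URL pairs as pending."""
    db.executemany("INSERT INTO transfers VALUES (?, ?, 'pending')", url_pairs)
    db.commit()


def run_pending(db, copy):
    """Attempt every pending transfer; return how many remain pending."""
    rows = db.execute("SELECT rowid, source, destination FROM transfers "
                      "WHERE status = 'pending'").fetchall()
    for rowid, source, destination in rows:
        try:
            copy(source, destination)
        except OSError:
            continue   # stays pending, so a later run picks it up again
        db.execute("UPDATE transfers SET status = 'done' WHERE rowid = ?", (rowid,))
        db.commit()
    return db.execute("SELECT COUNT(*) FROM transfers "
                      "WHERE status = 'pending'").fetchone()[0]


attempted = []


def unreliable_copy(source, destination):
    """Stand-in for a real copy: the first attempt on b.dat fails."""
    attempted.append(source)
    if source.endswith("b.dat") and attempted.count(source) == 1:
        raise OSError("connection lost")


db = open_state()
submit(db, [("gsiftp://left.example.org/a.dat", "gsiftp://right.example.org/a.dat"),
            ("gsiftp://left.example.org/b.dat", "gsiftp://right.example.org/b.dat")])
left_after_first_run = run_pending(db, unreliable_copy)
left_after_second_run = run_pending(db, unreliable_copy)
```

Only the failed transfer is retried on the second run, because completed transfers are already marked as done in the database.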
Data replication is currently handled by the Replica Location Service (RLS).
RLS is a simple registry that records where replicas exist on physical storage
systems. The users of the system register the replicas and can follow up
later with queries to find them. RLS is a distributed registry, making it
more scalable and less vulnerable to single points of failure (though it can
be implemented as a centralized registry if preferred). RLS maintains mappings
between a logical file name and the associated physical replicas. Data replicas
are very helpful in situations where large collections of data are used frequently
by a group of people across distributed resources. For more information,
see GT
4.0 RLS
[54].
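The mapping that a replica registry maintains can be modeled in a few lines of Python. The sketch below is not the RLS client API; ReplicaCatalog, its methods, the logical file name, and the example.org URLs are hypothetical.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy registry mapping a logical file name to its physical replicas."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical_name, physical_url):
        self._replicas[logical_name].add(physical_url)

    def unregister(self, logical_name, physical_url):
        self._replicas[logical_name].discard(physical_url)

    def lookup(self, logical_name):
        """Return all known physical locations, or an empty list."""
        return sorted(self._replicas.get(logical_name, ()))


catalog = ReplicaCatalog()
catalog.register("ozone-2005.dat", "gsiftp://storage1.example.org/data/ozone-2005.dat")
catalog.register("ozone-2005.dat", "gsiftp://storage2.example.org/mirror/ozone-2005.dat")

locations = catalog.lookup("ozone-2005.dat")
unknown = catalog.lookup("no-such-file.dat")

# When one copy is retired, the application keeps using the same logical name.
catalog.unregister("ozone-2005.dat", "gsiftp://storage1.example.org/data/ozone-2005.dat")
remaining = catalog.lookup("ozone-2005.dat")
```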
Condor
[12]
includes a software network called Network
Storage Technology
[55]
(NeST) which negotiates guaranteed storage allocations
(or contracts), in terms of "lots", between users and servers for specified
periods of time. NeST provides flexibility in terms of
size and duration of these lots as well as hierarchies (called sublots) and
both user and group access control options. NeST provides multiple interfaces
including protocols for HTTP, GSI-FTP
(a Globus GridFTP collaboration), NFS, and its own "Chirp". NeST also provides
administrators with the ability to define limits and policy, as well as automatic
reclamation of storage at the end of the "contract".
Figure PC-7. Condor NeST architecture.
Operations on a Lot include
- create, delete, and update
- movefile
- adduser, remove user
- attach/detach (binding to specific file or path)
More information on NeST is available in the following paper from the development team:
Flexibility,
Manageability, and Performance in a Grid Storage Appliance
[56].
Reporting grid usage
Gratia and SweGrid Accounting System (SGAS) are examples of grid-wide accounting packages that are capable of
meeting the needs of a large-scale grid today.
Gratia
Gratia is in large-scale operation on the Open Science Grid to collect accounting information. The
software was developed for the Open Science Grid to meet several system requirements, as documented in the
[59]:
"The Grid Accounting Project has:
- designed the schema for the accounting attributes,
- is ensuring the necessary collectors and sensors are in place in the resource providers,
- has defined and is deploying repository and access tools for the reporting
and analysis of the grid wide accounting information.
The Accounting system will properly
determine a confidence level in the existing accounting information and adequately
address and present erroneous or missing accounting data.
The accounting system will adequately protect the privacy
of the users and organizations involved.
The auditing system will use information
from the accounting system and link it to information from other sources
to allow full tracking and
analysis of the actions and events related to a user's resource usage.
The auditing system needs to be able to present the immediate and
short term information of the state and transitions in a user's
use of a resource.
The initial main goal for the accounting system will
be to track VO members' resource usage and to present that information in
a consistent Grid-wide view, focusing in particular on CPU and Disk Storage utilization".
Data is collected via a standard process, running on
each node, which generates daily usage logs containing information on the jobs
that ran and the resources they consumed. For Gratia to use this data for accounting purposes, it
must be sent to the Gratia collector and stored in a reporting database.
The purpose of a Gratia probe is to read the generated files and convert
them to usage records that the Gratia program can then send to the Gratia
collector.
Figure PC-8. The Gratia architecture.
Gratia collects job counts and wall/CPU time by user, by site, and by VO.
Figure PC-9. An example Gratia report.
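The kind of aggregation behind such a report can be sketched in Python. The record fields and the summarize function below are assumptions made for this example and do not reflect the actual Gratia schema or code.

```python
from collections import defaultdict


def summarize(records, key):
    """Total job counts and wall/CPU hours, grouped by the given field."""
    totals = defaultdict(lambda: {"jobs": 0, "wall_hours": 0.0, "cpu_hours": 0.0})
    for record in records:
        entry = totals[record[key]]
        entry["jobs"] += 1
        entry["wall_hours"] += record["wall_hours"]
        entry["cpu_hours"] += record["cpu_hours"]
    return dict(totals)


records = [
    {"user": "alice", "site": "siteA", "vo": "physics",
     "wall_hours": 2.0, "cpu_hours": 1.5},
    {"user": "bob", "site": "siteA", "vo": "physics",
     "wall_hours": 4.0, "cpu_hours": 3.5},
    {"user": "alice", "site": "siteB", "vo": "chemistry",
     "wall_hours": 1.0, "cpu_hours": 0.5},
]

by_vo = summarize(records, "vo")
by_user = summarize(records, "user")
by_site = summarize(records, "site")
```

The same records yield the per-user, per-site, and per-VO views simply by changing the grouping key.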
Installation and implementation information as well as the Gratia mailing list may be found at the twiki page.
See the
Full Project Definition
[60]
for additional information as well.
SGAS
The SweGrid Accounting System
[61]
(SGAS) is a Java implementation of a resource allocation
enforcement and tracking service based on the latest Web services
technologies. SGAS is a soft-state, non-intrusive Grid accounting solution that includes logging
and tracking in GGF Usage Record XML format and a remote and scriptable management
interface.
SGAS is made up of several components:
- Bank - the central service of the accounting system that maintains
and enforces allocation quotas.
- Logging and Usage Tracking Service (LUTS) - a general purpose logging system for tracking resource usage in SGAS. It
allows secure publication and query-based retrieval of usage data in the format of GGF UsageRecord
XML.
- Job Account Reservation Manager (JARM) - a component responsible
for integrating various workload managers, schedulers and local accounting
systems deployed at the resource sites with SGAS. JARM is typically used as a callout
to the bank during the job submission phase. The bank then issues a
time-limited reservation to run the job, based on user, resource and bank
policy. After the job has completed the job is logged in LUTS, and if a valid account
reservation was made, JARM also charges the account in the Bank, and releases
the reservation on behalf of the resource.
- Policy Administration Tool (PAT) - a component
designed to manage the security policies of all of the
SGAS services. It contains a command
line tool that can be run in interactive or batch mode for easy scripting.
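The interplay between JARM, the Bank and LUTS described in the list above can be illustrated with a small toy model. All class and function names here are hypothetical, reservations are not time-limited, and a job without a reservation is simply refused; this is a sketch of the reserve, log, charge and release sequence, not SGAS code.

```python
class Bank:
    """Toy stand-in for the SGAS Bank: holds quotas and reservations."""

    def __init__(self, quotas):
        self.balance = dict(quotas)   # account -> remaining allocation
        self.reserved = {}            # reservation id -> (account, amount)
        self._next_id = 0

    def reserve(self, account, amount):
        """Issue a reservation if the uncommitted balance covers it."""
        held = sum(a for acct, a in self.reserved.values() if acct == account)
        if self.balance.get(account, 0) - held < amount:
            return None
        self._next_id += 1
        self.reserved[self._next_id] = (account, amount)
        return self._next_id

    def charge_and_release(self, reservation_id, used):
        """Charge the actual usage and drop the reservation."""
        account, _ = self.reserved.pop(reservation_id)
        self.balance[account] -= used


usage_log = []  # stand-in for the usage tracking service


def run_job(bank, account, requested, used):
    """JARM-like callout: reserve, run, log, then charge and release."""
    rid = bank.reserve(account, requested)
    if rid is None:
        return False  # assumed policy: no reservation, no job
    usage_log.append({"account": account, "used": used})
    bank.charge_and_release(rid, used)
    return True


bank = Bank({"projectA": 100})
first = run_job(bank, "projectA", requested=60, used=50)
second = run_job(bank, "projectA", requested=60, used=10)
print(first, second, bank.balance)
```

The second job is refused because only 50 units remain after the first job is charged, which is the quota enforcement role the Bank plays.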
Figure PC-10. The SGAS architecture.
SGAS runs on all platforms supporting JRE 1.5.
The Globus Toolkit (GT 4) includes SGAS as part of the available
download
[62].
The BalticGrid project is using SGAS in conjunction with Globus. Their Virtual User System (VUS) requires few authorization
mechanisms (VOMS, gridmap file, banned list,
and SGAS) and handles privilege enforcement on several levels (meta scheduler,
operating system and local scheduler, and application), with a job/account
isolation level. Data is stored in the context of global user identity and
VO and data is gathered for VOs as well as resource owners. Their system can
be summarized in the following diagram.
Figure PC-11. The Globus and SGAS connection.
See the SGAS Accounting System
Installation and Administration Guide
[63]
and Administration Guide
[64]
for more information.
Upcoming challenges for these accounting systems include things like understanding
the discrepancies between different tools, sometimes obscured by failures in
the collection processes. Since the grid computing systems are made up of such
diverse processors and architectures, some form of normalization is under investigation.
Such tools might include better descriptions of the processors to the accounting
system, performance index information, and so forth. Data transport is another
area of interest. The
OSG Gratia Project
[65]
is looking at these things.
Workflow processing
According to Wikipedia, "Workflow at its simplest is the movement of documents
and/or tasks through a work process. More specifically, workflow is the operational
aspect of a work procedure: how tasks are structured, who performs them,
what their relative order is, how they are synchronized, how information
flows to support the tasks (wordflow) and how tasks are being tracked. As
the dimension of time is considered in workflow, workflow considers "throughput" as
a distinct measure. Workflow problems can be modeled and analyzed using graph-based
formalisms like Petri nets."
Clearly, as computing resources become available in more distributed fashion,
as the tools to use them expand, and as our expertise in using them develops,
the concept of managing our workflow across this grid of software and hardware
resources emerges. In this section we will give several examples of grid
products and services that support this concept of workflow management.
Condor DAGman
Condor's Directed Acyclic Graph Manager
[66]
(DAGman) allows you to specify the dependencies
between Condor jobs. (For example, jobs can be ordered chronologically.)
DAGman works through a data structure called the "DAG", a dependency
graph where each job is a node and can have multiple parents and
children, provided no loops are created. A DAG is created via a ".dag" file
such as:
# ozone.dag
Job A jacobian.sub
Job B vadvection.sub
Job C xadvection.sub
Job D xdiff.sub
Job E thdiff.sub
Job F predict.sub
Parent A Child B C
Parent B C Child D E
Parent D E Child F
Here we have six Condor jobs (Job A through Job F), where
each job is specified in its associated condor_submit file (jacobian.sub through predict.sub).
The DAG is put into action via the condor_submit_dag command which runs DAGMan itself as a Condor job
to benefit from Condor's reliability mechanisms. (See the "Job submission and
management" section above for a brief example of the Condor universe.)
A visualization of this process, from the Condor tutorial job, shows that Parent A starts and, upon completion,
child jobs B and C start up.
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
Figure PC-12. DAGman workflow diagram.
Figure PC-13. DAGman workflow progress.
The process continues until all jobs complete.
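To make the ordering rule concrete, the following sketch parses the diamond DAG shown above and groups its jobs into "waves" in which every job's parents have already finished. It is an illustration of the dependency logic only: the function names are made up for this example, and DAGMan itself starts each job as soon as its own parents complete rather than waiting for whole waves.

```python
def parse_dag(text):
    """Parse 'Job' and 'Parent ... Child ...' lines into a job -> parents map."""
    parents = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].lower()
        if keyword == "job":
            parents.setdefault(words[1], set())
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return parents


def execution_waves(parents):
    """Group jobs into waves; a job is ready once all its parents are done."""
    done, waves = set(), []
    while len(done) < len(parents):
        ready = sorted(j for j, p in parents.items() if j not in done and p <= done)
        if not ready:
            raise ValueError("unsatisfiable dependencies: not a valid DAG")
        waves.append(ready)
        done.update(ready)
    return waves


DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

waves = execution_waves(parse_dag(DIAMOND))
print(waves)
```

Job A forms the first wave, B and C can run side by side in the second, and D runs last.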
In case of failure at any step, DAGman will continue as far as possible
and then create a "Rescue File" which holds the current state of the DAG
job. Once the problem has been resolved, the rescue file can be used to restore
the DAG to its prior state. DAGman will continue in this manner until the
entire DAG job completes.
DAGMan has been used to run DAGs of tens of thousands of nodes in production. When running so many nodes on a grid,
many failures are likely to occur; DAGMan therefore provides a variety of features to reliably run and scale large DAGs.
See the Condor manual
[67]
for a complete list of DAGman features.
Swift
Swift
[68]
is a system that builds on and includes technology previously distributed
as the GriPhyN Virtual Data System
[69].
It provides for the specification, execution, and management of large-scale science and engineering
workflows.
It supports
applications that execute many tasks coupled by disk-resident datasets,
for example, when analyzing large quantities of data or performing parameter
studies or ensemble simulations.
Swift is open source software that combines:
- a simple scripting language to enable
the concise, high-level specifications of complex parallel computations,
and mappers for accessing diverse
data formats in a convenient manner. Simple examples of use can be
seen at A Swift Tutorial
[70].
Swift also provides visualizations through generation
of provenance graphs. Swift scripts can be run locally or on remote systems.
The same script files can be used in both cases with modifications made
only to a "site catalog" file that is in XML format.
- an execution engine that can manage
the dispatch of many (100,000+) tasks to many (1000+) processors,
whether on parallel computers,
campus grids,
or multi-site grids. The runtime engine is configured through properties.
Properties are defined at the global, user, and command line levels.
Properties include such things as site and transformation catalog
locations, IP address of GRAM service, caching algorithm information,
provenance graph settings, job clustering information and settings
for kickstart information gathering and throttling (setting limits
for concurrent activities such as workflow instances, tasks/jobs,
file transfers, and so forth).
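The idea of throttling, that is, capping the number of concurrent activities regardless of how many are queued, can be sketched with a semaphore. This is a generic illustration, not Swift's implementation; `run_throttled` and its `limit` parameter are names invented for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_throttled(tasks, limit):
    """Run callables concurrently, never more than `limit` at a time."""
    gate = threading.Semaphore(limit)
    lock = threading.Lock()
    active = 0
    peak = 0

    def guarded(task):
        nonlocal active, peak
        with gate:                      # blocks while `limit` tasks are running
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                return task()
            finally:
                with lock:
                    active -= 1

    with ThreadPoolExecutor(max_workers=len(tasks) or 1) as pool:
        results = list(pool.map(guarded, tasks))
    return results, peak


tasks = [lambda i=i: i * i for i in range(20)]
results, peak = run_throttled(tasks, limit=4)
print(results[:5], "peak concurrency:", peak)
```

The recorded peak never exceeds the limit, even though the thread pool itself would allow all twenty tasks to run at once.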
For more information on Swift, see
The SwiftScript User Guide
[71]
and the
Swiftscript Language Reference Manual
[72].
Pegasus
Planning for Execution in Grids
[73]
(Pegasus) is a workflow-mapping engine developed
and used as part of several NSF ITR projects (GriPhyN, NVO, and SCEC-CME).
Pegasus automatically maps high-level workflow descriptions onto distributed
infrastructures such as the TeraGrid and Open Science Grid. Pegasus:
- enables scientists to construct workflows in abstract terms without
worrying about the details of the underlying Cyberinfrastructure,
- provides
robustness and reliability through dynamic workflow remapping,
- automatically
manages data generated during workflow execution and captures its
provenance information,
- is used in a variety of scientific applications,
including astronomy, biology, earthquake science, gravitational-wave
physics and others,
- is used day-to-day to map complex, large-scale scientific
workflows with thousands of tasks processing TeraBytes of data onto the
Grid.
Pegasus improves the performance of applications through data reuse to avoid
duplication and increase reliability, workflow restructuring to improve resource
allocation, and automated task and data transfer scheduling. Pegasus provides
reliability through dynamic workflow remapping and DAGman workflow execution.
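The data reuse idea mentioned above, skipping work whose results already exist, can be illustrated with a small backward planning pass. The job table, file names, and the `plan` function are hypothetical and do not reflect Pegasus's actual interfaces or catalogs.

```python
def plan(jobs, targets, available):
    """Return the set of jobs that must run to produce `targets`.

    Files already in `available` are reused, so the jobs that would
    regenerate them (and their ancestors) are left out of the plan.
    """
    producer = {out: name for name, spec in jobs.items() for out in spec["outputs"]}
    needed_jobs, pending = set(), list(targets)
    while pending:
        wanted = pending.pop()
        if wanted in available:
            continue                      # reuse the existing data product
        job = producer[wanted]            # KeyError if nothing can produce it
        if job not in needed_jobs:
            needed_jobs.add(job)
            pending.extend(jobs[job]["inputs"])
    return needed_jobs


jobs = {
    "extract": {"inputs": ["raw.dat"], "outputs": ["clean.dat"]},
    "analyze": {"inputs": ["clean.dat"], "outputs": ["stats.dat"]},
    "plot": {"inputs": ["stats.dat"], "outputs": ["figure.png"]},
}

full_plan = plan(jobs, ["figure.png"], available={"raw.dat"})
reduced_plan = plan(jobs, ["figure.png"], available={"raw.dat", "stats.dat"})
print(sorted(full_plan), sorted(reduced_plan))
```

When `stats.dat` is already available, only the final plotting job remains in the plan.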
Pegasus uses Condor and Globus middleware for distributed environments and:
- provides a level of abstraction above gridftp, condor_submit, globus-job-run,
and similar commands
- provides automated mapping and execution (via DAGMan) of workflow
applications onto distributed resources
- manages data files, can store and catalog intermediate and final data
products
- improves
successful application execution
- improves application performance
- provides provenance tracking capabilities
- supplies client-side tools
- provides an OSG-aware workflow management tool
Pegasus usage examples are beyond the scope of version 1 of this cookbook but can be found in the
GriPhyN Virtual Data System Quick Guide
[74].
Security and security integration through authn/authz
Today we are encouraged to use more and more Internet-based services, be they online banking, concert
ticket purchases, attractive free software options, instant messaging, blogs and wikis, or grid computing. As these
services and their sources proliferate, the more nervous we might (and should) become about whether or not
interaction with a service is
secure. Available grid technologies provide various levels of and mechanisms for security within the grid environment
they support.
As has been discussed already, grids often result when multiple institutions form virtual organizations to
accomplish tasks beyond the ability of any single institution. These virtual organizations are structured with
different members having various privileges, often based on roles and relying on agreed-upon methods to
determine and enforce both roles and privileges. For example, some members might only have the right to develop and
run software, while others might serve as community administrators. Through these processes, virtual organizations are
better able to maintain the integrity of their resources and data, at least with respect to the VO itself.
The more difficult aspect is to maintain security across grid environments, or for resources that are connected
to more than one grid environment. Standards development in several related security areas is taking place within
the Open Grid Forum, including user authentication, authorization and
firewall issues
[76].
In addition, organizations such as the IGTF (International Grid Trust Federation) are working to synchronize
policies across grid initiatives in order to develop and maintain a global "trust fabric" that supports scalable and
reliable identification of grid users and resources
[77].
As this work is ongoing, research projects are also underway to develop necessary supporting architecture and software.
One notable example is the
GridShib project
[31],
an NSF funded project of NCSA and the University of Chicago, to integrate the federated authorization infrastructure of
Shibboleth
[32]
with the Globus Toolkit.
As discussed earlier in the Cookbook, the Globus Toolkit provides security via X.509 credentials. Identity-based
authorization is provided via access control lists ("gridmaps") mapping to local identities (Unix logins) and a
Community Authorization Service (CAS). The Shibboleth project offers a large base of campus use around the world
via a standards-based and open source implementation and a standard vocabulary for describing user attributes. With this,
Shibboleth has resulted in a well-developed, federated identity management structure.
From the
GridShib website
[31],
"The goal of GridShib is to allow interoperability
between the Globus Toolkit® from the Globus
Alliance
[1]
with Shibboleth
[32]
from Internet2
[33]. As a
result, GridShib enables secure attribute sharing between Grid virtual
organizations and higher-educational institutions." GridShib provides attribute-based
authorization based on Shibboleth.
In addition, while basic security measures are in place to support virtual
organizations, the size and complexity of the virtual organizations
they can support are limited by the ability of resource managers to manage
the privileges of each user in the virtual organization.
To address this scaling issue, using GridShib, virtual organizations can
use access control methods based on user attributes instead of identity.
As a result, resource managers need not know all of the users in the virtual organization, just
their attributes (for example, Data Analyst or Software Developer).
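The difference between the two approaches can be sketched as follows. The distinguished name, the action names, and the policy table are invented for illustration; the code does not use GridShib, Shibboleth or Globus interfaces.

```python
# Identity-based: every user must be listed individually (gridmap style).
GRIDMAP = {
    "/O=Grid/OU=Example/CN=Alice Smith": "asmith",
}

# Attribute-based: the resource only needs to know which roles may do what.
ATTRIBUTE_POLICY = {
    "submit_job": {"Data Analyst", "Software Developer"},
    "install_software": {"Software Developer"},
}


def identity_authorized(subject_dn):
    """Identity-based check: the user must appear in the map."""
    return subject_dn in GRIDMAP


def attribute_authorized(attributes, action):
    """Attribute-based check: any matching attribute is sufficient."""
    return bool(set(attributes) & ATTRIBUTE_POLICY.get(action, set()))


# A user the resource manager has never seen by name:
bob_dn = "/O=Grid/OU=Example/CN=Bob Jones"
bob_attributes = ["Data Analyst"]

print(identity_authorized(bob_dn))                               # not in the gridmap
print(attribute_authorized(bob_attributes, "submit_job"))        # allowed by role
print(attribute_authorized(bob_attributes, "install_software"))  # role not permitted
```

With the attribute policy, adding a new user to the virtual organization requires no change at the resource, as long as the user's home institution asserts the right attributes.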
The GridShib project has five basic goals:
- Integrate X.509 and
SAML
[75]
to provide enhanced Grid Security Infrastructure
(GSI).
- Enable attribute sharing between virtual organizations and higher-educational
institutions.
- Develop and implement profiles to securely share attributes
across administrative domains.
- Investigate attribute-based access policy
enforcement for grids.
- Generalize attribute-based authorization policies
in the Globus Toolkit runtime environment.
And GridShib has developed around three use cases: established grid user,
new grid user, and portal grid user. See the Technical
Overview
[34]
for more details on current use cases and plug-ins.
Figure PC-2. The GridShib relationship.
Grid-enabling application toolkits
Overview of existing frameworks
There have been several attempts to build grid enabling application toolkits in the past. These toolkits aim to provide client side abstraction layers mainly for grid middleware services and related dynamic 'features' in order to increase both the speed and ease with which grid applications can be deployed. Figure PC-14 illustrates the place of an application within a grid environment and its main interface to the grid (shown in red), which is the focus area for application-oriented grid-enabling toolkits.

Figure PC-14. An application in a grid infrastructure.
Experience in the development of different grid enabling application toolkits suggests that a required
main feature is that they be easy to use. Ease of use includes:
- Exposing a simple and consistent API which allows error tracing to be invariant,
- Making upgrades easy to perform and not reliant on specific versions of grid middleware,
- Exposing a well defined API that is designed to change rarely and to be upward compatible if changes are required,
- Supporting implementations that allow dynamic exchange of key elements (possibly at runtime) and provide runtime abstractions,
- Avoiding refactoring/recoding/recompilation whenever some underlying middleware component may have been changed,
- Ideally, a grid application should be designed to run reliably locally and on the grid, over time, and in light of
differences encountered in the grid environment (e.g. the operating system of various resources, versions of grid middleware, etc.).
Applications should also utilize well-known programming paradigms. For example,
a file API should provide expected functionality (namely open, close, read, write, seek) versus the
introduction of less-straightforward mechanisms (e.g., asking a discovery middleware service to tell the
application the location of a middleware service that can then give the location of the requested file).
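A minimal sketch of such a file API is shown below. The `GridFile` interface, the `open_file` factory, and the `file://` backend are hypothetical names chosen for this example; a real toolkit would add remote backends behind the same five familiar operations.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class GridFile(ABC):
    """Hypothetical toolkit-facing file interface with the familiar verbs."""

    @abstractmethod
    def read(self, size=-1): ...

    @abstractmethod
    def write(self, data): ...

    @abstractmethod
    def seek(self, offset): ...

    @abstractmethod
    def close(self): ...


class LocalFile(GridFile):
    """Backend for local paths; a remote backend would hide service lookups."""

    def __init__(self, path, mode="r+b"):
        self._fh = open(path, mode)

    def read(self, size=-1):
        return self._fh.read(size)

    def write(self, data):
        return self._fh.write(data)

    def seek(self, offset):
        return self._fh.seek(offset)

    def close(self):
        self._fh.close()


def open_file(url, mode="r+b"):
    """Pick a backend from the URL scheme; only 'file://' is sketched here."""
    if url.startswith("file://"):
        return LocalFile(url[len("file://"):], mode)
    raise NotImplementedError("no backend for " + url)


path = os.path.join(tempfile.mkdtemp(), "demo.dat")
f = open_file("file://" + path, "w+b")
f.write(b"hello grid")
f.seek(6)
tail = f.read()
f.close()
print(tail)
```

The application code at the bottom never mentions discovery or location services; any such lookups would live inside a backend class.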
Toolkit example: Simple API for Grid Applications (SAGA)
SAGA (Simple API for Grid Applications) has been defined by the
Open Grid Forum (OGF)
[17]
as a high-level API that directly addresses
the needs of application developers. The purpose of SAGA is two-fold:
- Provide a simple API that can be used with much less effort compared to
the vanilla interfaces of existing grid middleware. A guiding principle for
achieving this simplicity is the 80-20 rule: serve 80% of the use cases with
20% of the effort needed for serving 100% of all possible requirements and
- Provide a standardized, portable, common interface across various grid middleware
systems and their versions.
SAGA is a prominent recent API standardization effort that intends
to simplify the development of grid-enabled applications,
even for scientists having no background in computer
science or grid computing. SAGA was heavily influenced
by the work undertaken in the Gridlab
[19]
project, in particular
by the Grid Application Toolkit (GAT)
[15]
— one of
the first major attempts to build a high level API to grid
services. A public call for use cases produced about 25 different examples that served as input to the SAGA development team.
The following examples of code show typical SAGA use cases and illustrate the intended simplicity of SAGA.
The code is based on the
SAGA C++ reference implementation
[21]
currently developed at the Center for Computation and Technology at LSU, Baton Rouge. Note that the code presented is completely independent of the underlying middleware services that are used.
Figure PC-15. Asynchronous bulk file copy in SAGA: a file is copied to a number of remote locations using
well-known C++ programming paradigms.
Figure PC-15 above shows how easy it is to copy a set of files to different remote locations. Interestingly enough this code
even applies certain optimization techniques, for instance the use of bulk operations if the available middleware services
support this.
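The pattern of an asynchronous bulk copy (start all copies, then wait for the whole batch) can be sketched in plain Python. This is not the SAGA API and does not reproduce the figure's listing: local temporary paths stand in for remote locations, and `bulk_copy` is a name invented for this example.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def bulk_copy(source, destinations, max_workers=4):
    """Start all copies asynchronously and wait for them as one batch."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(shutil.copyfile, source, d): d for d in destinations}
        for future in as_completed(futures):
            dest = futures[future]
            results[dest] = future.exception() is None  # True if the copy succeeded
    return results


# Local files stand in for the remote locations.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.dat")
with open(src, "w") as fh:
    fh.write("payload")
targets = [os.path.join(workdir, f"copy{i}.dat") for i in range(3)]

outcome = bulk_copy(src, targets)
print(outcome)
```

Because the batch is handed over as a whole, a library sitting underneath such a call is free to replace the individual copies with a single bulk request when the middleware supports one.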

Figure PC-16. SAGA file management and job submission. The code is designed to be independent of the deployed Grid middleware.
Figure PC-16 shows code that first backs up a remote file and then starts a (in this case trivial) job operating
on the copied file, intercepting the standard input and output (console) streams this remote job may use.
Requirements Analysis
As outlined above, any grid enabling application toolkit must cope with a number of very
dynamic requirements while providing a "simple" and "easy-to-use" API. The following explains these requirements
in additional detail.
Dynamic Specification Landscape
The Open Grid Forum (OGF)
[17]
is an international standardization body whose primary objective is to define a set of standards in the emerging
field of grid computing. OGF specifications will cover grid architectures, protocols, interfaces, and APIs. However,
the whole field is young and the complexity of grids is not yet completely understood, in terms of academic research or
for industrial and commercial applicability and impact. This fact, along with the complexity of the problem itself, is
causing the grid specification landscape to evolve slowly. There are several significant gaps in the scope of standards
being explored, and it is also generally expected that existing specifications will change. The time needed for grid
standards to stabilize is estimated to be 5 to 10 years. However, the expectation for grid computing to solve real world
problems remains very high, partly due to the initial enthusiasm (or hyping) in the field. This dichotomy creates
frustration for end users in particular since scalability and interoperability are multi-layered problems and difficult to
solve. These observations imply the necessity of an interface abstraction for early adopters to shield grid application
development and deployment from the evolving grid landscape and provide a reasonable migration path to future grid systems.
Thus, any SAGA implementation must include mechanisms for coping with evolving grid standards and changing grid environments.
Evolving Grid Specifications
The SAGA specification itself is currently limited and expected to expand in scope over time. In particular, it is expected
that new SAGA extensions will be required to provide programming paradigms for emerging grid standards to the
application developers. The general look and feel of the SAGA specification, however, is thought to be more stable, and
extensions are expected to be merely semantic (new objects, new method calls) with limited or no syntactical additions
(no change to the object or task models, for instance). Any given SAGA implementation must be able to cope with
future SAGA extensions easily, without breaking support and backward compatibility for early SAGA adopters and applications.
Dynamic Grid Environment
As grid middleware evolves, deployed grid environments face constant changes of middleware deployments (e.g.,
new versions and services are rolled out frequently, often with unclear migration paths). Grid environments are
also dynamic by design, with respect to the availability of services and other resources. Any application designed
to run on grids needs to implement fail safety mechanisms for coping with such changes and not rely on the static
configuration or availability of resources. Much of this dynamism, however, can be hidden from the application programmer
through the use of APIs and toolkits. For example, an upgrade in a service's protocol version could be handled in the
client libraries communicating to the service and not at the application level. Resource discovery, fail safety on service
failures and simple fallbacks such as redundant service deployments are other examples of mechanisms that are vital to
the successful running of a grid application but should ideally not need to be provided in application code. A SAGA
implementation must allow for and, where possible, actively support fail safety mechanisms, and hide the dynamic
nature of grid resource availability from the application.
Heterogeneous Grid Environment
The dynamism of grid environments is also reflected in their potentially heterogeneous nature. Although many
current grids focus or are heavily based on Linux based clusters, grids conceptually are designed to cope with any
OS (real or virtual) running on any platform. (The predominance of Linux is more an indication of the state of grid
middleware development today than an intentional design.) A SAGA implementation must be portable and platform
independent, both syntactically and semantically.
Distributed Grid Applications
Within the domain of distributed applications, which always imply remote communication, latency considerations play a
major role in application design and implementation. A number of application domains have emerged that cope particularly
well with latencies present in distributed environments, by loosely coupling distributed components or utilizing
latency hiding techniques. Latency hiding techniques (such as caches, bulk operations, and interleaving of
computation and communication) often require application level information (e.g. concurrency information of operations)
to be effective. A library designed for distributed applications must allow these and other latency hiding
techniques to be implemented.
End User Requirements
As previously noted, the SAGA specification was developed based on the responses to a call for use cases from the
grid community and is designed to meet the resulting end user requirements. An API implementation must meet other end
user requirements outside the scope of the actual API specification, such as ease of deployment, ease of configuration,
documentation, and support of multiple language bindings. If any of these properties is missing, acceptance and
utility within the targeted user community can be severely limited.
The SAGA C++ Reference Implementation
Thus far, we have covered the motivation and design objectives for a SAGA implementation.
This section will summarize the resulting properties
of the SAGA C++ reference implementation from an end user perspective. The following picture shows the overall architecture of the SAGA implementation.

Figure PC-17. SAGA architecture: A lightweight engine dispatches SAGA calls to dynamically loaded middleware adaptors.
Design Objectives
Although SAGA by definition is intended to be simple for application developers, this doesn't imply that the
implementation itself has to be simple. Logic and functionality built into the SAGA library core provide common
functionality that can be extended through minimal effort. Ideally, adding a new API class is orthogonal to all other
properties of the implementation, and also immediately benefits from those. The library is also designed to be easy to
build, use, and deploy. As described above, a SAGA implementation must cope with a multitude of different dynamic
requirements. A major design objective was to maximize decoupling of different components of the developed library to
provide as much flexibility, adaptability and modularity as possible. The SAGA C++ Reference Implementation was also
designed for maximum portability, anticipating use on different platforms and operating systems.
The SAGA C++ Reference Implementation library is divided into three dimensions, which are described below. These three
dimensions are completely orthogonal — the user of the library may use and combine these freely and develop additional
suitable components usable in tight integration with the provided modules.
Horizontal Extensibility — API Packages
The SAGA specification is object oriented and defines a set of API groups keeping objects of related functionality
together (packages). The SAGA C++ Reference Implementation uses this functional grouping to define API packages. Current
packages are: file management, job management, remote procedure calls, replica management, and data streaming.
Each of these packages constitutes a separate and independent module. These modules depend only on the SAGA engine;
the user is free to use and link only those modules actually needed by the application, minimizing the memory footprint.
New API packages are expected to be added as the SAGA specification evolves. Adding new packages is straightforward
due to the fact that all necessary common operations (such as adaptor loading and selection, or method call routing)
are imported from the SAGA engine.
Vertical Extensibility — Middleware Bindings
A layered architecture allows for vertical decoupling of the SAGA API from the middleware. Separate adaptors, either
loaded at runtime or pre-bound at link time, dispatch the various API function calls to the appropriate middleware.
Usually there will be a separate set of adaptors for each type of supported middleware. These adaptors implement a
well-defined Capability Provider Interface (CPI) and expose that to the top layer of the library, which makes it
possible to switch adaptors at runtime and switch between different (and even concurrent) middleware services
providing the requested functionality. The top library layer dispatches the API function calls to the corresponding
CPI function. It also contains the SAGA engine module, which implements:
-
core SAGA objects such as session, context, task or task container — these objects are responsible for the SAGA look and feel,
and are needed by all API packages,
-
common functions to load and select matching adaptors, to perform generic call routing from API functions to the selected
adaptor, to provide necessary fall back implementations for the synchronous and asynchronous variants of the API functions
(if these are not supported by the selected adaptor).
The dynamic nature of this layered architecture enables easy future extensions through the addition of new adaptors,
helping to cope with emerging grid standards and new grid middleware.
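The routing role of the engine can be illustrated with a toy dispatcher that tries registered adaptors in turn and falls back when one of them fails. The class names, the `file_size` operation, and the `demo://` URL are all hypothetical; the real reference implementation is written in C++ and its Capability Provider Interface is considerably richer than this sketch.

```python
class AdaptorError(Exception):
    """Raised by an adaptor that cannot serve a call."""


class Engine:
    """Toy engine: routes an API call to the first adaptor that succeeds."""

    def __init__(self):
        self._adaptors = []

    def register(self, adaptor):
        self._adaptors.append(adaptor)

    def dispatch(self, operation, *args):
        errors = []
        for adaptor in self._adaptors:
            handler = getattr(adaptor, operation, None)
            if handler is None:
                continue  # this adaptor does not implement the operation
            try:
                return handler(*args)
            except AdaptorError as exc:
                errors.append(f"{type(adaptor).__name__}: {exc}")
        raise AdaptorError("no adaptor succeeded: " + "; ".join(errors))


class BrokenAdaptor:
    """Simulates middleware whose service is currently unavailable."""

    def file_size(self, url):
        raise AdaptorError("service unreachable")


class InMemoryAdaptor:
    """Simulates a second middleware binding that can answer the call."""

    def __init__(self, sizes):
        self._sizes = sizes

    def file_size(self, url):
        return self._sizes[url]


engine = Engine()
engine.register(BrokenAdaptor())
engine.register(InMemoryAdaptor({"demo://host/data.bin": 2048}))
size = engine.dispatch("file_size", "demo://host/data.bin")
print(size)
```

Supporting a new kind of middleware in this sketch means registering one more adaptor object; neither the engine nor the calling code changes.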
Extensibility for Optimization and Features
Many features of the engine module are implemented by intercepting, analyzing, managing, and rerouting function calls
between the API packages (where they are issued) and the adaptors (where they are executed and forwarded to the middleware).
To generalize this management layer, a PIMPL (Private Implementation) idiom was chosen, and is rigorously used
throughout the SAGA implementation.
This PIMPL layering allows for a number of additional properties to be transparently implemented, and experimented with,
without any change in the API packages or adaptor layers. These features include:
- generic call routing
- task monitoring and optimization
- security management
- late binding
- fallback on adaptor invocation errors
- latency hiding mechanisms
These features can essentially be decoupled from the API and the adaptors because these properties affect only the IMPL
side of the PIMPL layers. Firstly, the private implementation classes all inherit from the same base class but that base
class is handled in the central engine module, so the engine can automatically cope with new API packages and adaptors.
Secondly, all method calls are also handled generically in the engine, which is loosely coupled to both the API and
adaptor layers. Any changes to the engine, all optimization, latency hiding techniques, monitoring features etc.
can be implemented in the engine generically, and are orthogonal to the API and adaptor extensions. Hence, the
extensibility of the engine represents the third orthogonal axis in the library's extensibility scheme.
Uniform for Programming Languages
The SAGA API specification is language-independent; however, the goal is to define language bindings that
provide both a language-native look and feel to the API user, and strive for syntactic and semantic similarity
over all SAGA language bindings. One of the consequences of this goal is that the API specification does not use
language specific constructs, for instance C++ templates, which are thought to be too difficult to express
uniformly over many languages. Also, the specification tries to be concise about object state management, and
expresses semantics for shallow and deep copies. The SAGA C++ Reference Implementation follows the SAGA API
specification closely in this respect. It is designed to accommodate wrappers in other languages, to provide the
same semantics, and similar look and feel to other language bindings. A Python wrapper is currently being developed and is in
alpha status, and there are plans to add similar thin wrappers to provide bindings to C, FORTRAN, Perl, and possibly others.
From another point of view, it is extremely convenient to be able to implement adaptors in different languages. The
Grid Application Toolkit
(GAT,
[15]),
a C-based API predecessor of SAGA, already allows adaptors in different languages, and similar mechanisms may be
implemented to allow Python or C based adaptors as well. In particular, Python based adaptors have been extremely
useful for rapid prototyping of middleware bindings for GAT.
Generic with Respect to Middleware, and Adaptable to Dynamic Environments
The dynamism of grid middleware has already been mentioned as a central dominating property of grid environments. This is
addressed in the SAGA C++ Reference Implementation by the described adaptor mechanism that binds to diverse middleware.
Additionally, late binding, fall back mechanisms, and flexible adaptor selection allow for additional resilience against a
dynamic and evolving run time environment. It is noted, however, that adaptors need to deploy mechanisms like resource
discovery and to implement fully asynchronous operations, if the complete software stack is to be able to cope with
dynamic grids. SAGA implementation usability can be severely impacted if the quality of adaptors undermines the library's
mechanisms.
Modularity makes the Implementation Extensible
We have described how the SAGA C++ Reference Implementation will be able to cope with the expected evolution and
extension of the SAGA API. Further, the adaptor mechanism allows the library to be extended easily with
additional middleware bindings. In fact, the major future work for this SAGA implementation will be to provide multiple
sets of stable adaptors for the major grid environments. This task, however, requires considerably more effort than the
implementation of the present library, and it is hoped that grid middleware vendors will be motivated to support and
maintain these adaptors. Ideally, middleware vendors will implement adaptors for SAGA and deliver them as part of their
client-side software stack in the same way that they provide MPI implementations. This would be a major step towards
widespread adoption and would benefit grid applications.
Portability and Scalability
Heterogeneous distributed systems naturally require portable code bases. The SAGA C++ Reference Implementation library
strictly adheres to the C++ standard and to portable libraries. To further ensure compatibility, the library is developed
concurrently on Windows and Linux, the two major target platforms. Problems on other platforms are not expected either.
It should be noted, however, that the portability of the SAGA implementation depends on the portability of the adaptors, and
thus on the portability of the grid middleware client interfaces, which can be a much greater problem than the portability
of the library code itself.
Distributed applications are often sensitive to scalability issues as well, particularly with respect to remote communications.
As SAGA introduces a number of communication mechanisms, scalability concerns naturally arise with respect to SAGA
implementations. The SAGA API does not target high-performance communication schemes, but tries to utilize simple
communication paradigms. In no sense does SAGA intend to replace MPI or other distributed communication libraries.
Having said that, the design allows for zero-copy implementations of the SAGA communication APIs and also for fast
asynchronous notification on events. Both of these are deemed critical for implementing scalable distributed applications.
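To make the idea of asynchronous notification on events more concrete, the following Python sketch contrasts it with polling. It is an illustration only and does not reproduce the SAGA API: the Metric class, its add_callback and update methods, and the "job.state" name are assumptions made for this example. A worker thread stands in for the middleware, and the application simply waits until its callback has seen the final state.

```python
"""Sketch of callback-based event notification (all names are hypothetical)."""

import threading


class Metric:
    """A monitorable value that notifies registered callbacks on change."""

    def __init__(self, name, value):
        self.name = name
        self._value = value
        self._callbacks = []
        self._lock = threading.Lock()

    def add_callback(self, callback):
        with self._lock:
            self._callbacks.append(callback)

    def update(self, value):
        # Take a snapshot of the callbacks under the lock and invoke them
        # outside it, so a callback may itself register further callbacks.
        with self._lock:
            self._value = value
            callbacks = list(self._callbacks)
        for callback in callbacks:
            callback(self.name, value)


done = threading.Event()
seen = []


def on_state_change(name, value):
    """Application callback: record the change and flag completion."""
    seen.append((name, value))
    if value == "Done":
        done.set()


state = Metric("job.state", "New")
state.add_callback(on_state_change)


def report():
    """Stand-in for the middleware reporting job state changes."""
    for value in ("Running", "Done"):
        state.update(value)


worker = threading.Thread(target=report)
worker.start()

# The application blocks on the event instead of repeatedly polling the state.
done.wait(timeout=5)
worker.join()
print(seen)
```

The point of the design is that the application issues no remote queries while it waits; all traffic is initiated by the side that actually observes the change.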
Simplicity for the End User
SAGA is designed to be simple to use. However, how simple an API is to use is determined not only by the API specification,
but also by its implementation. Characteristics that need attention while implementing the SAGA API include simple
deployment and configuration, resilience against lower-level failures, adaptability to diverse environments, stability,
correctness, and peaceful coexistence with other programming paradigms, tools, and libraries. It is a challenge to
keep a library implementation simple, with readable code, but a modular approach helps. For example, it is simple to
hide the generic call routing or the adaptor selection in the engine module, since these features are not usually
exposed to the user or adaptor programmer. However, modeling these central properties as modules can significantly
increase the readability and maintainability of the code. Due to its notion of asynchronous operations, or tasks, the
SAGA API implicitly introduces a concurrent programming model. The C++ language binding of the API allows this
model to be combined with arbitrary mechanisms for managing concurrent program elements (i.e., to ensure object
state consistency in all circumstances, to ensure thread safety, and to allow for application-level semaphores and mutexes).
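The interplay of asynchronous tasks with application-level synchronization can be sketched as follows. This is a Python analogy using the standard concurrent.futures and threading modules, not the SAGA C++ binding; the copy_file function and the byte counts are hypothetical placeholders for real asynchronous grid operations. Each task runs concurrently, and an application-level mutex keeps the shared state consistent.

```python
"""Sketch of asynchronous tasks combined with an application-level mutex."""

from concurrent.futures import ThreadPoolExecutor
import threading

transferred = {"bytes": 0}            # shared application state
transferred_lock = threading.Lock()   # application-level mutex


def copy_file(size):
    """Hypothetical stand-in for an asynchronous grid file transfer."""
    with transferred_lock:            # keep the shared counter consistent
        transferred["bytes"] += size
    return size


sizes = [100, 250, 50]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Starting a task returns immediately with a handle to the running work.
    tasks = [pool.submit(copy_file, size) for size in sizes]
    # Waiting on the handles collects the results in submission order.
    results = [task.result() for task in tasks]

print(results, transferred["bytes"])
```

Without the lock, concurrent updates to the shared counter could be lost; the task model itself does not protect application data, which is why it has to combine cleanly with such mechanisms.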
More information about the SAGA C++ Reference Implementation (currently being developed at the Center for Computation and Technology at Louisiana State University) and about various aspects of grid-enabling toolkits is available on the
SAGA implementation home page [20].
Programming Examples
As with any complex task, the best way to learn is by doing and, in conjunction with that, by examining how
others have approached and handled common situations. In future versions of this Cookbook, we hope to include here
several examples of code and scripts that illustrate common programming techniques, some of which can also serve as building
blocks to be adapted and customized for specific applications. Some topics we will be looking to cover include:
- Performing grid operations
- Grid-enabled applications
- OpenMP and MPI
- VO and Experiment implementation examples
If you have expertise in any of these or related areas, or experience in successful grid application programming, and
are willing to contribute explanations, working examples, code snippets, or "hints & tips",
please contact the co-editors to let us know!
Bibliography
[1] Globus home page (http://www.globus.org/)
[2] Condor GT 4.0 Pre WS GRAM (http://www.globus.org/toolkit/docs/4.0/execution/prewsgram/)
[3] Unicore home page (http://www.unicore.org/)
[4] SOAP (Simple Object Access Protocol) (http://www.w3.org/TR/soap/)
[5] SDL (Specification and Description Language) (http://www.sdl-forum.org/)
[6] GT 4.0 GridFTP (http://www.globus.org/toolkit/docs/4.0/data/gridftp/)
[7] GRAM (GT 4.0 Pre WS GRAM) (http://www.globus.org/toolkit/docs/4.0/execution/prewsgram/)
[8] W3 (World Wide Web Consortium) (http://www.w3.org/)
[9] WSDL (Web Services Description Language) (http://www.w3.org/TR/wsdl)
[10] WSRF (Web Services Resource Framework) download (http://www.oasis-open.org/committees/download.php/16654/wsrf-cs-01.zip)
[11] SOAP (Simple Object Access Protocol) (http://www.w3.org/TR/soap/)
[12] Condor home page (http://www.cs.wisc.edu/condor/)
[13] Unicore home page (http://www.unicore.org/)
[14] abstraction layer (Wikipedia definition) (http://en.wikipedia.org/wiki/Abstraction_layer)
[15] GAT (Grid Application Toolkit and Testbed) (http://www.gridlab.org/wp-1)
[16] SAGA (Simple API for Grid Apps) (https://forge.gridforum.org/projects/saga-rg/)
[17] OGF (Open Grid Forum) (http://www.ogf.org)
[19] Gridlab (http://www.gridlab.org)
[20] SAGA implementation home page (http://fortytwo.cct.lsu.edu:8000/SAGA)
[21] SAGA C++ reference implementation (http://www.cct.lsu.edu/projects/Grid+Application+Toolkit)
[22] Monitoring and Discovery System (http://www.globus.org/toolkit/mds/)
[23] Grid Laboratory Uniform Environment (http://forge.gridforum.org/sf/projects/glue-wg)
[24] DataTAG (http://datatag.web.cern.ch/datatag/)
[25] GridForge (http://forge.gridforum.org/sf/sfmain/do/home)
[26] A Globus Primer (http://www.globus.org/toolkit/docs/4.0/key/GT4_Primer_0.6.pdf)
[27] MDS web pages (http://www.globus.org/mds)
[28] Globus Monitoring and Discovery (2005 Globus World) (http://www.globus.org/toolkit/presentations/GlobusWorld_2005_Session_9c.pdf)
[29] GridLab (http://www.gridlab.org/)
[30] iGrid (http://sara.unile.it/%7Ecafaro/software.html)
[31] GridShib website (http://gridshib.globus.org/about.html)
[32] Shibboleth (http://shibboleth.internet2.edu/)
[33] Internet2 (http://www.internet2.edu/)
[34] GridShib Technical Overview (http://grid.ncsa.uiuc.edu/presentations/gridshib-tech-overview-apr06.ppt)
[35] OpenPBS (http://www-unix.mcs.anl.gov/openpbs/)
[36] LSF (http://www.platform.com/Products/Platform.LSF.Family/)
[37] LoadLeveler (http://www-128.ibm.com/developerworks/eserver/library/es-loadlevel/index.html)
[38] Maui (http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php)
[39] Moab (http://www.clusterresources.com/pages/products/moab-cluster-suite.php)
[40] Globus Resource Allocation Manager (http://tinyurl.com/2shtao)
[41] Torque (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
[42] e-Compute (http://www.altair.com/software/ecompute.htm)
[43] Nordugrid (http://www.nordugrid.org/)
[44] Condor User Tutorial, UK Condor Week, NeSC, October 2004 (http://www.nesc.ac.uk/talks/438/11th/user_tutorial.ppt)
[45] Condor manual (http://www.cs.wisc.edu/condor/manual/v6.4/)
[46] AIST Grid Scheduling System (http://www.aist.go.jp/aist_e/aist_today/2006_20/hot_line/hot_line_21.html)
[47] NAREGI GridVM (http://tinyurl.com/24nenc)
[48] Keahey's Virtual Workspace (http://workspace.globus.org/papers/)
[49] Globus Toolkit GRAM (http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4045)
[50] PBSPro (http://www.altair.com/software/pbspro.htm)
[51] GridFTP (http://www.globus.org/toolkit/docs/4.0/data/gridftp/)
[52] Reliable File Transfer (http://www.globus.org/toolkit/docs/4.0/data/rft/)
[53] GT 4.0 RFT Command Reference (http://www.globus.org/toolkit/docs/4.0/data/rft/RFT_Commandline_Frag.html)
[54] GT 4.0 RLS (http://www.globus.org/toolkit/docs/4.0/data/rls/)
[55] Network Storage Technology (http://www.cs.wisc.edu/condor/nest/)
[56] Flexibility, Manageability, and Performance in a Grid Storage Appliance (http://www.cs.wisc.edu/condor/nest/papers/nest-hpdc-02.pdf)
[57] SRM/DRM (https://twiki.grid.iu.edu/twiki/bin/view/Integration/SrmDrm)
[58] srmcp (https://twiki.grid.iu.edu/twiki/bin/view/Documentation/StorageSrmcpUsing)
[59] Gratia twiki page (https://twiki.grid.iu.edu/twiki/bin/view/Accounting/WebHome)
[60] Full Project Definition (https://twiki.grid.iu.edu/twiki/bin/viewfile/Accounting/WebHome?filename=AccountingProjectDefinition1.doc)
[61] SweGrid Accounting System (http://www.sgas.se/)
[62] SGAS Download (http://www-unix.globus.org/toolkit/docs/4.0/techpreview/sgas/)
[63] SGAS Installation and Administration Guide (http://www.sgas.se/docs/SGASInstallConfig.pdf)
[64] SGAS Administration Guide (http://www.sgas.se/docs/SGASAdmin.pdf)
[65] OSG Gratia Project (https://twiki.grid.iu.edu/twiki/bin/view/Accounting/WebHome)
[66] Directed Acyclic Graph Manager (DAGman) (http://www.cs.wisc.edu/condor/dagman/)
[67] Condor manual (http://www.cs.wisc.edu/condor/manual/v6.4/)
[68] Swift (http://www.ci.uchicago.edu/swift/index.php)
[69] GriPhyN Virtual Data System (http://www.griphyn.org/news/index.html)
[70] A Swift Tutorial (http://www.ci.uchicago.edu/swift/guides/tutorial.php)
[71] The SwiftScript User Guide (http://www.ci.uchicago.edu/swift/guides/userguide.php#engineconfiguration)
[72] Swiftscript Language Reference Manual (http://www.ci.uchicago.edu/swift/guides/languagespec.php)
[73] Planning for Execution in Grids (http://pegasus.isi.edu/)
[74] GriPhyN Virtual Data System Quick Guide (http://pegasus.isi.edu/docs/QuickGuide.pdf)
[75] Security Assertion Markup Language (SAML) (http://xml.coverpages.org/saml.html)
[76] Open Grid Forum Security groups (http://www.ogf.org/gf/group_info/areasgroups.php?area_id=7)
[77] International Grid Trust Federation (IGTF): Grid's Policy Management Authority (http://www.gridpma.org/)