 Programming Concepts & Challenges

  


Introduction

Earlier chapters have focused on the concepts and components that form the basis of grid environments. This chapter will provide additional depth on how these can be leveraged in grid-enabled applications, and the challenges inherent in the programming of such applications.

The advocates of grid computing promise a world where large, shared, scientific research instruments, experimental data, numerical simulations, analysis tools, research and development platforms, as well as people, are closely coordinated and integrated in 'virtual organizations'. Still, relatively few grid-enabled applications exist that exploit the full potential of grid environments. This may be largely attributed to the difficulties faced by application developers in trying to master the complex interplay of the various components, including resource reservation, security, accounting, and communication. Moreover, typical grid middleware (e.g., Globus [1], Condor [2], and Unicore [3]) provides relatively complex programming interfaces and is still under active development, so significantly changed software releases appear frequently.

Dealing with complex and changing programming interfaces is difficult in and of itself and partially responsible for the fact that few applications have been grid-enabled. An additional aspect of the problem is that we are still learning how applications in general can benefit from running on a grid and the best ways to optimize individual applications to take maximum advantage of the grid environment. Unlike homogeneous parallel machines or clusters, grid environments are heterogeneous and dynamic in nature, and subject to change at various levels:

  • on the hardware level, where the application programmer has to deal with different computer architectures, chipsets, execution speeds and models,
  • on the software level, including different operating systems (and versions), different compilers, inhomogeneous software environments, etc., and
  • on the administrative level, where the programmer faces various and incompatible administrative policies between different grid resources.

The current grid scenario consists of:

  • services (and interfaces) that are upgraded on a regular basis
  • institutions (i.e. resources, services, applications) that join and leave a grid without much notice
  • changes in the application environment at run time, including services that go down without warning, resources that get busy or become available without notice, and fluctuations in the capacity of available storage.

In grid environments, conditions change constantly, and at a far greater rate than in situations where activities are controlled under a single administrative domain. Today's grid middleware allows you to cope with these changes, but addressing them in the most effective way can be a major programming effort. In the end, a grid-enabled application requires additional code for handling transient problems, and portions of the application code can require very frequent maintenance.
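One common form of the "additional code for handling transient problems" mentioned above is a retry wrapper with exponential backoff. The following is a minimal sketch only: the exception type `TransientGridError`, the function `call_with_retries`, and the simulated `flaky_submit` service are hypothetical names invented for this illustration and do not belong to any grid middleware.

```python
import time


class TransientGridError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a service briefly down)."""


def call_with_retries(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(); on a transient error, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientGridError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Simulated service (an assumption for the demo) that fails twice, then succeeds.
calls = {"count": 0}


def flaky_submit():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientGridError("service temporarily unavailable")
    return "job-accepted"


delays = []  # record the waits instead of really sleeping
result = call_with_retries(flaky_submit, sleep=delays.append)
print(result, delays)
```

Passing the `sleep` function as a parameter keeps the sketch testable; a real application would also have to decide which failures are transient at all, which is exactly the kind of grid-specific knowledge that is hard for a domain programmer to acquire.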

Since environmental components can change unexpectedly, such changes can easily break applications that rely on a concrete configuration, or invalidate the results of such applications. Furthermore, to run efficiently, grid applications need to be scheduled and then executed in such a manner that the differences of the resources that are actually used are properly taken into account.

The bottom line is that application programmers have to incorporate completely new and complex paradigms into their applications, which requires significant experience and effort due to the steep learning curve. Additionally we have to take into account the fact that the average application programmer is not a grid expert, but typically a domain expert wishing to solve domain-related problems.

Most applications share these problems, but code reuse is very difficult, if not impossible, because of the fundamental differences in the way applications are written and the need to make use of different grid features. Reusing Globus [1] or other middleware-oriented libraries is surely an option, but in the end nearly all grid application programmers gravitate towards creating their own abstraction layers on top of these libraries.

Ideally grid applications would adapt to the changing environment, discover required grid services at run time, and use them as needed, independent of the particular interfaces used by the application programmer. Unfortunately this is not possible in grid environments today, mainly because of the lack of standardized, widely adopted programming interfaces that can hide most of these complexities from the programmer.


Application interfaces today

All the grid services and middleware systems described earlier offer some form of programming interface, encompassing a large variety of technologies. SOAP [4] (Simple Object Access Protocol) services provide a WSDL [5] (Web Services Description Language) description. Other services (for instance GridFTP [6]) can be accessed via a well-defined protocol or a client-side C API. Yet others feature a rather complex API but come with a set of easy-to-use user-level tools (for instance GRAM [7]). In general, the diversity of the technologies is very broad but, for each service or concept, there exists an API or programming framework designed to support that particular approach.

The overall picture has improved slightly with the emerging web services technologies. The W3C [8] has defined several standards that provide at least a unified syntactic description of the particular API (via WSDL [9] and WSRF [10]), and the standardization of SOAP [5] provides a unified transport layer for these. But even if web services solve some of these problems and are helping to establish a common service infrastructure, they do not solve the problem of having many different service APIs for similar purposes. Thus, learning (and teaching!) programming concepts for grid technology involves understanding a number of different frameworks and APIs.

The dominant APIs and technologies today are Globus [1], Condor [12], Unicore [13], and WSDL/WSRF-based services.


Working with specific grid services

Though the grid service landscape may appear diverse at first glance, many concepts and patterns are repeated or heavily complement or overlap one another. For instance, job submission is almost always rendered in some form of (1) describing the application, (2) describing the resources to use, and (3) submitting these descriptions to an execution service that has the required executable running on a target machine. Some of the common concepts are:

  • Security: All grid-related activities need to comply with certain security needs of the application users. Making sure confidential information stays accessible only to correctly identified and authorized persons is crucial to modern computation and even more so for grid-related applications.
  • File handling: This includes (1) handling of files as a whole (copying, moving, deleting of files), (2) accessing the file content (reading and writing) and (3) querying for different file characteristics such as name, size, and details on last access. The concepts of file handling are very well understood in current operating systems and are generally reused in the grid context.
  • Replica handling: This pertains to the handling of file replicas, which are created to provide additional reliability or scalability. The service maintains and provides access to mapping information from a logical (arbitrary) file name to a target (actual) file name. The target file name typically represents the physical location of the data. This allows for the abstraction and separation of an application's execution from concrete names in a local file system. Additionally, it provides a means of managing data replication in grids.
  • Information services: This provides for both the discovery and monitoring of grid resources (compute, network, storage, etc.) and services, including information on what services may be available from the different resources, and the state of resources or services at a point in time.
  • Inter-process communication: This concept covers the information exchange between separate jobs generally running on different resources (remote procedure calls (RPC), monitoring and notification, data transfer, etc.). Most of this is well understood in established operating systems and programming languages, and generally reused in the grid context.
  • Workflow management: This concept defines how an application flow or business process may be automated, in whole or part. A workflow includes the documents, information or tasks that are passed from one participant (or component) to another for action, according to a set of procedural rules and dependencies.

Additional detail and examples are provided later in this section.

As often seen in computer technologies, a level of indirection, or an abstraction layer [14], can be used to resolve some problems. The definition of a general grid API can expose common paradigms and can help shield the application programmer from unwanted dynamics and technological details that arise from having a multitude of implementations.

Many of the grid toolkits used today try to present a generic API that can provide a useful level of functionality to support a variety of applications and use cases. Others (e.g., Globus, Condor) are left open for extensions, which provides for flexibility over time but can introduce interoperability problems if compatibility is not maintained across version changes.

In addition to the described programming API, many toolkits provide stand-alone tools that are usable for task execution. The tool-based approach seems to be more stable, since it can manage API interoperability issues rather than leave these to the programmer. For instance, the globus-job-submit tool has not changed much through progressive Globus versions since implementation changes were handled by the toolkit itself.

While APIs and toolkits are easing the development of applications to interact with specific grid services, there is still a high degree of incompatibility in the running of applications and commands across different grid environments (e.g., Globus and Unicore). A uniform approach to grid APIs, or the development of a single grid API, could reduce the variation experienced across current technologies and bring the focus instead to programming techniques and the advantages of running on a grid. The advantages of this approach are clear:

  • users and programmers will have a lower learning curve because they can focus on concepts only, not on concrete implementation techniques
  • a great amount of uniformity will be provided with regard to different grid middleware toolkits that have similar functionality and concepts.

The uniform API approach does have a limitation in that it is a generalization over existing APIs and so would hide details and specifics. Nevertheless, experience with existing grid-enabling toolkits such as the Grid Application Toolkit (GAT [15]) or the Simple API for Grid Applications (SAGA [17]) standardization effort at the Open Grid Forum (OGF [18]) shows that high-level, application-oriented programming interfaces provide a sensible way of tackling the above-mentioned problems in today's grid application landscape. This approach promises easier development and adaptation of applications to run in grid environments, and the possibility of building a common and widely available grid API.

Most importantly, the key objective for a grid application interface is simplicity for the application programmer. It should be easy to use and also easy to install, administer and maintain. Remember: an application programmer is most often a "typical" domain scientist, such as a physicist, chemist, biologist, or linguist.
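The abstraction-layer idea discussed in this section can be sketched as a middleware-neutral job description plus one small adapter per backend. This is an illustration only and is not the GAT or SAGA API: the `JobDescription` class, the `to_pbs` and `to_condor` functions, and the `render` dispatcher are hypothetical, and the generated text merely imitates the PBS and Condor job files shown elsewhere in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class JobDescription:
    """Middleware-neutral description of a job (hypothetical, for illustration)."""
    name: str
    executable: str
    arguments: list = field(default_factory=list)
    cpus: int = 1
    queue: str = ""


def to_pbs(job):
    """Render the job as a PBS-style batch script."""
    lines = [f"#PBS -N {job.name}", f"#PBS -l ncpus={job.cpus}"]
    if job.queue:
        lines.append(f"#PBS -q {job.queue}")
    lines.append(" ".join([job.executable, *job.arguments]))
    return "\n".join(lines)


def to_condor(job):
    """Render the job as a Condor-style submit description."""
    lines = ["Universe = vanilla", f"Executable = {job.executable}"]
    if job.arguments:
        lines.append("Arguments = " + " ".join(job.arguments))
    lines.append("Queue")
    return "\n".join(lines)


# One adapter per backend; application code only ever sees JobDescription.
ADAPTERS = {"pbs": to_pbs, "condor": to_condor}


def render(job, backend):
    """Select the adapter by name so calling code never touches backend syntax."""
    return ADAPTERS[backend](job)


job = JobDescription(name="Strato-ozone", executable="transform",
                     arguments=["-arg1"], cpus=128, queue="flicker")
print(render(job, "pbs"))
print(render(job, "condor"))
```

The sketch also shows the limitation noted above: the Condor adapter silently drops the queue and CPU count, because a generalized description cannot expose every backend-specific detail.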


Access to information about resources - Information services

Whether you are a grid administrator or a grid user, having access to up-to-date information about the status of the grid is critical since network connections may be unreliable, resources within a vast and distributed collection may come and go, and virtual organizations can be dynamic. The grid users need to be able to determine which resources on the grid are relevant to their application requirements and available at any given time. The grid administrators must be able to monitor the "health" of the grid under their watch and make certain details of the grid available to the grid users. Standardized grid information services collect a great deal of (even customized) information about the grid, provide ways to query against that data, and then present the results in associated tables.

The Globus Toolkit Information Service Monitoring and Discovery System [22] (MDS) is probably one of the best-known information services. The MDS system provides the capability to monitor and discover what resources are on the grid, and to report the status of resources as they are being used. For example, you may want to discover what computers are available, what the processor architectures are in each computer, what schedulers are in use, and what sort of load (compute, memory, disk, and so forth) is on each computer. Likewise you may need to monitor the resources on the grid to observe your job running and make sure it isn't experiencing any problems. Resource properties appropriate to specific monitoring and discovery needs can be defined via services such as GRAM, RFT, GridFTP, and RLS.

MDS collects information across multiple, distributed resources on a grid via aggregator services that collect real-time (or fairly recent) state information from registered information sources into an index. Collections of information can be queried through various interfaces (browser, command line, and Web services). The most recent version of MDS, MDS v4, uses XML and Web services interfaces to register the sources and to locate and access information. This framework includes 1) explicit registration of the information source with the aggregator service, 2) expiration (automatic cleaning out) of registrations not renewed periodically, 3) collection by the aggregator of up-to-date information from all registered information sources, and 4) support for query and publication of results.
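The four framework steps just listed (registration, expiration, collection, and query) can be sketched as a small in-memory index with soft-state registrations. This is a toy model under stated assumptions and is not the MDS API: the `AggregatorIndex` class, its method names, and the `free_cpus` property are all invented for the example, and time is passed in explicitly as `now` so that the behavior is deterministic.

```python
class AggregatorIndex:
    """Toy index with soft-state registrations (illustration only, not MDS)."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.sources = {}  # name -> (expiry_time, callable returning current state)

    def register(self, name, provider, now):
        """Step 1: register or renew a source; it expires unless renewed in time."""
        self.sources[name] = (now + self.lifetime, provider)

    def expire(self, now):
        """Step 2: drop registrations whose lifetime has run out."""
        self.sources = {n: s for n, s in self.sources.items() if s[0] > now}

    def collect(self, now):
        """Step 3: gather up-to-date state from every live source."""
        self.expire(now)
        return {name: provider() for name, (_, provider) in self.sources.items()}

    def query(self, predicate, now):
        """Step 4: return only the sources whose state satisfies the predicate."""
        return {n: st for n, st in self.collect(now).items() if predicate(st)}


index = AggregatorIndex(lifetime=60)
index.register("clusterA", lambda: {"free_cpus": 12}, now=0)
index.register("clusterB", lambda: {"free_cpus": 0}, now=0)
index.register("clusterA", lambda: {"free_cpus": 12}, now=50)  # renewal

live = sorted(index.collect(now=70))  # clusterB was never renewed
idle = index.query(lambda st: st["free_cpus"] > 0, now=70)
print(live, idle)
```

Soft state means that a resource that disappears without notice simply stops renewing and drops out of the index on its own, which suits the dynamic grid environment described in the introduction.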

As mentioned before, the "MDS-Index" service collects information on various Globus services and other protocol-specific sources and then makes the data available as XML-based properties that can be queried and published with standardized access methods. The data is published according to a schema that has been defined by the author/administrator or, in the case of a multi-institutional distributed grid, the collaborators. An example of the latter is a schema called Grid Laboratory Uniform Environment [23] (GLUE). GLUE was developed by DataTAG [24] for interoperability between European and US grids and is now maintained by the GLUE Working Group in the Open Grid Forum at GridForge [25].

There are three ways of viewing MDS data: 1) Write your own application (in C, Java, Python, or .NET) using the standard Web services interface. 2) Use the command line tool wsrf-get-property. 3) Use the WebMDS tool, which is highly configurable, to view data using a standard web browser.


Figure PC-1. A view of grid information à la MDS.
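Because index data is exposed as XML, the first viewing option (writing your own application) usually comes down to selecting elements from an XML document. The sketch below runs such a selection over a locally embedded sample rather than a live service. The element and attribute names (`Index`, `Host`, `Load`, `Scheduler`, `arch`) are invented for illustration and are neither the GLUE schema nor real MDS output.

```python
import xml.etree.ElementTree as ET

# Hypothetical index content; real MDS data follows a published schema such as GLUE.
SAMPLE = """
<Index>
  <Host name="node1" arch="x86_64"><Load>0.20</Load><Scheduler>PBS</Scheduler></Host>
  <Host name="node2" arch="ia64"><Load>0.95</Load><Scheduler>Condor</Scheduler></Host>
  <Host name="node3" arch="x86_64"><Load>0.75</Load><Scheduler>PBS</Scheduler></Host>
</Index>
"""


def lightly_loaded(xml_text, arch, max_load):
    """Return names of hosts with the given architecture and load <= max_load."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, including attribute predicates.
    hosts = root.findall(f".//Host[@arch='{arch}']")
    # Numeric comparison is not part of that subset, so it is done in Python.
    return [h.get("name") for h in hosts if float(h.findtext("Load")) <= max_load]


print(lightly_loaded(SAMPLE, "x86_64", 0.5))
```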

MDS also includes a trigger service that allows definition of rules for actions on the data (such as to whom email should be sent and what should be sent to them) and core Web services security to handle issues like who can access the indexes and data.

See A Globus Primer [26], the MDS web pages [27], and Globus Monitoring and Discovery (2005 Globus World) [28] for more details about MDS and its features.

The European GridLab [29] project, in collaboration with US researchers at LSU, has developed an information service called iGrid [30]. The iGrid distributed architecture is based on two kinds of information services, iServe and iStore, both GSI-enabled web services. An iServe service supplies information about a specific resource, while an iStore service aggregates information coming from registered iServe services. iGrid is based on a relational DBMS and utilizes an efficient information caching policy. It can handle information extracted directly from the computational resource where the server is running, and also user-supplied information. Thus iGrid has both system information providers and user information providers. The system providers deliver information in XML format, while users supply information via a web service registration method. The web service itself is based on the gSOAP toolkit, the GSI plug-in for gSOAP, and the GrelC library. A push model is used to supply information to iStore from iServe services.


Job submission and management

Two services are typically included in a computing system for processing jobs -- a job manager and a job scheduler. Sometimes these functions are handled by separate tools. In other cases one tool may have components that serve both functions. In this section we will give you a description of each service and some examples of software that performs these functions on a grid.

A job manager enables the site or grid administrator to define and enforce procedures and policies for running jobs on a resource based on a wide range of properties such as computing system or type, user groups, priorities, run time, queue types and lengths, and so forth. The job manager also provides the end user with methods for submitting, monitoring, and controlling jobs. In some cases the end user can define policies within his or her own collection of jobs. On a grid, local resource/job managers communicate with a global resource manager in order to provide status information to all administrators and users across the grid.

The job scheduler matches the job with the appropriate resources according to the requirements specified by the user. The requirements can include items like CPU type and number, run and/or wall time, memory and/or disk, restarts, checkpoints, and so forth. (In some cases, the job manager or scheduler can remove a job if the job requirements have been incorrectly specified.)
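The matching step described above can be sketched as a filter over resource descriptions followed by a simple ranking. This is a deliberately simplified illustration: the dictionary keys, the resource names, and the "most free CPUs wins" policy are assumptions made for the example, and production schedulers apply far richer policies.

```python
def matches(job, resource):
    """A resource fits when it meets every stated requirement of the job."""
    return (resource["cpu_type"] == job["cpu_type"]
            and resource["free_cpus"] >= job["cpus"]
            and resource["memory_gb"] >= job["memory_gb"]
            and resource["max_walltime_h"] >= job["walltime_h"])


def schedule(job, resources):
    """Pick the fitting resource with the most free CPUs, or None if nothing fits."""
    candidates = [r for r in resources if matches(job, r)]
    return max(candidates, key=lambda r: r["free_cpus"], default=None)


# Hypothetical resources and job requirements.
resources = [
    {"name": "clusterA", "cpu_type": "x86_64", "free_cpus": 64,
     "memory_gb": 128, "max_walltime_h": 24},
    {"name": "clusterB", "cpu_type": "x86_64", "free_cpus": 256,
     "memory_gb": 512, "max_walltime_h": 48},
    {"name": "clusterC", "cpu_type": "ia64", "free_cpus": 512,
     "memory_gb": 1024, "max_walltime_h": 72},
]
job = {"cpu_type": "x86_64", "cpus": 128, "memory_gb": 256, "walltime_h": 36}

chosen = schedule(job, resources)
print(chosen["name"] if chosen else "no matching resource")
```

A job whose requirements no resource can satisfy returns `None` here, which corresponds to the case where an incorrectly specified job is rejected or removed.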

Job schedulers include products like PBSPro [50], OpenPBS [35], LSF [36], LoadLeveler [37], Maui [38], Moab [39], and Globus Resource Allocation Manager [40] (GRAM). Job management is also included in PBSPro, OpenPBS, GRAM, LSF, and Torque [41].

For example, PBSPro (from Altair Engineering) includes user commands such as qsub (submit job), qstat (check status of machine, queues, jobs), and qdel (delete job) for user management of jobs. A simple PBSPro job submission file would look something like this:

#PBS -N Strato-ozone
#PBS -l ncpus=128
#PBS -q flicker
#PBS -k oe
#PBS -m abe
cd ~/ozone
mpirun -np 128 transform

In this case:

  • The job name is Strato-ozone (-N).
  • The job requires 128 processors (ncpus) and is looking for a queue named flicker (-q).
  • Standard output (o) and standard error (e) files should be kept (-k).
  • The job owner wants email (-m) to be sent when the job begins (b), ends (e), or aborts (a).
  • The job runs from the directory named ozone in the owner's home directory.
  • The executable is named transform and has been developed as an MPI application (mpirun).

PBS also includes an X-Windowed interface, called xPBS. A job submission dialog interface can be used along with an interface where you can monitor hosts, queues, and jobs.


Figure PC-3. XPBS job submission interface.


Figure PC-4. XPBS server, queue, and job information interface.

Altair Engineering also offers a browser portal called e-Compute [42] that works with PBSPro. Likewise, PBSPro provides for command line and windowed interfaces for the administrator to define queues and policies and to monitor the environment and health of the resources being managed.

The Condor project at the University of Wisconsin-Madison provides the ability to join collections of workstations and clusters together into a distributed high-throughput computing facility. Condor is also a resource scheduling system and management system for the collected resources. Condor has mechanisms for matchmaking to select an appropriate computer for a job, checkpointing and migration of jobs for reliability, running parallel jobs, and for running large workflows.

Condor can handle large numbers of jobs plus inter-job dependencies and both user and administrator defined job priorities. Condor jobs run in a number of pre-defined batch "universes", which specify how jobs are to be run (regular job, job with checkpointing, parallel job, etc.). Jobs are described in a scripting fashion similar to PBSPro and then submitted in a batch or background mode. A simple job description file would be:

# Example condor_submit input file
Universe = vanilla
Executable = /home/ozone/condor/transform.condor
Input = transform.stdin
Output = transform.stdout
Error = transform.stderr
Arguments = -arg1 -arg2
InitialDir = /home/ozone/condor/run
Queue

This file is then submitted to Condor with the condor_submit command. The condor_submit command initiates parsing of the file and creation of a "ClassAd" that describes the job in terms of hardware architecture, operating system, memory, disk, and so forth. This ClassAd is then sent to the scheduler, which stores the job in its queue. Queues can be viewed with the condor_q command.

Condor submit files can describe multiple jobs which then become a "cluster" of jobs when submitted. Each job within a cluster is called a "process". This sort of feature is particularly useful in applications that require simple processing across hundreds of data files.
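For the "hundreds of data files" case, the submit description is usually generated rather than written by hand. The sketch below builds one submit file in which each data file gets its own Input, Output, and Error settings followed by a Queue statement, so that every file becomes one process of the resulting cluster. The function `build_cluster_submit` and the data file names are hypothetical; the executable path is the one used in the example above.

```python
def build_cluster_submit(executable, data_files):
    """Build a submit description that queues one job per data file."""
    lines = ["Universe = vanilla", f"Executable = {executable}"]
    for name in data_files:
        # Each Queue statement queues a job with the settings in effect at that point.
        lines += [f"Input = {name}",
                  f"Output = {name}.out",
                  f"Error = {name}.err",
                  "Queue"]
    return "\n".join(lines) + "\n"


# Hypothetical input files; a real run would list the actual data directory.
files = [f"ozone_{i:03d}.dat" for i in range(3)]
submit_text = build_cluster_submit("/home/ozone/condor/transform.condor", files)
print(submit_text)
```

The generated text would be saved to a file and passed to condor_submit in the usual way.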

Condor includes additional commands to remove jobs (condor_rm), temporarily halt (condor_hold) and release (condor_release) a job, see the history of past jobs (condor_history), and specify priority order of your jobs (condor_prio). The Condor JobMonitor provides a viewer for job progress. Scripting options are available to enable email notification, log files, and more.

Condor can also schedule non-Condor resources through the grid-enabled version, Condor-G. In a typical scenario, Condor is layered over Globus to provide a "personal batch system" for the grid.


Figure PC-5. Condor/Condor-G scheduling system.

Condor-G maintains information to provide fault tolerance in case of local or remote crashes or network problems. It also provides a service called "GlideIn" that makes a wide-area grid appear to be a single Condor pool, and allows all of the Condor features, such as matchmaking, checkpointing and remote I/O, to work naturally in a grid environment.

Condor-G can also submit jobs to, and manage them on, Nordugrid [43], Oracle Database, Unicore, PBS, LSF, and remote Condor pools.

An excellent tutorial on Condor can be found at Condor User Tutorial, UK Condor Week, NeSC, October, 2004 [44]. The Condor manual [45] is also located at the Condor home page [12].

Advance Reservation

While schedulers and job managers continue to develop and improve, the advent of their use on distributed systems such as grids has caused interest in the concept of "advance reservation". As developing applications require more complex computational capabilities and significantly longer run-times, the ability to assure resources to successfully complete a job is becoming increasingly important.

Notable approaches to advance reservation include:

  • The AIST Grid Scheduling System [46] (GRS) for co-allocation of computing and networking resources. This approach consists of three components: a computing resource manager, a network resource manager, and a grid resource scheduler that handles requests from users via the other two.
  • The NAREGI GridVM [47], which provides a virtual execution environment and advance reservation of compute nodes.
  • Keahey's Virtual Workspace [48], which is an execution environment defined in terms of the hardware and software components required. These workspaces can be implemented in a number of ways, with advance reservation being explored.
  • Globus Toolkit GRAM [49], which allows users to create and manage advance reservations by leveraging the control provided by local resource managers.

For example, under GRAM, the reservation is a separate entity with a reservation ID. A grid user can request the reservation of specific resources for a period of time. The reservation has a specified lifetime, and the reservation owner can bind multiple jobs to it throughout this lifetime via the reservation ID. A simple image depicting this advance reservation process is provided by the Globus Alliance. The figure shows a client (user or administrator) creating and managing reservations through an Advanced Registration System (ARS) and Master Job Scheduler (MJS) that communicate through an adapter with the Local Resource Manager (such as PBSPro, LSF, Maui, etc.):


Figure PC-6. Globus advance reservation system.
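The reservation model described above (a separate entity with an ID, a lifetime, and jobs bound by the owner) can be sketched as follows. This is not the GRAM interface: the `ReservationService` class, its methods, and the ID format are hypothetical, and time is passed in as plain numbers to keep the example deterministic.

```python
class ReservationService:
    """Toy advance-reservation registry (illustration only, not GRAM)."""

    def __init__(self):
        self._next = 1
        self._reservations = {}

    def create(self, owner, cpus, start, end):
        """Reserve cpus for [start, end) and return the reservation ID."""
        rid = f"res-{self._next}"
        self._next += 1
        self._reservations[rid] = {"owner": owner, "cpus": cpus,
                                   "start": start, "end": end, "jobs": []}
        return rid

    def bind(self, rid, user, job, now):
        """Bind a job to a reservation; only the owner may do so, and only before it ends."""
        res = self._reservations[rid]
        if user != res["owner"]:
            raise PermissionError("only the reservation owner may bind jobs")
        if now >= res["end"]:
            raise ValueError("reservation lifetime has ended")
        res["jobs"].append(job)

    def jobs(self, rid):
        return list(self._reservations[rid]["jobs"])


service = ReservationService()
rid = service.create(owner="alice", cpus=128, start=100, end=200)
service.bind(rid, "alice", "transform-1", now=50)   # bound before the window opens
service.bind(rid, "alice", "transform-2", now=150)  # bound during the window
print(rid, service.jobs(rid))
```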


Data access, movement, and storage

Those interested in grid computing may be looking for increased computational capabilities but very frequently also have a need to process large amounts of data. To ensure that data can be moved where and when it is needed in a grid environment, the bandwidth between disk, cache, memory, and CPU must be considered.

A number of services are available to manage data in a grid environment, but they vary quite a bit within the context of different grid projects.

The Globus Toolkit GT4 divides the concept of data management into two categories: data movement and data replication. Data movement is handled by two services.

GridFTP [51]: GridFTP is a protocol defined by the Global Grid Forum. The toolkit provides a server implementation (with Data Storage Interface options for POSIX, SRB, HPSS, and Condor NeST systems), a command line client, and a set of development libraries for custom clients.

The command line client is called globus-url-copy and uses the get and put approach of standard ftp. For example,

globus-url-copy -vp -tcp-bs 5551234 -p 4 file:///mydir/mydata gsiftp://faraway.site.org/tmp/mydata

copies the file "mydata" from the local machine to the file "mydata" in the /tmp directory on a machine named "faraway" at site.org. Note that globus-url-copy does not run interactively and should be part of a job script. Alternatively, to get a file back, the "file" and "gsiftp" URLs are simply switched in order on the command line. Third-party transfers are also supported. In this case, both the source and the destination are "gsiftp" URLs:

globus-url-copy -vp -tcp-bs 5551234 -p 4 gsiftp://faraway.right.org/mydata gsiftp://faraway.left.org/tmp/mydata

While GridFTP maintains a familiar concept in file transfer, it is not a web service protocol. GridFTP also requires an open socket throughout the transfer, meaning that a failure on either end cannot be recovered, which can be particularly problematic for large file transfers.

Reliable File Transfer [52] (RFT): RFT is a part of the web services framework and therefore provides more functionality in data movement. RFT uses standard SOAP messages over HTTP to submit and manage a set of third-party GridFTP transfers and to delete files using GridFTP. By submitting a list of URL pairs, the user can specify which files are to be transferred or deleted. Using this approach, the files are created only after the user has been properly authenticated and authorized. And since RFT keeps transfer state in a PostgreSQL database, the file transfer is recoverable in case of any failures.

There is currently no GUI for RFT; various command line examples can be found at the GT 4.0 RFT Command Reference [53] page.
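The recoverability that RFT gains from persistent transfer state can be illustrated with a small sketch. Note the substitutions: SQLite from the Python standard library stands in for PostgreSQL, a plain callback stands in for the GridFTP transfer, and all function names and host names are hypothetical. It shows the idea of durable per-transfer state, not the RFT implementation.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the transfer-state store; SQLite is an assumed stand-in."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS transfers "
               "(source TEXT, dest TEXT, status TEXT)")
    return db


def submit(db, url_pairs):
    """Record a list of (source, destination) URL pairs as pending transfers."""
    db.executemany("INSERT INTO transfers VALUES (?, ?, 'pending')", url_pairs)
    db.commit()


def run_pending(db, copy):
    """Attempt every pending transfer; failed ones stay pending for a later run."""
    rows = db.execute("SELECT rowid, source, dest FROM transfers "
                      "WHERE status = 'pending'").fetchall()
    for rowid, source, dest in rows:
        try:
            copy(source, dest)
        except OSError:
            continue  # state is untouched, so the transfer can be retried
        db.execute("UPDATE transfers SET status = 'done' WHERE rowid = ?", (rowid,))
        db.commit()


def statuses(db):
    return [row[0] for row in db.execute("SELECT status FROM transfers ORDER BY rowid")]


db = open_state()
submit(db, [("gsiftp://a.example.org/f1", "gsiftp://b.example.org/f1"),
            ("gsiftp://a.example.org/f2", "gsiftp://b.example.org/f2")])

failed_once = set()


def flaky_copy(source, dest):
    """Simulated transfer that loses the connection on the first attempt at f2."""
    if source.endswith("f2") and source not in failed_once:
        failed_once.add(source)
        raise OSError("connection lost")


run_pending(db, flaky_copy)
first = statuses(db)
run_pending(db, flaky_copy)
second = statuses(db)
print(first, second)
```

With a file path instead of `:memory:`, the pending rows would also survive a restart of the process that drives the transfers.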

Data replication is currently handled by the Replica Location Service (RLS). RLS is a simple registry that records where replicas exist on physical storage systems. The users of the system register the replicas and can follow up later with queries to find them. RLS is a distributed registry, making it more scalable and less vulnerable to single points of failure (though it can be implemented as a centralized registry if preferred). RLS maintains mappings between a logical file name and the associated physical replicas. Data replicas are very helpful in situations where large collections of data are used frequently by a group of people across distributed resources. For more information, see GT 4.0 RLS [54].
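The mapping that RLS maintains, from one logical file name to any number of physical replicas, can be sketched as a small in-memory catalog. This is a toy model and not the RLS client API: the `ReplicaCatalog` class, the `lfn://` naming, and the host names are invented for the illustration.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy logical-to-physical file name registry (illustration only, not RLS)."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical, physical):
        """Record that a replica of the logical file exists at a physical location."""
        self._replicas[logical].add(physical)

    def unregister(self, logical, physical):
        """Remove one replica; unknown entries are ignored."""
        self._replicas[logical].discard(physical)

    def lookup(self, logical):
        """Return all known physical locations for a logical file name."""
        return sorted(self._replicas.get(logical, ()))


catalog = ReplicaCatalog()
catalog.register("lfn://ozone/run42.dat", "gsiftp://siteA.example.org/data/run42.dat")
catalog.register("lfn://ozone/run42.dat", "gsiftp://siteB.example.org/store/run42.dat")
print(catalog.lookup("lfn://ozone/run42.dat"))
```

An application that refers only to the logical name can then pick whichever replica is closest or least loaded at run time.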

Condor [12] includes a software network called Network Storage Technology [55] (NeST), which negotiates guaranteed storage allocations (or contracts), in terms of "lots", between users and servers for specified periods of time. NeST provides flexibility in terms of the size and duration of these lots as well as hierarchies (called sublots) and both user and group access control options. NeST provides multiple interfaces, including protocols for HTTP, GSI-FTP (a Globus GridFTP collaboration), NFS, and its own "Chirp". NeST also provides administrators with the ability to define limits and policy as well as the automatic reclamation of storage at the end of the "contract".


Figure PC-7. Condor NeST architecture.

Operations on a lot include:

  • create, delete, and update
  • movefile
  • adduser, remove user
  • attach/detach (binding to specific file or path)

More information on NeST is available in the following paper from the development team: Flexibility, Manageability, and Performance in a Grid Storage Appliance [56].


Reporting grid usage

Gratia and SweGrid Accounting System (SGAS) are examples of grid-wide accounting packages that are capable of meeting the needs of a large-scale grid today.

Gratia

Gratia is in large-scale operation on the Open Science Grid to collect accounting information. The software was developed for the Open Science Grid to meet several system requirements, as documented in [59]:

"The Grid Accounting Project has:

  • designed the schema for the accounting attributes,
  • is ensuring the necessary collectors and sensors are in place in the resource providers,
  • has defined and is deploying repository and access tools for the reporting and analysis of the grid wide accounting information.

The Accounting system will properly determine a confidence level in the existing accounting information and adequately address and present erroneous or missing accounting data.

The accounting system will adequately protect the privacy of the users and organizations involved.

The auditing system will use information from the accounting system and link it to information from other sources to allow full tracking and analysis of the actions and events related to a user's resource usage.

The auditing system needs to be able to present the immediate and short term information of the state and transitions in a user's use of a resource.

The initial main goal for the accounting system will be to track VO members' resource usage and to present that information in a consistent Grid-wide view, focusing in particular on CPU and Disk Storage utilization".

Data is collected via a standard process, running on each node, which generates daily usage logs containing information on the jobs that ran and the resources they consumed. This data can be used by Gratia for accounting purposes, but it first needs to be sent to the Gratia collector to be stored in a reporting database. A Gratia probe reads the generated files and converts them to usage records that the Gratia program can then send to the Gratia collector.


Figure PC-8. The Gratia architecture.

Gratia collects job counts and wall/cpu time used by a user, for a site, and for a VO.
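The kind of per-user, per-site, and per-VO summary described above can be sketched in a few lines of Python. This is an illustration only and not Gratia's implementation; the record fields (`user`, `site`, `vo`, `wall_s`, `cpu_s`), the sample data, and the function name `summarize` are all assumptions made for the example.

```python
from collections import defaultdict

def summarize(records, key):
    """Group usage records by `key` and total job count, wall time and CPU time."""
    totals = defaultdict(lambda: {"jobs": 0, "wall_s": 0.0, "cpu_s": 0.0})
    for rec in records:
        bucket = totals[rec[key]]
        bucket["jobs"] += 1
        bucket["wall_s"] += rec["wall_s"]
        bucket["cpu_s"] += rec["cpu_s"]
    return dict(totals)

# Hypothetical usage records, one per finished job.
RECORDS = [
    {"user": "alice", "site": "siteA", "vo": "vo1", "wall_s": 3600.0, "cpu_s": 3400.0},
    {"user": "alice", "site": "siteB", "vo": "vo1", "wall_s": 1800.0, "cpu_s": 1700.0},
    {"user": "bob", "site": "siteA", "vo": "vo2", "wall_s": 600.0, "cpu_s": 550.0},
]

if __name__ == "__main__":
    # The same records viewed by user, by site, and by VO.
    for key in ("user", "site", "vo"):
        print(key, summarize(RECORDS, key))
```

Keeping one flat record per job and grouping at report time means the same data can answer all three questions without being stored three times.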


Figure PC-9. An example Gratia report.

Installation and implementation information as well as the Gratia mailing list may be found at the twiki page. See the Full Project Definition [60] for additional information as well.

SGAS

The SweGrid Accounting System [61] (SGAS) is a Java implementation of a resource allocation enforcement and tracking service based on the latest Web services technologies. SGAS is a soft-state, non-intrusive Grid accounting solution that includes logging and tracking in GGF Usage Record XML format and a remote and scriptable management interface.

SGAS is made up of several components:

  • Bank - the central service of the accounting system that maintains and enforces allocation quotas.
  • Logging and Usage Tracking Service (LUTS) - a general purpose logging system for tracking resource usage in SGAS. It allows secure publication and query-based retrieval of usage data in the format of GGF UsageRecord XML.
  • Job Account Reservation Manager (JARM) - a component responsible for integrating various workload managers, schedulers and local accounting systems deployed at the resource sites with SGAS. JARM is typically used as a callout to the bank during the job submission phase. The bank then issues a time-limited reservation to run the job, based on user, resource and bank policy. After the job has completed the job is logged in LUTS, and if a valid account reservation was made, JARM also charges the account in the Bank, and releases the reservation on behalf of the resource.
  • Policy Administration Tool (PAT) - a component designed to manage the security policies of all of the SGAS services. It contains a command line tool that can be run in interactive or batch mode for easy scripting.
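The reserve, log, charge, and release sequence described for JARM above can be sketched as follows. This is a toy model in Python, not SGAS code (SGAS itself is a Java Web services implementation); the class `Bank`, the function `run_job`, and every parameter name are hypothetical, and a plain list stands in for LUTS.

```python
import time

class Bank:
    """Toy allocation bank: holds quotas and time-limited reservations."""

    def __init__(self, quotas):
        self.balance = dict(quotas)  # account -> remaining allocation (CPU hours)
        self.holds = {}              # reservation id -> (account, amount, expiry)
        self._next_id = 1

    def _reserved(self, account, now):
        # Only reservations that have not yet expired count against the account.
        return sum(amount for acc, amount, expiry in self.holds.values()
                   if acc == account and expiry > now)

    def reserve(self, account, amount, ttl_s=3600, now=None):
        """Return a reservation id, or None if the quota does not cover `amount`."""
        now = time.time() if now is None else now
        available = self.balance.get(account, 0.0) - self._reserved(account, now)
        if amount > available:
            return None
        rid = self._next_id
        self._next_id += 1
        self.holds[rid] = (account, amount, now + ttl_s)
        return rid

    def charge_and_release(self, rid, used):
        """Charge actual usage to the account and drop the reservation."""
        account, _amount, _expiry = self.holds.pop(rid)
        self.balance[account] -= used


def run_job(bank, usage_log, account, estimate, actual):
    """Callout at submission, then logging and charging after completion."""
    rid = bank.reserve(account, estimate)
    if rid is None:
        return False                      # denied during job submission
    # ... the job would run here ...
    usage_log.append({"account": account, "cpu_hours": actual})  # stand-in for LUTS
    bank.charge_and_release(rid, actual)
    return True
```

The reservation is what keeps two jobs submitted at the same time from both spending the same remaining quota; the time limit keeps a crashed job from holding that quota forever.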


Figure PC-10. The SGAS architecture.

SGAS runs on all platforms supporting JRE 1.5.

The Globus Toolkit (GT 4) includes SGAS as part of the available download [62]. The BalticGrid project is using SGAS in conjunction with Globus. Their Virtual User System (VUS) requires a few authorization mechanisms (VOMS, gridmap file, banned list, and SGAS) and handles privilege enforcement on several levels (meta scheduler, operating system and local scheduler, and application), with a job/account isolation level. Data is stored in the context of global user identity and VO, and data is gathered for VOs as well as resource owners. Their system can be summarized in the following diagram.


Figure PC-11. The Globus and SGAS connection.

See the SGAS Accounting System Installation and Administration Guide [63] and Administration Guide [64] for more information.

Upcoming challenges for these accounting systems include understanding the discrepancies between different tools, which are sometimes obscured by failures in the collection processes. Because grid computing systems are made up of such diverse processors and architectures, some form of normalization is under investigation. Possible approaches include better descriptions of the processors to the accounting system, performance index information, and so forth. Data transport is another area of interest. The OSG Gratia Project [65] is investigating these issues.
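One simple form that such normalization could take is to scale raw CPU time by a per-processor performance index relative to a reference machine. The sketch below only illustrates that idea; the function name, the parameters, and the choice of a linear scaling are assumptions, and nothing here describes what Gratia or SGAS actually do.

```python
def normalized_cpu_hours(cpu_hours, perf_index, reference_index=1.0):
    """Scale raw CPU hours by a processor's performance index relative to a reference."""
    if perf_index <= 0 or reference_index <= 0:
        raise ValueError("performance indices must be positive")
    return cpu_hours * perf_index / reference_index

if __name__ == "__main__":
    # Ten hours on a processor rated twice as fast as the reference machine.
    print(normalized_cpu_hours(10.0, perf_index=2.0))
```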


Workflow processing

According to Wikipedia, "Workflow at its simplest is the movement of documents and/or tasks through a work process. More specifically, workflow is the operational aspect of a work procedure: how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks (wordflow) and how tasks are being tracked. As the dimension of time is considered in workflow, workflow considers "throughput" as a distinct measure. Workflow problems can be modeled and analyzed using graph-based formalisms like Petri nets."

Clearly, as computing resources become available in more distributed fashion, as the tools to use them expand, and as our expertise in using them develops, the concept of managing our workflow across this grid of software and hardware resources emerges. In this section we will give several examples of grid products and services that support this concept of workflow management.

Condor DAGman

Condor's Directed Acyclic Graph Manager [66] (DAGMan) allows you to specify the dependencies between Condor jobs. (For example, jobs can be ordered chronologically.) DAGMan works through a data structure called the "DAG", a dependency graph where each job is a node and can have multiple parents and children, provided no loops are created. A DAG is created via a ".dag" file such as:

# ozone.dag
Job A jacobian.sub
Job B vadvection.sub
Job C xadvection.sub
Job D xdiff.sub
Job E thdiff.sub
Job F predict.sub

Parent A Child B C
Parent B C Child D E

Parent D E Child F

Here we have six Condor jobs (Job A through Job F), where each job is specified in its associated condor_submit file (jacobian.sub through predict.sub). The DAG is put into action via the condor_submit_dag command, which runs DAGMan itself as a Condor job to benefit from Condor's reliability mechanisms. (See the "Job submission and management" section above for a brief example of the Condor universe.)
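To make the dependency structure concrete, the following Python sketch reads the Job and Parent/Child lines of the ozone example and derives one order in which the jobs could run. It is an illustration of the graph idea only and is not how DAGMan is implemented; the function names `parse_dag` and `run_order` are made up for this example, only the two keywords shown above are handled, and the standard-library `graphlib` module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

DAG_TEXT = """\
# ozone.dag
Job A jacobian.sub
Job B vadvection.sub
Job C xadvection.sub
Job D xdiff.sub
Job E thdiff.sub
Job F predict.sub
Parent A Child B C
Parent B C Child D E
Parent D E Child F
"""

def parse_dag(text):
    """Return (jobs, parents): job -> submit file, and job -> set of parent jobs."""
    jobs, parents = {}, {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue                                  # skip blanks and comments
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return jobs, parents

def run_order(parents):
    """One valid execution order; graphlib raises CycleError if a loop exists."""
    return list(TopologicalSorter(parents).static_order())

if __name__ == "__main__":
    jobs, parents = parse_dag(DAG_TEXT)
    print(run_order(parents))
```

Jobs that share the same parents (B and C, or D and E) have no ordering constraint between them, which is what allows them to run concurrently.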

A visualization of this process, from the Condor tutorial job, shows that Parent A starts and, upon completion, child jobs B and C start up.

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D


Figure PC-12. DAGman workflow diagram.


Figure PC-13. DAGman workflow progress.

The process continues until all jobs complete. In case of failure at any step, DAGman will continue as far as possible and then create a "Rescue File" which holds the current state of the DAG job. Once the problem has been resolved, the rescue file can be used to restore the DAG to its prior state. DAGman will continue in this manner until the entire DAG job completes.
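The "continue as far as possible, then resume from the saved state" behavior can be sketched as below. This is a simplified model in Python and not DAGMan's actual algorithm or rescue file format: the function `run_dag`, the callback `run_job`, and the use of a plain set of finished job names in place of a rescue file are all assumptions for illustration.

```python
# The diamond DAG from above: job -> set of parent jobs.
DIAMOND = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

def run_dag(parents, run_job, done=()):
    """Run every job whose parents have all succeeded.

    `done` lists jobs completed in an earlier attempt (the role played by the
    rescue file). Returns (done, failed) as two sets of job names.
    """
    done, failed = set(done), set()
    progress = True
    while progress:
        progress = False
        for job in sorted(parents):
            if job in done or job in failed:
                continue
            if parents[job] <= done:          # all parents finished successfully
                progress = True
                if run_job(job):
                    done.add(job)
                else:
                    failed.add(job)           # descendants will never become ready
    return done, failed

if __name__ == "__main__":
    # First attempt: job B fails, so D cannot start, but C still runs.
    done, failed = run_dag(DIAMOND, lambda job: job != "B")
    print("after first attempt:", sorted(done), sorted(failed))
    # Second attempt: resume from the saved state; only B and D are run.
    done, failed = run_dag(DIAMOND, lambda job: True, done=done)
    print("after resume:", sorted(done), sorted(failed))
```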

DAGMan has been used to run DAGs of tens of thousands of nodes in production. When running so many nodes on a grid, many failures are likely to occur; DAGMan therefore provides a variety of features to run and scale large DAGs reliably.

See the Condor manual [67] for a complete list of DAGman features.

Swift

Swift [68] is a system that builds on and includes technology previously distributed as the GriPhyN Virtual Data System [69]. It provides for the specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.

Swift is open source software that combines:

  • a simple scripting language to enable the concise, high-level specifications of complex parallel computations, and mappers for accessing diverse data formats in a convenient manner. Simple examples of use can be seen at A Swift Tutorial [70]. Swift also provides visualizations through generation of provenance graphs. Swift scripts can be run locally or on remote systems. The same script files can be used in both cases with modifications made only to a "site catalog" file that is in XML format.
  • an execution engine that can manage the dispatch of many (100,000+) tasks to many (1000+) processors, whether on parallel computers, campus grids, or multi-site grids. The runtime engine is configured through properties. Properties are defined at the global, user, and command-line levels and include such things as site and transformation catalog locations, the IP address of the GRAM service, caching algorithm information, provenance graph settings, job clustering information, and settings for kickstart information gathering and throttling (setting limits for concurrent activities such as workflow instances, tasks/jobs, file transfers, and so forth).
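The idea of layered properties and throttling can be illustrated with a short Python sketch. It is not Swift code: the property names (`site_catalog`, `max_concurrent_tasks`) are invented, and the precedence shown (command line over user over global) is an assumption that should be checked against the Swift documentation.

```python
from collections import ChainMap
from concurrent.futures import ThreadPoolExecutor

def effective_properties(global_props, user_props, cli_props):
    """Merge three property levels; maps listed first win on conflict."""
    return ChainMap(cli_props, user_props, global_props)

def run_throttled(tasks, limit):
    """Run zero-argument callables with at most `limit` running at once."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(lambda task: task(), tasks))

if __name__ == "__main__":
    props = effective_properties(
        {"site_catalog": "/etc/sites.xml", "max_concurrent_tasks": 4},  # global
        {"max_concurrent_tasks": 8},                                     # user
        {"site_catalog": "./my-sites.xml"},                              # command line
    )
    print(dict(props))
    tasks = [lambda: "t1", lambda: "t2", lambda: "t3"]
    print(run_throttled(tasks, limit=props["max_concurrent_tasks"]))
```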

For more information on Swift, see The SwiftScript User Guide [71] and the SwiftScript Language Reference Manual [72].

Pegasus

Planning for Execution in Grids [73] (Pegasus) is a workflow-mapping engine developed and used as part of several NSF ITR projects (GriPhyN, NVO, and SCEC-CME). Pegasus automatically maps high-level workflow descriptions onto distributed infrastructures such as the TeraGrid and Open Science Grid. Pegasus:

  • enables scientists to construct workflows in abstract terms without worrying about the details of the underlying Cyberinfrastructure,
  • provides robustness and reliability through dynamic workflow remapping,
  • automatically manages data generated during workflow execution and captures its provenance information,
  • is used in a variety of scientific applications, including astronomy, biology, earthquake science, gravitational-wave physics, and others,
  • is used day-to-day to map complex, large-scale scientific workflows with thousands of tasks processing terabytes of data onto the Grid.

Pegasus improves the performance of applications through data reuse to avoid duplication and increase reliability, workflow restructuring to improve resource allocation, and automated task and data transfer scheduling. Pegasus provides reliability through dynamic workflow remapping and DAGman workflow execution. Pegasus uses Condor and Globus middleware for distributed environments and:

  • provides a level of abstraction above gridftp, condor-submit, globus-job-run, and similar commands
  • provides automated mapping and execution (via DAGMan) of workflow applications onto distributed resources
  • manages data files, can store and catalog intermediate and final data products
  • improves successful application execution
  • improves application performance
  • provides provenance tracking capabilities
  • supplies client-side tools
  • provides an OSG-aware workflow management tool
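The data reuse mentioned above (skipping work whose products already exist) can be sketched as a backward walk from the requested files. This is a minimal illustration in Python under stated assumptions, not Pegasus's planner: the function `prune_for_reuse`, the task table layout, and the file names are all hypothetical, and the "catalog" is just a set of file names.

```python
def prune_for_reuse(tasks, wanted, catalog):
    """Return the names of tasks that still have to run.

    tasks   : task name -> (list of input files, list of output files)
    wanted  : files the user ultimately asks for
    catalog : files already available somewhere on the grid
    """
    producer = {out: name for name, (_ins, outs) in tasks.items() for out in outs}
    needed = set()
    pending = [f for f in wanted if f not in catalog]
    while pending:
        missing = pending.pop()
        name = producer.get(missing)
        if name is None:
            raise KeyError(f"no task produces {missing!r} and it is not in the catalog")
        if name in needed:
            continue
        needed.add(name)
        # Only inputs that are not already available require further work.
        pending.extend(i for i in tasks[name][0] if i not in catalog)
    return needed

# Hypothetical three-step workflow.
TASKS = {
    "extract": (["raw.dat"], ["clean.dat"]),
    "analyze": (["clean.dat"], ["stats.dat"]),
    "plot": (["stats.dat"], ["figure.png"]),
}

if __name__ == "__main__":
    print(prune_for_reuse(TASKS, ["figure.png"], {"raw.dat"}))
    print(prune_for_reuse(TASKS, ["figure.png"], {"raw.dat", "stats.dat"}))
```

When `stats.dat` is already cataloged, only the final plotting step remains, so the two upstream tasks and their data transfers are avoided.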

Pegasus usage examples are beyond the scope of version 1 of this cookbook but can be found in the GriPhyN Virtual Data System Quick Guide [74].


Security and security integration through authn/authz

Today we are encouraged to use more and more Internet-based services, be they online banking, concert ticket purchases, attractive free software options, instant messaging, blogs and wikis, or grid computing. As these services and their sources proliferate, the more nervous we might (and should) become about whether or not interaction with a service is secure. Available grid technologies provide various levels of and mechanisms for security within the grid environment they support.

As has been discussed already, grids often result when multiple institutions form virtual organizations to accomplish tasks beyond the ability of any single institution. These virtual organizations are structured with different members having various privileges, often based on roles and relying on agreed-upon methods to determine and enforce both roles and privileges. For example, some members might only have the right to develop and run software, while others might serve as community administrators. Through these processes, virtual organizations are better able to maintain the integrity of their resources and data, at least with respect to the VO itself.

The more difficult aspect is to maintain security across grid environments, or for resources that are connected to more than one grid environment. Standards development in several related security areas is taking place within the Open Grid Forum, including user authentication, authorization, and firewall issues [76]. In addition, organizations such as the IGTF (International Grid Trust Federation) are working to synchronize policies across grid initiatives in order to develop and maintain a global "trust fabric" that supports scalable and reliable identification of grid users and resources [77]. While this work is ongoing, research projects are also underway to develop the necessary supporting architecture and software. One notable example is the GridShib project [31], an NSF-funded project of NCSA and the University of Chicago, to integrate the federated authorization infrastructure of Shibboleth [32] with the Globus Toolkit.

As discussed earlier in the Cookbook, the Globus Toolkit provides security via X.509 credentials. Identity-based authorization is provided via access control lists ("gridmaps") mapping to local identities (Unix logins) and a Community Authorization Service (CAS). The Shibboleth project offers a large base of campus use around the world via a standards-based and open source implementation and a standard vocabulary for describing user attributes. With this, Shibboleth has resulted in a well-developed, federated identity management structure.

From the GridShib website [31], "The goal of GridShib is to allow interoperability between the Globus Toolkit® from the Globus Alliance [1] with Shibboleth [32] from Internet2 [33]. As a result, GridShib enables secure attribute sharing between Grid virtual organizations and higher-educational institutions." GridShib provides attribute-based authorization based on Shibboleth.

In addition, while basic security measures are in place to support virtual organizations, the size and complexity of the virtual organizations they can support are limited by the ability of resource managers to manage the privileges of each user in the virtual organization. To address this scaling issue, using GridShib, virtual organizations can use access control methods based on user attributes instead of identity. As a result, resource managers need not know all of the users in the virtual organization, just their attributes (for example, Data Analyst or Software Developer).
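The difference between the two approaches can be shown with a small Python sketch. It is purely illustrative and uses no Globus, Shibboleth, or GridShib APIs; the distinguished names, account names, actions, and function names are all invented for the example.

```python
# Identity-based: every user must be listed individually (a gridmap-style table).
GRIDMAP = {
    "/O=Example Grid/CN=Alice Example": "alice",
    "/O=Example Grid/CN=Bob Example": "bob",
}

def authorize_by_identity(subject_dn, gridmap):
    """Return the local account for a known subject, or None if unlisted."""
    return gridmap.get(subject_dn)

# Attribute-based: the resource only needs to know which attributes grant what.
POLICY = {
    "submit_job": {"Software Developer", "Data Analyst"},
    "read_dataset": {"Data Analyst"},
}

def authorize_by_attributes(action, user_attributes, policy):
    """Allow the action if the user holds at least one attribute the policy accepts."""
    return bool(policy.get(action, set()) & set(user_attributes))

if __name__ == "__main__":
    # A user the resource has never seen is rejected by the identity table ...
    print(authorize_by_identity("/O=Example Grid/CN=Carol Example", GRIDMAP))
    # ... but is accepted on the strength of an attribute asserted by her institution.
    print(authorize_by_attributes("read_dataset", ["Data Analyst"], POLICY))
```

The policy table grows with the number of roles rather than the number of users, which is the scaling benefit described above.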

The GridShib project has five basic goals:

  • Integrate X.509 and SAML [75] to provide enhanced Grid Security Infrastructure (GSI).
  • Enable attribute sharing between virtual organizations and higher-educational institutions.
  • Develop and implement profiles to securely share attributes across administrative domains.
  • Investigate attribute-based access policy enforcement for grids.
  • Generalize attribute-based authorization policies in the Globus Toolkit runtime environment.

GridShib has been developed around three use cases: established grid user, new grid user, and portal grid user. See the Technical Overview [34] for more details on current use cases and plug-ins.


Figure PC-2. The GridShib relationship.


Grid-enabling application toolkits


Overview of existing frameworks

There have been several attempts to build grid-enabling application toolkits in the past. These toolkits aim to provide client-side abstraction layers, mainly for grid middleware services and related dynamic 'features', in order to increase both the speed and ease with which grid applications can be deployed. Figure PC-14 illustrates the place of an application within a grid environment and its main interface to the grid (shown in red), which is the focus area for application-oriented grid-enabling toolkits.

The big picture

Figure PC-14. An application in a grid infrastructure.

Experience in developing different grid-enabling application toolkits suggests that ease of use is a key requirement. Ease of use includes:

  • Exposing a simple and consistent API which allows error tracing to be invariant,
  • Making upgrades easy to perform and not reliant on specific versions of grid middleware,
  • Exposing a well defined API that is designed to change rarely and to be upward compatible if changes are required,
  • Supporting implementations that allow dynamic exchange of key elements (possibly at runtime) and provide runtime abstractions,
  • Avoiding refactoring/recoding/recompilation whenever some underlying middleware component may have been changed,
  • Ideally, a grid application should be designed to run reliably locally and on the grid, over time, and in light of differences encountered in the grid environment (e.g. the operating system of various resources, versions of grid middleware, etc.).

Applications should also utilize well-known programming paradigms. For example, a file API should provide expected functionality (namely open, close, read, write, seek) versus the introduction of less-straightforward mechanisms (e.g., asking a discovery middleware service to tell the application the location of a middleware service that can then give the location of the requested file).
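As a sketch of what "expected functionality" means here, the Python class below exposes only the familiar read, write, seek, and close operations and keeps the location lookup out of the caller's sight. Everything in it is hypothetical: `GridFile`, the `LOCATIONS` and `STORAGE` tables (in-memory stand-ins for a discovery service and remote storage), and the logical file name are assumptions for illustration, not part of any toolkit.

```python
import io

# Hypothetical stand-ins for middleware: a lookup service and a remote store.
LOCATIONS = {"lfn:/demo/data.txt": "store-1"}
STORAGE = {"store-1": bytearray(b"hello grid")}

class GridFile:
    """File-like wrapper: the caller sees read/write/seek/close only."""

    def __init__(self, logical_name):
        # Discovery happens once, here, instead of in application code.
        self._store = LOCATIONS[logical_name]
        self._buf = io.BytesIO(bytes(STORAGE[self._store]))

    def read(self, size=-1):
        return self._buf.read(size)

    def write(self, data):
        return self._buf.write(data)

    def seek(self, offset, whence=io.SEEK_SET):
        return self._buf.seek(offset, whence)

    def close(self):
        # Write the buffer back to the (simulated) remote store.
        STORAGE[self._store] = bytearray(self._buf.getvalue())
        self._buf.close()

if __name__ == "__main__":
    f = GridFile("lfn:/demo/data.txt")
    print(f.read(5))
    f.seek(0)
    f.write(b"HELLO")
    f.close()
    print(bytes(STORAGE["store-1"]))
```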


Toolkit example: Simple API for Grid Applications (SAGA)

SAGA (Simple API for Grid Applications) has been defined by the Open Grid Forum (OGF) [17] as a high-level API that directly addresses the needs of application developers. The purpose of SAGA is two-fold:

  1. Provide a simple API that can be used with much less effort compared to the vanilla interfaces of existing grid middleware. A guiding principle for achieving this simplicity is the 80-20 rule: serve 80% of the use cases with 20% of the effort needed for serving 100% of all possible requirements, and
  2. Provide a standardized, portable, common interface across various grid middleware systems and their versions.

SAGA is a prominent recent API standardization effort that intends to simplify the development of grid-enabled applications, even for scientists having no background in computer science or grid computing. SAGA was heavily influenced by the work undertaken in the Gridlab [19] project, in particular by the Grid Application Toolkit (GAT) [15] — one of the first major attempts to build a high level API to grid services. A public call for use cases produced about 25 different examples that served as input to the SAGA development team.

The following examples of code show typical SAGA use cases and illustrate the intended simplicity of SAGA. The code is based on the SAGA C++ reference implementation [21] currently developed at the Center for Computation and Technology at LSU, Baton Rouge. Note that the code presented is completely independent of the underlying middleware services that are used.

SAGA Bulk File Copy 

Figure PC-15. Asynchronous bulk file copy in SAGA: a file is copied to a number of remote locations using well-known C++ programming paradigms.

Figure PC-15 above shows how easy it is to copy a set of files to different remote locations. Interestingly enough this code even applies certain optimization techniques, for instance the use of bulk operations if the available middleware services support this.

SAGA File and Job Management

Figure PC-16. SAGA file management and job submission. The code is designed to be independent of the deployed Grid middleware.

Figure PC-16 shows code that first backs up a remote file and then starts a (in this case trivial) job operating on the copied file, intercepting the standard input and output (console) streams this remote job may use.


Requirements Analysis

As outlined above, any grid enabling application toolkit must cope with a number of very dynamic requirements while providing a "simple" and "easy-to-use" API. The following explains these requirements in additional detail.

Dynamic Specification Landscape

The Open Grid Forum (OGF) [17] is an international standardization body whose primary objective is to define a set of standards in the emerging field of grid computing. OGF specifications will cover grid architectures, protocols, interfaces, and APIs. However, the whole field is young and the complexity of grids is not yet completely understood, in terms of academic research or for industrial and commercial applicability and impact. This fact, along with the complexity of the problem itself, is causing the grid specification landscape to evolve slowly. There are several significant gaps in the scope of standards being explored, and it is also generally expected that existing specifications will change. The time needed for grid standards to stabilize is estimated to be 5 to 10 years. However, the expectation for grid computing to solve real-world problems remains very high, partly due to the initial enthusiasm (or hyping) in the field. This dichotomy creates frustration for end users in particular, since scalability and interoperability are multi-layered problems and difficult to solve. These observations imply the necessity of an interface abstraction for early adopters to shield grid application development and deployment from the evolving grid landscape and provide a reasonable migration path to future grid systems. Thus, any SAGA implementation must include mechanisms for coping with evolving grid standards and changing grid environments.

Evolving Grid Specifications

The SAGA specification itself is currently limited and expected to expand in scope over time. In particular, it is expected that new SAGA extensions will be required to provide programming paradigms for emerging grid standards to application developers. The general look and feel of the SAGA specification, however, is thought to be more stable, and extensions are expected to be merely semantic (new objects, new method calls) with limited or no syntactical additions (no change to the object or task models, for instance). Any given SAGA implementation must be able to cope with future SAGA extensions easily, without breaking support and backward compatibility for early SAGA adopters and applications.

Dynamic Grid Environment

As grid middleware evolves, deployed grid environments face constant changes of middleware deployments (e.g., new versions and services are rolled out frequently, often with unclear migration paths). Grid environments are also dynamic by design, with respect to the availability of services and other resources. Any application designed to run on grids needs to implement fail safety mechanisms for coping with such changes and must not rely on the static configuration or availability of resources. Much of this dynamism, however, can be hidden from the application programmer through the use of APIs and toolkits. For example, an upgrade in a service's protocol version could be handled in the client libraries communicating with the service and not at the application level. Resource discovery, fail safety on service failures, and simple fallbacks such as redundant service deployments are other examples of mechanisms that are vital to the successful running of a grid application but should ideally not need to be provided in application code. A SAGA implementation must allow for and, where possible, actively support fail safety mechanisms, and hide the dynamic nature of grid resource availability from the application.

Heterogeneous Grid Environment

The dynamism of grid environments is also reflected in their potentially heterogeneous nature. Although many current grids focus on or are heavily based on Linux-based clusters, grids conceptually are designed to cope with any OS (real or virtual) running on any platform. (The predominance of Linux is more an indication of the state of grid middleware development today than an intentional design.) A SAGA implementation must be portable and platform independent, both syntactically and semantically.

Distributed Grid Applications

Within the domain of distributed applications, which always imply remote communication, latency considerations play a major role in application design and implementation. A number of application domains have emerged that cope particularly well with latencies present in distributed environments, by loosely coupling distributed components or utilizing latency hiding techniques. Latency hiding techniques (such as caches, bulk operations, and interleaving of computation and communication) often require application level information (e.g. concurrency information of operations) to be effective. A library designed for distributed applications must allow these and other latency hiding techniques to be implemented.
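A bulk operation is perhaps the simplest of these techniques to illustrate: if the application knows that a set of requests is independent, they can share one round trip instead of paying the latency once per request. The Python sketch below counts round trips rather than measuring time; `RemoteStore` and both fetch functions are hypothetical stand-ins for a remote service and its client.

```python
class RemoteStore:
    """Stand-in for a remote service; counts round trips instead of sleeping."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1          # one request, one network round trip
        return self._data[key]

    def get_many(self, keys):
        self.round_trips += 1          # many requests share a single round trip
        return [self._data[k] for k in keys]

def fetch_one_by_one(store, keys):
    return [store.get(k) for k in keys]

def fetch_bulk(store, keys):
    return store.get_many(keys)

if __name__ == "__main__":
    data = {f"file{i}": i * 10 for i in range(5)}
    keys = sorted(data)

    slow = RemoteStore(data)
    fetch_one_by_one(slow, keys)
    fast = RemoteStore(data)
    fetch_bulk(fast, keys)
    print("round trips:", slow.round_trips, "versus", fast.round_trips)
```

Only the caller knows that the five lookups do not depend on one another, which is the application-level information the paragraph above refers to.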

End User Requirements

As previously noted, the SAGA specification was developed based on the responses to a call for use cases from the grid community and is designed to meet the resulting end user requirements. An API implementation must meet other end user requirements outside the scope of the actual API specification, such as ease of deployment, ease of configuration, documentation, and support of multiple language bindings. If any of these properties is missing, acceptance and utility within the targeted user community can be severely limited.


The SAGA C++ Reference Implementation

Thus far, we have covered the motivation and design objectives for a SAGA implementation. This section will summarize the resulting properties of the SAGA C++ reference implementation from an end user perspective. The following picture shows the overall architecture of the SAGA implementation.

Figure PC-17. SAGA architecture: A lightweight engine dispatches SAGA calls to dynamically loaded middleware adaptors.

Design Objectives

Although SAGA by definition is intended to be simple for application developers, this doesn't imply that the implementation itself has to be simple. Logic and functionality built into the SAGA library core provide common functionality that can be extended through minimal effort. Ideally, adding a new API class is orthogonal to all other properties of the implementation, and also immediately benefits from those. The library is also designed to be easy to build, use, and deploy. As described above, a SAGA implementation must cope with a multitude of different dynamic requirements. A major design objective was to maximize decoupling of different components of the developed library to provide as much flexibility, adaptability and modularity as possible. The SAGA C++ Reference Implementation was also designed for maximum portability, anticipating use on different platforms and operating systems.

The SAGA C++ Reference Implementation library is divided into three dimensions, which are described below. These three dimensions are completely orthogonal — the user of the library may use and combine these freely and develop additional suitable components usable in tight integration with the provided modules.

Horizontal Extensibility — API Packages

The SAGA specification is object oriented and defines a set of API groups keeping objects of related functionality together (packages). The SAGA C++ Reference Implementation uses this functional grouping to define API packages. Current packages are: file management, job management, remote procedure calls, replica management, and data streaming. Each of these packages constitutes a separate and independent module. These modules depend only on the SAGA engine; the user is free to use and link only those modules actually needed by the application, minimizing the memory footprint. New API packages are expected to be added as the SAGA specification evolves. Adding new packages is straightforward due to the fact that all necessary common operations (such as adaptor loading and selection, or method call routing) are imported from the SAGA engine.

Vertical Extensibility — Middleware Bindings

A layered architecture allows for vertical decoupling of the SAGA API from the middleware. Separate adaptors, either loaded at runtime or pre-bound at link time, dispatch the various API function calls to the appropriate middleware. Usually there will be a separate set of adaptors for each type of supported middleware. These adaptors implement a well-defined Capability Provider Interface (CPI) and expose that to the top layer of the library, which makes it possible to switch adaptors at runtime and switch between different (and even concurrent) middleware services providing the requested functionality. The top library layer dispatches the API function calls to the corresponding CPI function. It also contains the SAGA engine module, which implements:

  • core SAGA objects such as session, context, task or task container — these objects are responsible for the SAGA look and feel, and are needed by all API packages,
  • common functions to load and select matching adaptors, to perform generic call routing from API functions to the selected adaptor, and to provide necessary fallback implementations for the synchronous and asynchronous variants of the API functions (if these are not supported by the selected adaptor).

The dynamic nature of this layered architecture enables easy future extensions through the addition of new adaptors, helping to cope with emerging grid standards and new grid middleware.
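The fallback between synchronous and asynchronous variants can be pictured with a small Python sketch. This is a conceptual illustration only and does not reproduce the SAGA C++ engine or its CPI: the class names, the `copy` / `copy_async` naming convention, and the use of a thread pool are all assumptions made for the example.

```python
from concurrent.futures import Future, ThreadPoolExecutor

class SyncOnlyAdaptor:
    """Hypothetical adaptor that implements only the blocking variant."""
    def copy(self, src, dst):
        return f"copied {src} -> {dst}"

class AsyncOnlyAdaptor:
    """Hypothetical adaptor that implements only the non-blocking variant."""
    def copy_async(self, src, dst):
        future = Future()
        future.set_result(f"copied {src} -> {dst}")
        return future

class Engine:
    """Supplies whichever variant the selected adaptor is missing."""

    def __init__(self, adaptor):
        self._adaptor = adaptor
        self._pool = ThreadPoolExecutor(max_workers=4)

    def call_sync(self, name, *args):
        native = getattr(self._adaptor, name, None)
        if native is not None:
            return native(*args)
        # Fallback: start the asynchronous variant and wait for it.
        return getattr(self._adaptor, name + "_async")(*args).result()

    def call_async(self, name, *args):
        native = getattr(self._adaptor, name + "_async", None)
        if native is not None:
            return native(*args)
        # Fallback: run the synchronous variant on a worker thread.
        return self._pool.submit(getattr(self._adaptor, name), *args)

if __name__ == "__main__":
    task = Engine(SyncOnlyAdaptor()).call_async("copy", "a.dat", "b.dat")
    print(task.result())
    print(Engine(AsyncOnlyAdaptor()).call_sync("copy", "a.dat", "b.dat"))
```

Because the fallback lives in the engine, an adaptor author can implement one variant and the application still sees both.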

Extensibility for Optimization and Features

Many features of the engine module are implemented by intercepting, analyzing, managing, and rerouting function calls between the API packages (where they are issued) and the adaptors (where they are executed and forwarded to the middleware). To generalize this management layer, a PIMPL (Private Implementation) idiom was chosen, and is rigorously used throughout the SAGA implementation.

This PIMPL layering allows for a number of additional properties to be transparently implemented, and experimented with, without any change in the API packages or adaptor layers. These features include:

  • generic call routing
  • task monitoring and optimization
  • security management
  • late binding
  • fallback on adaptor invocation errors
  • latency hiding mechanisms

These features can essentially be decoupled from the API and the adaptors because these properties affect only the IMPL side of the PIMPL layers. Firstly, the private implementation classes all inherit from the same base class, but that base class is handled in the central engine module, so the engine can automatically cope with new API packages and adaptors. Secondly, all method calls are also handled generically in the engine, which is loosely coupled to both the API and adaptor layers. Any changes to the engine, including optimizations, latency hiding techniques, and monitoring features, can be implemented in the engine generically and are orthogonal to the API and adaptor extensions. Hence, the extensibility of the engine represents the third orthogonal axis in the library's extensibility scheme.
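The layering can be pictured roughly as follows. The sketch is written in Python for brevity, so it is only an analogue of the C++ PIMPL idiom, and none of the names (`File`, `_FileImpl`, `Engine.route`, `AdaptorError`, the two adaptors, the `demo://` URL) come from the SAGA reference implementation. It shows a public API object that holds nothing but a private implementation, with every call routed through an engine that binds to an adaptor at call time and falls back to the next adaptor on error.

```python
class AdaptorError(Exception):
    """Raised by an adaptor that cannot serve a request."""

class Engine:
    """Generic call routing with late binding and fallback on adaptor errors."""

    def __init__(self, adaptors):
        self._adaptors = list(adaptors)

    def route(self, method, *args):
        errors = []
        for adaptor in self._adaptors:
            handler = getattr(adaptor, method, None)   # looked up per call
            if handler is None:
                continue
            try:
                return handler(*args)
            except AdaptorError as exc:
                errors.append(f"{type(adaptor).__name__}: {exc}")
        raise AdaptorError(f"no adaptor could handle {method!r}: {errors}")

class _FileImpl:
    """Private implementation: knows about the engine, not about adaptors."""
    def __init__(self, engine, url):
        self._engine, self._url = engine, url
    def size(self):
        return self._engine.route("file_size", self._url)

class File:
    """Public API class: holds only a reference to its private implementation."""
    def __init__(self, engine, url):
        self._impl = _FileImpl(engine, url)
    def size(self):
        return self._impl.size()

class BrokenAdaptor:
    def file_size(self, url):
        raise AdaptorError("service unreachable")

class TableAdaptor:
    SIZES = {"demo://host/data.bin": 1024}
    def file_size(self, url):
        return self.SIZES[url]

if __name__ == "__main__":
    engine = Engine([BrokenAdaptor(), TableAdaptor()])
    print(File(engine, "demo://host/data.bin").size())
```

Adding monitoring or caching would mean changing only `Engine.route`; neither `File` nor the adaptors would need to be touched, which is the orthogonality the text describes.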

Uniform for Programming Languages

The SAGA API specification is language-independent. However, the goal is to define language bindings that provide a language-native look and feel to the API user while striving for syntactic and semantic similarity across all SAGA language bindings. One of the consequences of this goal is that the API specification does not use language-specific constructs, for instance C++ templates, which are thought to be too difficult to express uniformly over many languages. Also, the specification tries to be concise about object state management, and expresses semantics for shallow and deep copies. The SAGA C++ Reference Implementation follows the SAGA API specification closely in this respect. It is designed to accommodate wrappers in other languages, to provide the same semantics, and similar look and feel to other language bindings. A Python wrapper is currently being developed and is in alpha status, and there are plans to add similar thin wrappers to provide bindings to C, FORTRAN, Perl, and possibly others. From another point of view, it is extremely convenient to be able to implement adaptors in different languages. The Grid Application Toolkit (GAT, [15]), a C-based API predecessor of SAGA, already allows adaptors in different languages, and similar mechanisms may be implemented to allow Python or C based adaptors as well. In particular, Python based adaptors have been extremely useful for rapid prototyping of middleware bindings for GAT.

Generic with Respect to Middleware, and Adaptable to Dynamic Environments

The dynamism of grid middleware has already been mentioned as a central dominating property of grid environments. This is addressed in the SAGA C++ Reference Implementation by the described adaptor mechanism that binds to diverse middleware. Additionally, late binding, fallback mechanisms, and flexible adaptor selection allow for additional resilience against a dynamic and evolving run time environment. It is noted, however, that adaptors need to deploy mechanisms like resource discovery and to implement fully asynchronous operations, if the complete software stack is to be able to cope with dynamic grids. SAGA implementation usability can be severely impacted if the quality of adaptors undermines the library's mechanisms.

Modularity Makes the Implementation Extensible

We have described how the SAGA C++ Reference Implementation will be able to cope with the expected evolution and extension of the SAGA API. Further, the adaptor mechanism allows for easy extension of the library to provide additional middleware bindings. In fact, the major future work for this SAGA implementation will be to provide multiple sets of stable adaptors for the major grid environments. This task, however, requires considerably more effort than the implementation of the present library, and it is hoped that grid middleware vendors will be motivated to support and maintain these adaptors. Ideally, middleware vendors will implement adaptors for SAGA and deliver them as part of their client-side software stack in the same way that they provide MPI implementations. This would be a major step towards widespread adoption and would benefit grid applications.

Portability and Scalability

Heterogeneous distributed systems naturally require portable code bases. The SAGA C++ Reference Implementation library strictly adheres to the C++ standard and uses portable libraries. To further ensure compatibility, the library is developed concurrently on Windows and Linux, the two major target platforms. Problems on other platforms are not expected either. It should be noted, however, that the portability of the SAGA implementation depends on the portability of the adaptors, and thus on the portability of the grid middleware client interfaces, which can be a much greater problem than the library code itself.

Distributed applications are often sensitive to scalability issues too, particularly with respect to remote communications. As SAGA introduces a number of communication mechanisms, scalability concerns are naturally raised with respect to SAGA implementations. First, the SAGA API does not target high-performance communication schemes, but tries to utilize simple communication paradigms. In no sense does SAGA intend to replace MPI or other distributed communication libraries. Having said that, the design allows for zero-copy implementations of the SAGA communication APIs and also for fast asynchronous notification on events. Both of these are deemed critical for implementing scalable distributed applications.
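
Notification on events, as opposed to polling, can be sketched with a small callback registry. EventSource, on_event, and fire are hypothetical names used only for illustration and are not part of the SAGA API. In this simplified sketch the callbacks run synchronously in whichever thread calls fire(); an actual implementation would typically invoke fire() from a communication or monitoring thread so that the application is notified asynchronously.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event source: callers register callbacks once and are
// notified when an event fires, instead of polling for state changes.
class EventSource {
public:
    using Callback = std::function<void(const std::string&)>;

    void on_event(Callback cb) { callbacks_.push_back(std::move(cb)); }

    // Would be called by the communication layer when data or a state
    // change arrives; here it simply invokes every registered callback.
    void fire(const std::string& event) const {
        for (const Callback& cb : callbacks_) cb(event);
    }

private:
    std::vector<Callback> callbacks_;
};
```

The application registers its interest once and does no work until an event arrives, which avoids the repeated remote status queries that limit scalability in polling designs.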

Simplicity for the End User

SAGA is designed to be simple to use. However, simplicity in use of an API is not only determined by the API specification, but also by its implementation. Characteristics that need attention while implementing the SAGA API include simple deployment and configuration, resilience against lower-level failures, adaptability to diverse environments, stability, correctness, and peaceful coexistence with other programming paradigms, tools, and libraries. It is a challenge to keep a library implementation simple, with readable code, but a modular approach helps. For example, it is simple to hide the generic call routing or the adaptor selection in the engine module, since these features are not usually exposed to the user or adaptor programmer. At the same time, modeling these central properties as modules can significantly increase the readability and maintainability of the code. Due to its notion of asynchronous operations, or tasks, the SAGA API implicitly introduces a concurrent programming model. The C++ language binding of the API allows that model to be combined with arbitrary mechanisms for managing concurrent program elements (i.e., to ensure object state consistency in all circumstances, to ensure thread safety, and to allow for application-level semaphores and mutexes).
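
How asynchronous tasks combine with application-level locking can be sketched using only the C++ standard library. The sketch does not use the SAGA task interface: std::async and std::future stand in for tasks, and TransferLog, fake_copy, and copy_all are hypothetical names. The fake_copy function is a placeholder for a remote operation that a real adaptor would carry out.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical shared state updated by several concurrent operations;
// the mutex keeps the object consistent under concurrent access.
class TransferLog {
public:
    void record(const std::string& entry) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.push_back(entry);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::string> entries_;
};

// Stand-in for a remote operation; a real adaptor would contact middleware here.
std::size_t fake_copy(const std::string& source, TransferLog& log) {
    log.record("copied " + source);
    return source.size();
}

// Starts one asynchronous task per source, then waits for all of them
// and sums their results.
std::size_t copy_all(const std::vector<std::string>& sources, TransferLog& log) {
    std::vector<std::future<std::size_t>> tasks;
    for (const std::string& s : sources) {
        tasks.push_back(std::async(std::launch::async, fake_copy, s, std::ref(log)));
    }
    std::size_t total = 0;
    for (auto& t : tasks) total += t.get();  // get() blocks until the task is done
    return total;
}
```

All tasks are started before any result is collected, so the operations can overlap, and the mutex inside TransferLog is the kind of application-level protection the paragraph above refers to. On some platforms the program must be built with thread support enabled (for example, the -pthread option of GCC).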

More information about the SAGA C++ Reference Implementation (currently being developed at the Center for Computation and Technology at Louisiana State University) and about various aspects of grid-enabling toolkits is available on the SAGA implementation home page [20].


Programming examples

As with any complex task, the best way to learn is by doing and by examining how others have approached and handled common situations. In future versions of this Cookbook, we hope to include here several examples of code and scripts that illustrate common programming techniques, some of which can also serve as building blocks to be adapted and customized for specific applications. Some topics we will be looking to cover include:

  • Performing grid operations
  • Grid-enabled applications
  • OpenMP and MPI
  • VO and Experiment implementation examples

If you have expertise in any of these or related areas, or experience in successful grid application programming, and are willing to contribute explanations, working examples, code snippets, or "hints & tips", please contact the co-editors to let us know!


Bibliography

[1] Globus home page (http://www.globus.org/)
[2] Condor GT 4.0 Pre WS GRAM (http://www.globus.org/toolkit/docs/4.0/execution/prewsgram/)
[3] Unicore home page (http://www.unicore.org/)
[4] SOAP (Simple Object Access Protocol) (http://www.w3.org/TR/soap/)
[5] SDL (Specification and Description Language) (http://www.sdl-forum.org/)
[6] GT 4.0 GridFTP (http://www.globus.org/toolkit/docs/4.0/data/gridftp/)
[7] GRAM (GT 4.0 Pre WS GRAM) (http://www.globus.org/toolkit/docs/4.0/execution/prewsgram/)
[8] W3 (World Wide Web Consortium) (http://www.w3.org/)
[9] WSDL (Web Services Description Language) (http://www.w3.org/TR/wsdl)
[10] WSRF (Web Services Resource Framework) download (http://www.oasis-open.org/committees/download.php/16654/wsrf-cs-01.zip)
[11] SOAP (Simple Object Access Protocol) (http://www.w3.org/TR/soap/)
[12] Condor home page (http://www.cs.wisc.edu/condor/)
[13] Unicore home page (http://www.unicore.org/)
[14] abstraction layer (Wikipedia definition) (http://en.wikipedia.org/wiki/Abstraction_layer)
[15] GAT (Grid Application Toolkit and Testbed) (http://www.gridlab.org/wp-1)
[16] SAGA (Simple API for Grid Apps) (https://forge.gridforum.org/projects/saga-rg/)
[17] OGF (Open Grid Forum) (http://www.ogf.org)
[19] Gridlab (http://www.gridlab.org)
[20] SAGA implementation home page (http://fortytwo.cct.lsu.edu:8000/SAGA)
[21] SAGA C++ reference implementation (http://www.cct.lsu.edu/projects/Grid+Application+Toolkit)
[22] Monitoring and Discovery System (http://www.globus.org/toolkit/mds/)
[23] Grid Laboratory Uniform Environment (http://forge.gridforum.org/sf/projects/glue-wg)
[24] DataTAG (http://datatag.web.cern.ch/datatag/)
[25] GridForge (http://forge.gridforum.org/sf/sfmain/do/home)
[26] A Globus Primer (http://www.globus.org/toolkit/docs/4.0/key/GT4_Primer_0.6.pdf)
[27] MDS web pages (http://www.globus.org/mds)
[28] Globus Monitoring and Discovery (2005 Globus World) (http://www.globus.org/toolkit/presentations/GlobusWorld_2005_Session_9c.pdf)
[29] GridLab (http://www.gridlab.org/)
[30] iGrid (http://sara.unile.it/%7Ecafaro/software.html)
[31] GridShib website (http://gridshib.globus.org/about.html)
[32] Shibboleth (http://shibboleth.internet2.edu/)
[33] Internet2 (http://www.internet2.edu/)
[34] GridShib Technical Overview (http://grid.ncsa.uiuc.edu/presentations/gridshib-tech-overview-apr06.ppt)
[35] OpenPBS (http://www-unix.mcs.anl.gov/openpbs/)
[36] LSF (http://www.platform.com/Products/Platform.LSF.Family/)
[37] LoadLeveler (http://www-128.ibm.com/developerworks/eserver/library/es-loadlevel/index.html)
[38] Maui (http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php)
[39] Moab (http://www.clusterresources.com/pages/products/moab-cluster-suite.php)
[40] Globus Resource Allocation Manager (http://tinyurl.com/2shtao)
[41] Torque (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
[42] e-Compute (http://www.altair.com/software/ecompute.htm)
[43] Nordugrid (http://www.nordugrid.org/)
[44] Condor User Tutorial, UK Condor Week, NeSC, October 2004 (http://www.nesc.ac.uk/talks/438/11th/user_tutorial.ppt)
[45] Condor manual (http://www.cs.wisc.edu/condor/manual/v6.4/)
[46] AIST Grid Scheduling System (http://www.aist.go.jp/aist_e/aist_today/2006_20/hot_line/hot_line_21.html)
[47] NAREGI GridVM (http://tinyurl.com/24nenc)
[48] Keahey's Virtual Workspace (http://workspace.globus.org/papers/)
[49] Globus Toolkit GRAM (http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4045)
[50] PBSPro (http://www.altair.com/software/pbspro.htm)
[51] GridFTP (http://www.globus.org/toolkit/docs/4.0/data/gridftp/)
[52] Reliable File Transfer (http://www.globus.org/toolkit/docs/4.0/data/rft/)
[53] GT 4.0 RFT Command Reference (http://www.globus.org/toolkit/docs/4.0/data/rft/RFT_Commandline_Frag.html)
[54] GT 4.0 RLS (http://www.globus.org/toolkit/docs/4.0/data/rls/)
[55] Network Storage Technology (http://www.cs.wisc.edu/condor/nest/)
[56] Flexibility, Manageability, and Performance in a Grid Storage Appliance (http://www.cs.wisc.edu/condor/nest/papers/nest-hpdc-02.pdf)
[57] SRM/DRM (https://twiki.grid.iu.edu/twiki/bin/view/Integration/SrmDrm)
[58] srmcp (https://twiki.grid.iu.edu/twiki/bin/view/Documentation/StorageSrmcpUsing)
[59] Gratia twiki page (https://twiki.grid.iu.edu/twiki/bin/view/Accounting/WebHome)
[60] Full Project Definition (https://twiki.grid.iu.edu/twiki/bin/viewfile/Accounting/WebHome?filename=AccountingProjectDefinition1.doc)
[61] SweGrid Accounting System (http://www.sgas.se/)
[62] SGAS Download (http://www-unix.globus.org/toolkit/docs/4.0/techpreview/sgas/)
[63] SGAS Installation and Administration Guide (http://www.sgas.se/docs/SGASInstallConfig.pdf)
[64] SGAS Administration Guide (http://www.sgas.se/docs/SGASAdmin.pdf)
[65] OSG Gratia Project (https://twiki.grid.iu.edu/twiki/bin/view/Accounting/WebHome)
[66] Directed Acyclic Graph Manager (DAGman) (http://www.cs.wisc.edu/condor/dagman/)
[67] Condor manual (http://www.cs.wisc.edu/condor/manual/v6.4/)
[68] Swift (http://www.ci.uchicago.edu/swift/index.php)
[69] GriPhyN Virtual Data System (http://www.griphyn.org/news/index.html)
[70] A Swift Tutorial (http://www.ci.uchicago.edu/swift/guides/tutorial.php)
[71] The SwiftScript User Guide (http://www.ci.uchicago.edu/swift/guides/userguide.php#engineconfiguration)
[72] Swiftscript Language Reference Manual (http://www.ci.uchicago.edu/swift/guides/languagespec.php)
[73] Planning for Execution in Grids (http://pegasus.isi.edu/)
[74] GriPhyN Virtual Data System Quick Guide (http://pegasus.isi.edu/docs/QuickGuide.pdf)
[75] Security Assertion Markup Language (SAML) (http://xml.coverpages.org/saml.html)
[76] Open Grid Forum Security groups (http://www.ogf.org/gf/group_info/areasgroups.php?area_id=7)
[77] International Grid Trust Federation (IGTF): Grid's Policy Management Authority (http://www.gridpma.org/)

© 2006-8, Southeastern Universities Research Association
Sponsored by SURA, TATRC (No. W81XWH-06-1-0419), OSG, and iVDGL
Updated September, 2007