Using SweGrid resources

From SNIC Documentation
Revision as of 12:30, 28 September 2011 by Joel Hedlund (NSC) (talk | contribs)

This chapter goes through the basic steps needed to run jobs on the grid. The procedure is not much different from the one used on a normal cluster resource:

  1. Authentication
  2. Defining the job parameters
  3. Job submission
  4. Job monitoring

Authentication on the grid is done by delegation. A short-lived proxy certificate is created and delegated to the resource used (see the section on creating a proxy certificate). This gives the resource the mandate to act as the delegating user when accessing storage resources specified by the job.

On a normal compute resource, jobs are typically submitted to the queuing system as a special script containing job parameters such as walltime, number of processors and memory requirements. The job script also contains the actual statements that execute the job. Jobs on grid resources consist of a job description written in one of the available job description languages (XRSL, JSDL or JDL) and a set of input files and scripts. On the current SweGrid resources, the ARC middleware 0.8.x supports only the XRSL job description language. The job description describes the job parameters and files. The description file itself is not executable, but contains references to the file that will be used to execute the job.

Job submission and monitoring are done in much the same way as on an existing resource. The only difference is that the tools for job submission and monitoring are executed on the user's own computer.

Describing your grid job

When a job is to be submitted to a SweGrid resource, it is described in a special task description language. The NorduGrid software uses XRSL for describing a grid task. An XRSL description contains a set of attribute definitions. The file starts with a & (AND) to define the default relation between the attributes. Every attribute definition is enclosed in ( and ). An example:

&(executable="/bin/echo")(arguments="Hello, World!")

This example defines two attributes: executable, which names the executable to run (/bin/echo), and arguments, which gives the arguments passed to the executable. The & means that all attributes must be used. The job writes "Hello, World!" to its standard output.
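
Because an XRSL description is just a sequence of parenthesised attribute definitions, it can also be generated by a script. The following Python sketch is not part of ARC: the helper name to_xrsl is hypothetical, and it only handles simple string-valued attributes that use the = operator.

```python
def to_xrsl(attributes):
    """Render simple string attributes as a one-line XRSL description.

    Hypothetical helper, not part of ARC. It assumes every value is a
    plain string without embedded double quotes.
    """
    parts = []
    for name, value in attributes.items():
        if '"' in value:
            raise ValueError("embedded quotes are not handled in this sketch")
        parts.append('({}="{}")'.format(name, value))
    # The leading & means that all attributes must be used.
    return "&" + "".join(parts)
```

Calling to_xrsl({"executable": "/bin/echo", "arguments": "Hello, World!"}) reproduces the example description shown above.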

Specifying an executable and arguments

An executable specified without a leading slash is treated as a local file, which is transferred to the remote system and executed there. If the executable is given with a leading slash, it is treated as if it were located on the remote machine. See the following examples:

&(executable="/bin/echo")(arguments="Hello, World!")

In this example the executable /bin/echo is treated as a remote executable located in the system folder /bin.

&(executable="bin/echo")(arguments="Hello, World!")

In this example the executable is treated as a local file located in a directory bin relative to the current directory and automatically transferred to the remote system.

The executable will be called with the arguments specified in the arguments attribute as shown in the above examples.

Handling job input and output

In the previous example the output of the application is silently thrown away. If the application needs input, or its output should be kept, this can be specified using the attributes stdin, stdout and stderr. The values of these attributes are file names. If stdin is used, the file used for standard input must be transferred as an input file, see the section "Specifying input and output files". An example using standard output and standard error is shown in the following listing:

&
(executable="/bin/ls")
(arguments="-la")
(stdout="stdout.txt")
(stderr="stderr.txt")

Here standard output is directed to the file stdout.txt and standard error to stderr.txt. If standard input is used a typical XRSL description can be:

&
(executable="myapp")
(stdout="stdout.txt")
(stderr="stderr.txt")
(stdin="stdin.txt")
(inputFiles=("stdin.txt" ""))

As stdin refers to a file, that file has to be transferred to the resource. This is done by the last attribute, inputFiles, which is described in more detail in the section "Specifying input and output files".

Giving jobs meaningful names

To make jobs easier to retrieve, a meaningful name can be given to a job using the jobName attribute. This name can then be used by the ARC commands instead of the normal job id to refer to the job. In the following example the job is given the name job0001:

&
(executable="myapp")
(stdout="stdout.txt")
(stderr="stderr.txt")
(stdin="stdin.txt")
(inputFiles=("stdin.txt" ""))
(jobName="job0001")

Specifying input and output files

Jobs often depend on a set of input files that must be transferred to the job directory on the grid resource before execution. When the job has finished, it often also produces a set of output files that should be kept or transferred to other storage resources. Input and output files are defined using the inputFiles and outputFiles attributes in XRSL.

Input and output files can be transferred to and from resources other than the user's client machine. The default is to use the user's local directory to send and receive files. Extra parameters can, however, be given to specify files located at specific URLs.

The syntax for the inputFiles directive is as follows:

(inputFiles=(<filename> <source>) ... )

filename is the name of the file that will be written to the job directory. source specifies the location from which the input file will be retrieved. The source can be either a local path or a URL. If source is empty, the input file is taken from the directory from which the job is submitted.

The syntax for the outputFiles directive is as follows:

(outputFiles=(<string> <URL>) ... )

string is a file located in the job directory on the computational resource. URL sets the destination where the output file should be transferred after job execution. If string is set to "/" and the URL is empty the entire job directory is kept for later retrieval by the user. If the string is set to "/" and the URL is not empty the entire job directory is transferred to the destination.

In the following example all input files are located in the job submission directory. The output files outputfile1.dat and outputfile2.dat will be kept in the job directory on the computational resource until the user retrieves them:

&
(executable="myapp")
(stdout="stdout.txt")
(stderr="stderr.txt")
(stdin="stdin.txt")
(inputFiles=
    ("stdin.txt" "")
    ("datafile1.dat" "")
    ("datafile2.dat" "")
)
(outputFiles=
    ("outputfile1.dat" "")
    ("outputfile2.dat" "")
)

A similar example, using only files located on external resources is shown below:

&
(executable="myapp")
(stdout="stdout.txt")
(stderr="stderr.txt")
(stdin="stdin.txt")
(inputFiles=
    ("stdin.txt" "http://www.swegrid.se/example/stdin.txt")
    ("datafile1.dat" "gsiftp://swegrid.se/storage/datafile1.dat")
    ("datafile2.dat" "rc://swegrid.se/datafile2.dat")
)
(outputFiles=
    ("outputfile1.dat" "srm://swegrid.se/storage/outputfile1.dat")
    ("outputfile2.dat" "srm://swegrid.se/storage/outputfile2.dat")
)

Sometimes it is useful to transfer all files in the output directory or input directory. The following example shows how this is accomplished:

&
(executable="myapp")
(stdout="stdout.txt")
(stderr="stderr.txt")
(stdin="stdin.txt")
(inputFiles=
    ("/" "")
)
(outputFiles=
    ("/" "")
)

A URL can also be used in conjunction with the "/" file name. This will transfer the entire output directory to the specific resource defined in the URL.
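
The inputFiles and outputFiles attributes share the same shape, a list of quoted pairs, so they can be produced by one small routine when job descriptions are generated by a script. The Python sketch below is only an illustration: the function name file_attribute is hypothetical, and the layout simply mirrors the listings above.

```python
def file_attribute(name, pairs):
    """Format an inputFiles or outputFiles attribute from (file, URL) pairs.

    Hypothetical helper, not part of ARC. An empty URL stands for the
    submission directory (input) or for keeping the file on the
    resource until it is retrieved (output).
    """
    lines = ["({}=".format(name)]
    for filename, url in pairs:
        lines.append('    ("{}" "{}")'.format(filename, url))
    lines.append(")")
    return "\n".join(lines)
```

For example, file_attribute("outputFiles", [("/", "")]) yields the attribute that keeps the entire job directory for later retrieval.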

Specifying resource usage

An important part of the job description is specifying job resource limits such as required walltime, nodes and memory. If any of these parameters are not given the default limits of the resource will be used, which can differ on different resources.

Walltime is given by the wallTime attribute. The time can be given in many different units. If no unit is specified, minutes are assumed. The following list shows different allowed wallTime specifications:

1 week
3 days
2 days, 12 hours
1 hour, 30 minutes
36 hours
9 days
240 minutes
240
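
To compare such specifications, for instance when checking a job description before submission, they can be normalised to minutes. The Python sketch below is not the parser used by ARC. It is a hypothetical helper that assumes only the units appearing in the list above (minutes, hours, days, weeks), separated by commas.

```python
import re

# Minutes per unit; only the units that appear in the list above.
MINUTES_PER_UNIT = {"minute": 1, "hour": 60, "day": 24 * 60, "week": 7 * 24 * 60}


def walltime_to_minutes(spec):
    """Convert a wallTime specification such as "2 days, 12 hours" to minutes."""
    total = 0
    for part in str(spec).split(","):
        match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", part.lower())
        if match is None:
            raise ValueError("cannot parse wallTime part: {!r}".format(part))
        number, unit = int(match.group(1)), match.group(2)
        if len(unit) > 1 and unit.endswith("s"):
            unit = unit[:-1]  # "hours" -> "hour"
        unit = unit or "minute"  # no unit means minutes
        if unit not in MINUTES_PER_UNIT:
            raise ValueError("unknown wallTime unit: {!r}".format(unit))
        total += number * MINUTES_PER_UNIT[unit]
    return total
```

With this helper, "240" and "240 minutes" both give 240, and "2 days, 12 hours" gives 3600.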

Memory and disk requirements are given in MB, usually with a relational operator such as >=, since you often need at least a certain amount of memory or disk. The disk attribute is to be avoided, as it is better to specify disk requirements using a runtime environment, described in the following section. The following example illustrates how the wallTime and memory attributes can be used:

&
(executable="myapp")
(stdout="stdout.txt")
(stderr="stderr.txt")
(stdin="stdin.txt")
(wallTime=240)
(memory>=500)
(inputFiles=
    ("stdin.txt" "")
    ("datafile1.dat" "")
    ("datafile2.dat" "")
)
(outputFiles=
    ("outputfile1.dat" "")
    ("outputfile2.dat" "")
)

Runtime environments

A runtime environment is a special script that sets up a number of standard variables and search paths for applications or special application needs. The runtime environment shields the user from differences between the available grid resources. The available runtime environments are published in the information system, which guarantees that jobs will only be submitted to resources with the correct environments installed.

There are a number of supported runtime environments available on the SweGrid resources, listed at:

* http://www.nordugrid.org
* http://docs.swegrid.se

In the job description, runtime environments are specified using the runTimeEnvironment attribute. Runtime environments support versioning, so different versions can be specified. If no version is specified, the highest version is chosen. If no specific version is required, the relational operator >= should be used to select the minimum required version. If the environment sets up the path to the application executable on the remote resource, the application executable should not be specified in the executable attribute, but in a script file that is passed as the executable in the XRSL file. An example:

&
(executable="run.sh")
(arguments="inputfile.dat")
(inputFiles=("inputfile.dat" ""))
(outputFiles=("outputfile.dat" ""))
(wallTime=240)
(runTimeEnvironment>="MYAPP-1.42")

run.sh is a script file that uses the paths set up by the runtime environment. An example run.sh is shown below:

#!/bin/sh
myapp "$1"

The myapp executable is made available by the environment set up by the runtime environment MYAPP-1.42.

Job log information

As an aid in debugging grid jobs additional information on the execution of the job can be added to the job results using the gmlog attribute. The attribute defines a directory containing job logs and other useful information for debugging. The following example shows a job description with the gmlog attribute set:

&
(executable=run.sh)
(wallTime="5 minutes")
(stdout="stdout.txt")
(stderr="stderr.txt")
(gmlog="gm.log")

In this example a special directory "gm.log" will be added to the retrieved job directory containing the following files:

  • description - contains the parsed and transformed XRSL description transferred to the resource.
  • diag - front-end and job information.
  • errors - complete log of job activity.
  • input - job input files.
  • local - local job information specific to resource management system.
  • output - job output files.
  • status - job status. FINISHED/FAILED etc.
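
Once a job directory has been retrieved, the files in the gmlog directory can be inspected by a script. The Python sketch below reads the status file. It is a hypothetical helper and assumes that the file holds the job state as plain text, as suggested by the list above. The exact file format is not specified here.

```python
from pathlib import Path


def read_job_status(job_dir, gmlog="gm.log"):
    """Return the contents of the status file in a retrieved job directory.

    Assumption: the status file is plain text holding the job state
    (for example FINISHED or FAILED). Surrounding whitespace is stripped.
    """
    status_file = Path(job_dir) / gmlog / "status"
    return status_file.read_text().strip()
```

The gmlog argument must match the directory name given in the gmlog attribute of the job description.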

Job submission

When the job description has been created and any additional files have been set up, the job can be submitted to a grid resource. In ARC, job submission is done using the ngsub command. The command is similar to the qsub command found on non-grid resources. The job submission procedure can be described in the following steps:

  1. Parse the XRSL definition.
  2. Query the information system for available resources, taking into account any constraints defined in the XRSL definition such as memory, wallTime and runtime environments.
  3. Submit the job to the selected resource, transferring any files local to the submission directory.

The general syntax of the ngsub command is as follows:

ngsub [options] [filename ...]

The most important option is the -f option which defines the job description file to be used for job submission.

To illustrate the job submission process we use the following example descriptions and scripts.

Job description:

&
(executable=run.sh)
(wallTime="5 minutes")
(stdout="stdout.txt")
(stderr="stderr.txt")

Executable script run.sh:

#!/bin/sh
echo "Hello, grid"

The simplest form of submission of this job is shown in the following example:

[user@localhost ex1]$ ngsub -f ex1.xrsl
Job submitted with jobid: gsiftp://siri.lunarc.lu.se:2811/jobs/2817512964675921399075910

If the submission is successful, the command displays the job id of the job. The job id is a URL which uniquely identifies the job and the resource to which it was submitted. The ngsub command also stores the ids of submitted jobs in the $HOME/.ngjobs file for later use by other commands.
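
Since the job id is a URL, a script that wraps ngsub can pick it out of the command output and determine which resource received the job. The Python sketch below assumes the output line has exactly the form shown in the example above. The function name parse_submission is hypothetical.

```python
from urllib.parse import urlparse

# Prefix of the line printed by ngsub in the example above.
PREFIX = "Job submitted with jobid: "


def parse_submission(line):
    """Extract the job id, host and port from an ngsub output line."""
    if not line.startswith(PREFIX):
        raise ValueError("not a submission line: {!r}".format(line))
    job_id = line[len(PREFIX):].strip()
    url = urlparse(job_id)
    return {"job_id": job_id, "host": url.hostname, "port": url.port}
```

Applied to the output line in the example above, the helper reports the host siri.lunarc.lu.se and the port 2811.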

To see more output from the job submission, the -d flag can be used with a parameter giving the level of debug output. It is often enough to use -d 1 to get more useful information, as in the following example:

[user@localhost ex1]$ ngsub -d 1 -f ex1.xrsl
Proxy subject name: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=User/CN=1212121
Proxy valid to: 2011-01-31 22:32:58
Proxy valid for: 10 hours, 32 minutes, 59 seconds
Queue selected: arc@arc-ce.smokerings.nsc.liu.se
File uploaded: /tmp/user/rsl.1lY843
File uploaded: /home/user/usersguide/examples/ex1/run.sh
Job submitted with jobid: gsiftp://arc-ce.smokerings.nsc.liu.se:2811/jobs/213512964716031312926006

The debug output also shows the proxy lifetime, the queue used and which files have been uploaded to the resource.

In some cases the information system on some resources is overloaded. This means that the job submission can get stuck waiting for a response. To limit the time waiting for non-responsive sites, the -t flag can be used to set a timeout in seconds. The following example shows a job submission with the -t flag set to 20 seconds:

[user@localhost ex1]$ ngsub -t 20 -f ex1.xrsl

It is also possible to bypass the resource brokering and submit a job directly to a resource using the -c switch. The -c switch takes the hostname of a resource as input and will only submit to this resource. The -c switch can be given repeatedly to select among multiple resources. It is also possible to reject a specific cluster by adding a minus sign in front of the hostname. The following examples illustrate different ways of using this switch.

Job submission directly to the resource given by siri.lunarc.lu.se:

[user@localhost ex1]$ ngsub -f ex1.xrsl -c siri.lunarc.lu.se

Job submission to all available resources except siri.lunarc.lu.se:

[user@localhost ex1]$ ngsub -f ex1.xrsl -c -siri.lunarc.lu.se

Instead of the default $HOME/.ngjobs job file, a user-defined job list file can be specified using the -o switch. This can be useful when submitting a number of jobs in a parameter sweep:

[user@localhost ex1]$ ngsub -o job_sweep1 -f ex1.xrsl
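
A parameter sweep usually means many near-identical submissions, which is convenient to script. The Python sketch below only builds the ngsub command lines and does not run them. It uses the -o and -e switches from the help text below, and assumes a hypothetical run.sh that takes the sweep index as its argument. The function name sweep_jobs is also hypothetical.

```python
def sweep_jobs(count, joblist="job_sweep1"):
    """Build (job name, ngsub argument list) pairs for a parameter sweep.

    The commands are only constructed, not executed. run.sh is a
    hypothetical script that receives the sweep index as its argument.
    """
    jobs = []
    for index in range(1, count + 1):
        name = "job{:04d}".format(index)
        xrsl = '&(executable="run.sh")(arguments="{}")(jobName="{}")'.format(index, name)
        # -e passes the XRSL as a string, -o stores the job ids in joblist.
        jobs.append((name, ["ngsub", "-o", joblist, "-e", xrsl]))
    return jobs
```

On a machine with the ARC client installed, each argument list could be passed to subprocess.run. Keeping the job ids in a separate list file makes it easy to query or retrieve the whole sweep later.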

Additional options for the ngsub command can be found by using the -h switch:

[user@localhost ex1]$ ngsub -h
Usage: ngsub [options] [filename ...]

Options:
  -c, -cluster   [-]name       explicity select or reject a specific cluster
  -C, -clustlist [-]filename   list of clusters to select or reject
  -g, -giisurl   url           url to a GIIS
  -G, -giislist  filename      list of GIIS urls
  -e, -xrsl      xrslstring    xrslstring describing the job to be submitted
  -f, -file      filename      xrslfile describing the job to be submitted
  -o, -joblist   filename      file where the jobids will be stored
  -D  -dryrun                  add dryrun option
      -dumpxrsl                do not submit - dump transformed xrsl to
                                 stdout
  -t, -timeout   time          timeout in seconds (default 20)
  -d, -debug     debuglevel    from -3 (quiet) to 3 (verbose) - default 0

  ...

Job status information

The status of the submitted job can be queried using the ngstat command which is similar to the qstat or showq commands on normal computational resources. The command takes a job id as input and queries the resource for information on the status of the job. This is shown in the following examples:

[user@localhost ex1]$ ngstat gsiftp://siri.lunarc.lu.se:2811/jobs/261871296472351384107384
Job gsiftp://siri.lunarc.lu.se:2811/jobs/261871296472351384107384
  Status: FINISHED
  Exit Code: 0

This shows the status and exit code of the job, in this case that it has been executed and returned with an exit code of 0.

To query all submitted jobs found in the $HOME/.ngjobs file, the -a switch can be used. Typical output from this command is shown below:

[user@localhost ex1]$ ngstat -a
Job gsiftp://siri.lunarc.lu.se:2811/jobs/2817512964675921399075910
  Status: FINISHED
  Exit Code: 0
Job gsiftp://arc-ce.smokerings.nsc.liu.se:2811/jobs/213512964716031312926006
  Status: INLRMS:Q
Job gsiftp://arc-ce.smokerings.nsc.liu.se:2811/jobs/359712964719111059953303
  Status: INLRMS:Q
Job gsiftp://siri.lunarc.lu.se:2811/jobs/261871296472351384107384
  Status: FINISHED
  Exit Code: 0
Job gsiftp://arc-ce.smokerings.nsc.liu.se:2811/jobs/84411296472389643492645
  Status: INLRMS:Q
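
When many jobs are listed, it can help to summarise the output in a script. The Python sketch below parses text in the short format shown above into a dictionary and counts jobs per status. It is a hypothetical helper written against this example output only, not a stable interface of ngstat.

```python
from collections import Counter


def parse_ngstat(output):
    """Map each job id to its "Key: value" fields from ngstat output."""
    jobs = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Job "):
            current = stripped[len("Job "):]
            jobs[current] = {}
        elif current is not None and ":" in stripped:
            # Split on the first colon only, so "INLRMS:Q" stays intact.
            key, _, value = stripped.partition(":")
            jobs[current][key.strip()] = value.strip()
    return jobs


def count_by_status(jobs):
    """Count jobs per status, e.g. how many are FINISHED."""
    return Counter(info.get("Status", "UNKNOWN") for info in jobs.values())
```

For the listing above, count_by_status reports two FINISHED jobs and three in state INLRMS:Q.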

If more information on the job is needed, the -l switch can be used to provide additional information on the job as in the following example:

[user@localhost ex1]$ ngstat -l gsiftp://siri.lunarc.lu.se:2811/jobs/261871296472351384107384
Job gsiftp://siri.lunarc.lu.se:2811/jobs/261871296472351384107384
  Status: FINISHED
  Exit Code: 0
  Owner: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=User Userson
  Cluster: siri.lunarc.lu.se
  Queue: arc
  Requested Number of CPUs: 1
  Execution Nodes:
    sn001
    sn001.mpi
  stdout: stdout.txt
  stderr: stderr.txt
  Submitted: 2011-01-31 12:12:32
  Completed: 2011-01-31 12:12:41
  Submitted from: 81.230.189.149:48701;localhost.localdomain
  Submitting Client: nordugrid-arc-0.8.3.1
  Required CPU Time: 5 minutes
  Used CPU Time: 0
  Used Wall Time: 1 minute
  Results must be retrieved before: 2011-02-11 03:19:21
  Proxy valid to: 2011-01-31 22:32:58
  Entry valid from: 2011-01-31 12:43:07
  Entry valid to: 2011-01-31 12:44:37

The ngstat command can also be used to query the status of jobs from job lists created with the -o switch of the ngsub command. In ngstat this is accomplished using the -i switch:

[user@localhost ex1]$ ngstat -a -i job_sweep1

The timeout (-t), debug (-d) and cluster (-c) options can be used in the same way as in the ngsub command.

Retrieving finished jobs

When a job has finished executing on a grid resource the job output files and results can be downloaded using the ngget command. The general syntax:

ngget [options] [jobid|jobname]

Retrieving a single job is accomplished by using ngget and the job identifier:

[user@localhost ex2]$ ngget gsiftp://siri.lunarc.lu.se:2811/jobs/126181296507986385553351
Results stored at /home/user/usersguide/examples/ex2/126181296507986385553351
Jobs processed: 1, successfuly downloaded: 1

Retrieving a job by name is done by specifying the job name instead of the job id. In the following example the job description had the jobName attribute set to job0001:

[user@localhost ex3]$ ngget job0001
Results stored at /home/user/usersguide/examples/ex3/69012965093911208764313
Jobs processed: 1, successfuly downloaded: 1

Downloading all jobs in the $HOME/.ngjobs file is done by using the -a switch:

[user@localhost ex3]$ ngget -a
Results stored at /home/user/usersguide/examples/ex3/131291296508001121515985
Results stored at /home/user/usersguide/examples/ex3/133541296508018329511759
Jobs processed: 2, successfuly downloaded: 2

By default, downloaded jobs are stored in directories with the same name as the last part of the job id. A job with the job id gsiftp://siri.lunarc.lu.se:2811/jobs/126181296507986385553351 will be stored in the directory 126181296507986385553351 in the directory where the ngget command was executed. This behavior can be changed using the -dir switch, which creates the downloaded job directories in the directory specified by the switch. The following example will download all jobs to the job_sweep1 directory:

[user@localhost ex3]$ ngget -a -dir ./job_sweep1
Results stored at ./job_sweep1/198512965106411445114039
Results stored at ./job_sweep1/221412965106421926028190
Results stored at ./job_sweep1/265812965106431621700746
Results stored at ./job_sweep1/293912965106452076344440
Jobs processed: 4, successfuly downloaded: 4
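
A script that post-processes results needs to know where ngget placed each job. The Python sketch below derives the expected directory from a job id, following the naming rule described above. The function name download_dir is hypothetical, and the -j variant that uses job names is not covered.

```python
import posixpath
from urllib.parse import urlparse


def download_dir(job_id, base_dir="."):
    """Return the directory where results for job_id are expected.

    base_dir corresponds to the argument of the -dir switch and
    defaults to the current directory.
    """
    last_part = posixpath.basename(urlparse(job_id).path)
    return posixpath.join(base_dir, last_part)
```

With base_dir set to ./job_sweep1, the returned paths have the same form as the "Results stored at" lines above.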

It is also possible to use the job name as the job directory by using the -j switch as shown in the following example:

[user@localhost ex3]$ ngget -a -j -dir ./job_sweep2
Results stored at ./job_sweep2/job0001
Results stored at ./job_sweep2/job0002
Results stored at ./job_sweep2/job0003
Results stored at ./job_sweep2/job0004
Jobs processed: 4, successfuly downloaded: 4

Killing running jobs

If for some reason you need to kill any of the jobs submitted to a resource the ngkill command can be used. In the most basic form the command takes a job id or a job name as input. Killing a job using a job id is shown below:

[jonas@localhost ex4]$ ngkill gsiftp://siri.lunarc.lu.se:2811/jobs/312511297331801573492658
Proxy subject name: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=Jonas Lindemann/CN=792268717
Proxy valid to: 2011-02-10 21:11:16
Proxy valid for: 10 hours, 9 minutes, 20 seconds
Killing job: gsiftp://siri.lunarc.lu.se:2811/jobs/312511297331801573492658
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/312511297331801573492658
Jobs processed: 1, killed: 1, deleted: 1

Killing a job using a job name is done in a similar way:

[jonas@localhost ex4]$ ngkill job0001
Proxy subject name: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=Jonas Lindemann/CN=792268717
Proxy valid to: 2011-02-10 21:11:16
Proxy valid for: 10 hours, 9 minutes, 39 seconds
Killing job: gsiftp://siri.lunarc.lu.se:2811/jobs/3071412973317791892872442
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/3071412973317791892872442
Jobs processed: 1, killed: 1, deleted: 1

To kill all running jobs, the -a switch can be used, as illustrated in the following example:

[jonas@localhost ex4]$ ngkill -a
Proxy subject name: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=Jonas Lindemann/CN=792268717
Proxy valid to: 2011-02-10 21:11:16
Proxy valid for: 10 hours, 4 minutes, 25 seconds
Killing job: gsiftp://siri.lunarc.lu.se:2811/jobs/130621297332339928417805
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/130621297332339928417805
Killing job: gsiftp://siri.lunarc.lu.se:2811/jobs/1324612973323401688656595
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/1324612973323401688656595
Killing job: gsiftp://siri.lunarc.lu.se:2811/jobs/1327812973323411385420097
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/1327812973323411385420097
Killing job: gsiftp://siri.lunarc.lu.se:2811/jobs/1329512973323431857144927
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/1329512973323431857144927
Jobs processed: 4, killed: 4, deleted: 4
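
The final summary line is a convenient place for a wrapper script to check that every job was handled. The Python sketch below parses a summary line of the form shown above. The function names are hypothetical, and the format is taken from these example transcripts only.

```python
def parse_summary(line):
    """Parse "Jobs processed: 4, killed: 4, deleted: 4" into a dict."""
    counts = {}
    for field in line.split(","):
        label, _, number = field.partition(":")
        counts[label.strip()] = int(number)
    return counts


def all_succeeded(counts):
    """True if every counter equals the number of processed jobs."""
    processed = counts["Jobs processed"]
    return all(value == processed for value in counts.values())
```

If all_succeeded returns False, some jobs were processed but not killed or deleted, and the command output should be inspected.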

Custom job lists can also be used by specifying the job list file using the -i switch.

Cleaning job data on resources

The output files and logs from finished jobs are kept for a couple of days on the resource and are eventually erased automatically. If you are not interested in downloading results from jobs, or want to remove old job results, the ngclean command can instruct the resources to clean the data from the jobs. ngclean works in the same way as ngkill. Cleaning a job using a job id can be achieved in the following way:

[jonas@localhost ex4]$ ngclean gsiftp://siri.lunarc.lu.se:2811/jobs/1388112973341962029409606
Proxy subject name: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=Jonas Lindemann/CN=792268717
Proxy valid to: 2011-02-10 21:11:16
Proxy valid for: 9 hours, 22 minutes, 28 seconds
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/1388112973341962029409606
Jobs processed: 1, deleted: 1

Job names can also be used:

[jonas@localhost ex4]$ ngclean job0001
Proxy subject name: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=Jonas Lindemann/CN=792268717
Proxy valid to: 2011-02-10 21:11:16
Proxy valid for: 9 hours, 24 minutes, 45 seconds
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/135081297334194489234891
Jobs processed: 1, deleted: 1

All jobs can be cleaned by using the -a switch:

[jonas@localhost ex4]$ ngclean -a
Proxy subject name: /O=Grid/O=NorduGrid/OU=lunarc.lu.se/CN=Jonas Lindemann/CN=792268717
Proxy valid to: 2011-02-10 21:11:16
Proxy valid for: 9 hours, 16 minutes, 40 seconds
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/137781297334195171034347
Deleting job: gsiftp://siri.lunarc.lu.se:2811/jobs/1394612973341971718841954
Jobs processed: 2, deleted: 2

Custom job lists can also be used by specifying the job list file using the -i switch.