Green Data Oasis Usage
Green Data Oasis (GDO) is a large data store (1.4 PB) on the unrestricted LLNL network. It is intended to facilitate the sharing of scientific data with external collaborators. Support of these capabilities requires the following activities:
- Configuring and accessing the GDO
- Building and installing project software
- Moving data to and from the GDO
- Controlling access to data
- Receiving external data and making it available
- Operational support
The M&IC Program Leaders, the GDO project leader, LC management, and the Laboratory Science and Technology Office (LSTO) will manage policies for usage of and access to the GDO system. The baseline set of procedures, tools, and features will be proposed, discussed, defined, agreed upon, and implemented as part of the GDO GA Release process.
We expect the GDO system to be heavily used, given the large number of projects with data sharing needs and the large amounts of data to be shared. To ensure equitable access for the LLNL community, GDO accounts and disk allocations will be granted through an informal Lab-wide process, with decisions made by the M&IC Program Leaders on behalf of the LSTO.
Once a project has been granted access to the GDO, a ZFS disk partition of the size specified by its allocation will be created. A dedicated virtual machine will also be allocated, and user accounts will be created.
Because of the collaborative purpose of the system and the non-sensitive nature of its data, GDO is located on the green network. Host names will be assigned to VMs to indicate their association with a project. For example, project ABC's virtual host might be called "gdo-abc.llnl.gov". These project host names are aliases for corresponding numeric host names, such as gdo1, gdo2, and so on.
Only U.S.-citizen LLNL project members will be granted login access to the project's VM. Two-factor authentication using RSA tokens will be required for login accounts. Logins must be done with SSH; e.g., ssh gdo-abc.llnl.gov. Note that SSH connections are only accepted from within the llnl.gov domain. If you want to connect from another (non-LLNL) host, you must first SSH to an LLNL host and then SSH from there to the GDO.
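The two-hop login described above can also be written as a single command with OpenSSH's `-J` (ProxyJump) option. The sketch below only builds and prints that command; it does not run it. The user name `jdoe` and the intermediate host `somehost.llnl.gov` are hypothetical, and whether site policy permits jump-host forwarding is an assumption. If it does not, the manual two-step SSH remains the documented route.

```python
import shlex


def ssh_via_jump(user, jump_host, target_host):
    """Build an ssh argv that reaches target_host through jump_host.

    Uses OpenSSH's -J (ProxyJump) option. Authentication, including any
    RSA token prompt, still happens interactively when ssh is run.
    """
    return ["ssh", "-J", f"{user}@{jump_host}", f"{user}@{target_host}"]


# Hypothetical user and intermediate LLNL host; gdo-abc.llnl.gov is the
# example project host name used in the text.
cmd = ssh_via_jump("jdoe", "somehost.llnl.gov", "gdo-abc.llnl.gov")
print(shlex.join(cmd))
# prints: ssh -J jdoe@somehost.llnl.gov jdoe@gdo-abc.llnl.gov
```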
External collaborators will not have login accounts. They will be able to retrieve data using FTP, HTTP, or other approved protocols, and will be able to upload data in certain circumstances if the project has enabled this capability. Local project members without login accounts can use the system in the same way.
Because the GDO is on the green network, all interactions between it and resources on the yellow network must be initiated from within the yellow network. Data transfers involving LC systems must therefore originate on the LC system.
The following table shows some of the key directories in each GDO VM:
|Directory||Description|
|/export/ftp/pub||ZFS area visible to anonymous FTP users. This is where the majority of a project's data will reside.|
|/export/ftp/incoming||An optional ZFS area into which external collaborators will be able to upload data. This is only available via the GDO upload VM (gdo1.llnl.gov).|
|/usr/apps||Project directory for software and information to be shared amongst all users of a project VM.|
Projects will be able to install project-specific software in the /usr/apps directory. Note that software that provides connection or data transfer services must first be approved by the GDO project leader. Client software does not require approval but may require modifications to the GDO firewall if communication is done on a non-standard port.
Standard TOSS compiler and development tools are provided.
Data moves to and from the GDO along three major paths, connecting it with: (1) LC production hosts on the OCF network, including oslic and rzslic; (2) project hosts on LLNL's unrestricted network; and (3) offsite collaborator hosts.
The GDO is on its own segment of the green network, a direct offshoot from the Lab's ESNet router with a 10 Gb/s network connection. The GDO also has access to ESNet's circuit-oriented Science Data Network (SDN) for massive dataset transfers to or from other ESNet sites.
Data movement is a challenging problem, in large part because of the amount of data the GDO will be handling. Datasets are now routinely measured in terabytes. There are two significant issues in moving large amounts of data: (1) bandwidth, and (2) tools for achieving high end-to-end throughput. Funds and network engineering can mostly solve the bandwidth issue. To solve the throughput issue, it is critical to provide tools that use parallel data streams when moving data. This is the only approach that can work around the per-stream limits imposed by network links, firewalls, and various I/O devices. Parallel transfer options are being investigated for all major data paths involving the GDO so that large aggregate throughput can be achieved for transfers of large files.
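To make the idea of parallel data streams concrete, the sketch below splits one large file into byte ranges and moves each range in its own worker. It is an illustration only: a local file copy stands in for the network transfer, and the function names are hypothetical. It does not describe how GridFTP, BBFTP, or any GDO tool is implemented.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_bytes, streams):
    """Return (offset, length) pairs that cover total_bytes in up to
    `streams` nearly equal, non-overlapping parts."""
    if total_bytes < 0 or streams < 1:
        raise ValueError("total_bytes must be >= 0 and streams >= 1")
    base, extra = divmod(total_bytes, streams)
    ranges, offset = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        if length == 0:
            break  # fewer bytes than streams: remaining parts are empty
        ranges.append((offset, length))
        offset += length
    return ranges


def copy_range(src, dst, offset, length, bufsize=1 << 20):
    """Copy one byte range; each worker uses its own file handles."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(offset)
        fout.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = fin.read(min(bufsize, remaining))
            if not chunk:
                break  # unexpected end of file
            fout.write(chunk)
            remaining -= len(chunk)


def parallel_copy(src, dst, streams=4):
    """Copy src to dst using several concurrent range copies."""
    size = os.path.getsize(src)
    with open(dst, "wb") as f:
        f.truncate(size)  # pre-size the destination so ranges can seek
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_range, src, dst, off, length)
                   for off, length in split_ranges(size, streams)]
        for fut in futures:
            fut.result()  # re-raise any worker error


print(split_ranges(10, 3))  # prints: [(0, 4), (4, 3), (7, 3)]
```

In a real transfer tool, each range would travel over its own network connection, so a limit that applies to a single stream no longer caps the aggregate rate.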
A project's large disk space is available in the directory /export/ftp/pub. Project members with login accounts will be able to populate this space with project data to be shared.
GDO's network link to the outside world is via a 10 Gb/s connection to ESNet. Currently, the GDO firewall bottlenecks individual transfer streams at 1.7 Gb/s. The proper utilization of parallel streams when moving data can improve aggregate transfer rates significantly.
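A quick back-of-the-envelope calculation shows why parallel streams matter with these figures. The sketch below uses only the two numbers quoted above (a 10 Gb/s link and a 1.7 Gb/s per-stream firewall limit). It assumes ideal conditions with no protocol overhead and takes 1 TB as 10^12 bytes, so the results are upper bounds rather than measured GDO rates.

```python
import math

LINK_GBPS = 10.0        # ESNet connection, from the text
PER_STREAM_GBPS = 1.7   # per-stream firewall limit, from the text


def aggregate_gbps(streams):
    """Ideal aggregate rate: streams add up until the link is full."""
    return min(streams * PER_STREAM_GBPS, LINK_GBPS)


def streams_to_fill_link():
    """Smallest number of streams whose combined rate reaches the link rate."""
    return math.ceil(LINK_GBPS / PER_STREAM_GBPS)


def transfer_seconds(terabytes, streams):
    """Ideal time to move `terabytes` (1 TB = 1e12 bytes, an assumption)."""
    bits = terabytes * 8e12
    return bits / (aggregate_gbps(streams) * 1e9)


print(streams_to_fill_link())                   # prints: 6
print(round(transfer_seconds(1, 1) / 60, 1))    # prints: 78.4 (minutes, one stream)
print(round(transfer_seconds(1, 6) / 60, 1))    # prints: 13.3 (minutes, six streams)
```

Under these assumptions, a single stream needs well over an hour to move 1 TB, while six streams could in principle fill the link and finish in under a quarter of an hour.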
Transfers between the GDO and other LLNL networks will be limited by the other networks' firewalls, routers, etc. Currently, the unrestricted and yellow networks run at 1 Gb/s, so that will be the limiting factor in transfers involving hosts on those networks.
Tools for Transferring Data
For moving data to the GDO, all major data paths will support some form of FTP for transfers. For users with login accounts, SFTP and SCP are also options.
For retrieving data from the GDO, several protocols such as variants of FTP, HTTP, and NFS are supported. When transferring entire files locally or remotely, the user can expect the best performance from FTP for single-thread transfers. For hosts that have enabled GridFTP access, remote clients can improve parallelism by using a utility like Bulk Data Mover (BDM). Read-only file access is also available using NFS to other hosts on the unrestricted network; expected throughput is likely to be 25 MB/s or less initially. Coordination with the GDO project leader is required if a project wishes to NFS export its disk space to another host on the unrestricted network.
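As an illustration of single-stream retrieval over anonymous FTP, the following sketch uses Python's standard `ftplib`. The host name `gdo-abc.llnl.gov` is the example from earlier in this document. The remote file name `pub/example.dat`, and the assumption that `/export/ftp/pub` appears to FTP clients as `pub`, are hypothetical.

```python
from ftplib import FTP


def fetch_file(ftp, remote_path, local_path, blocksize=1 << 20):
    """Download remote_path over an already logged-in FTP session."""
    with open(local_path, "wb") as out:
        ftp.retrbinary(f"RETR {remote_path}", out.write, blocksize=blocksize)
    return local_path


def download_from_gdo(host, remote_path, local_path):
    """Open an anonymous FTP session to `host` and fetch one file."""
    with FTP(host) as ftp:
        ftp.login()  # no arguments: anonymous login
        return fetch_file(ftp, remote_path, local_path)
```

A call such as `download_from_gdo("gdo-abc.llnl.gov", "pub/example.dat", "example.dat")` would fetch one file in a single stream. Projects that restrict access by host or by virtual user would need the matching host or credentials instead of an anonymous login.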
Projects may request additional or alternative protocols (GridFTP, SRB, etc.) for providing data access to external collaborators. These requests should be made to the GDO project leader.
For LC production hosts on the yellow network, the standard FTP and PFTP clients can be used for moving data to the GDO, but note that all such transfers are done serially, because PFTP cannot do parallel transfers through a firewall. Other tools for doing parallel transfers in this environment are being investigated; GridFTP and BBFTP are two candidates.
The GDO has approximately 1.4 PB of RAID storage, which is split amongst the various projects with GDO accounts. Due to size constraints, this data is not backed up. Users are strongly urged to back up data to other devices (such as HPSS archival storage). Although measures have been taken to prevent data loss, unexpected system failures, power outages, or other problems will eventually cause a loss of data.
A small amount of local disk space is also available and is used for home directories and system data and information. Local home and system directories will be backed up regularly.
Although all data on the GDO is required to be approved for release for general distribution, some projects may wish to control access to their data. Options for controlling access include:
|FTP Access Control||Description of External Access|
|None||Project data is world readable via anonymous FTP.|
|Host||Project data is accessible via anonymous FTP only to those hosts explicitly listed.|
|Host + Time||Restricts uploads to the hosts explicitly listed and to the window of time specified by the project. This option is mandatory when using the upload capability. It is not used for controlling download access.|
|Virtual Users||Project data is accessible via FTP only to virtual users explicitly defined by the project. The information regarding these virtual users is managed by the project itself and generally consists of a username/password pair stored in a UNIX DBM file. These virtual users log in as normal users but are restricted in much the same way as are anonymous users.|
Other protocols will have their own access control options, but for the most part they will be similar to these FTP options.
Contact the LC Hotline or GDO project leader if you have questions about enabling restrictions.
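The logic of the Host + Time option in the table above can be summarized as a simple check: an upload is accepted only if the connecting host is listed and the connection falls inside the project's time window. The sketch below is purely illustrative. The function name, host names, and dates are hypothetical, and it does not represent the GDO's actual access control tool.

```python
from datetime import datetime


def upload_allowed(host, when, allowed_hosts, window_start, window_end):
    """Return True only if `host` is listed AND `when` is inside the window."""
    listed = host.strip().lower() in {h.strip().lower() for h in allowed_hosts}
    in_window = window_start <= when <= window_end
    return listed and in_window


# Hypothetical example values
hosts = ["data.collab.example.org"]
start = datetime(2024, 1, 10, 8, 0)
end = datetime(2024, 1, 12, 17, 0)

# Listed host, inside the window
print(upload_allowed("data.collab.example.org",
                     datetime(2024, 1, 11, 9, 30), hosts, start, end))  # prints: True
# Unlisted host, inside the window
print(upload_allowed("other.example.org",
                     datetime(2024, 1, 11, 9, 30), hosts, start, end))  # prints: False
# Listed host, after the window has closed
print(upload_allowed("data.collab.example.org",
                     datetime(2024, 1, 13, 9, 0), hosts, start, end))   # prints: False
```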
The GDO allows collaborators and project members without login accounts to upload data. Projects that need this capability should contact the GDO project leader. A separate zone named "gdo1.llnl.gov" has been created for this purpose. Its disk layout is similar to that of a download VM, but it also has an /export/ftp/incoming directory into which externally uploaded data will go.
Once the upload zone has been created, follow these steps to receive external data and make it available to others:
- The LLNL project member (data custodian) uses the access control tool on the GDO to specify the collaborator's host name from which the upload will occur.
- The remote collaborator uses anonymous FTP to connect to gdo1.llnl.gov from the specified host.
- The remote collaborator uploads data into the /incoming directory. Note that this directory has write access but not read access; therefore, neither the collaborator nor other remote users will be able to see the contents of this directory.
- The LLNL data custodian reviews the contents of /export/ftp/incoming and confirms that the data are as expected. See the rules in the "Nature of Data Allowed on the GDO" section for the type of external data that is allowed on the GDO.
- If the data are confirmed as valid, the LLNL data custodian can then move them into the externally accessible project data space (/export/ftp/pub) in the project's download VM if so desired.
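The collaborator's part of the sequence above (connecting by anonymous FTP and uploading into the incoming directory) might look like the following sketch using Python's standard `ftplib`. The host `gdo1.llnl.gov` and the `incoming` directory come from the text; the local file name is hypothetical. Because the directory is write-only, the code does not try to list it afterward to confirm the upload.

```python
import os
from ftplib import FTP


def upload_file(ftp, local_path, incoming_dir="incoming"):
    """Upload one local file into the write-only incoming directory
    of an already logged-in FTP session. Returns the remote file name."""
    name = os.path.basename(local_path)
    ftp.cwd(incoming_dir)
    with open(local_path, "rb") as src:
        ftp.storbinary(f"STOR {name}", src)
    return name


def upload_to_gdo(local_path, host="gdo1.llnl.gov"):
    """Anonymous upload; must run on the host the data custodian registered."""
    with FTP(host) as ftp:
        ftp.login()  # no arguments: anonymous login
        return upload_file(ftp, local_path)
```

A call such as `upload_to_gdo("results_2024.tar")` would send one file. The collaborator should then notify the LLNL data custodian, since only the custodian can inspect the contents of the incoming directory.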
Operational support for this system will include integrated hotline support, timely dissemination of operational information (e.g., scheduled and unscheduled machine or network downtime), and training and documentation that meets the needs of local and remote users.
The LC Hotline is fully staffed from 8:00 a.m.–noon and 1:00–4:45 p.m. Pacific Time and can be accessed by telephone or e-mail. Outside of these hours, callers will have the option of leaving a message or being forwarded to the LC Operations staff, which is present 24x7, 365 days a year. E-mail received outside of business hours will be processed the next business day. In urgent off-hour situations, send e-mail to firstname.lastname@example.org if phone service is unavailable.
Remote collaborators should contact LLNL project collaborators first with questions and requests for assistance.