Performing Aggregated Copies

The aggregated copy operation is a hybrid of a normal copy and an HTAR archive creation. It is the optimum way to move large amounts of data from LC production hosts to HPSS. You define a set of rules which describe what parts of your directory tree should be aggregated into HTAR archives, and the remaining parts are transferred with normal copies.

Why is this the optimum way to move your data to HPSS? Let's consider the alternatives. One option is to do a straight copy of your entire directory structure into HPSS. This is fine if you only have a few files or if your files are all multi-GB in size; otherwise, there is a tremendous time and HPSS resource overhead in copying files this way. The other alternative is to use HTAR to aggregate your files into an indexed tar file in HPSS. Because HTAR transfers data in parallel and avoids the per-file connections required by the individual-copy approach, it is 10+ times faster than the first option. The drawback of HTAR is that for best results you should keep the final archive to 300GB or less. If you have a giant directory structure with millions of files and TBs of data, putting it all into a single HTAR file will make it difficult to retrieve data later.

Hopper's aggregated copy operation breaks up massive transfers into pieces you define, such that you get the performance benefit of using HTAR but without the risks of putting everything into one massive HTAR file. These pieces are described using patterns that define the contents of each HTAR file to be created. For example, you can use a pattern such that each cycle's restart directory goes into its own HTAR file. (See examples section below.)

Aggregated Copy Dialog

After you select "Aggregated Copy Here..." from the drop menu, a dialog is displayed that lets you describe in detail how the operation will function. The upper part of the dialog lets you describe the rules for defining what goes into each HTAR archive. These rules can be saved and re-used later. Also, various options for modifying the standard behavior of the operation can be set. For convenience, various rule sets have been pre-defined.

Htar Archive Name

Click on the outer "+" button to add an HTAR archive definition. The field "Htar archive name" contains the name of the archive to create. For example, if you want all of your C source files to go into one HTAR archive, you could enter "c_files.htar" as the name. There are two special variables one can use in this name field to indicate that a series of HTAR files should be created: %c is a substitute for a string of digits (i.e., cycle number), and %n is a substitute for the name matched by the File/Directory elements. See the examples section for more details.

Htar Archive Elements

Click on the inner "+" button to add elements to the HTAR archive definition. These are the files and/or directories that make up the contents of the HTAR file. The field can contain normal Unix wildcard patterns, plus the %c special variable defined above (see the sketch after this list):

  • * — Matches zero or more instances of any character
  • ? — Matches one instance of any character
  • [...] — Matches any of the characters enclosed by the brackets. If the first character following the opening bracket is ^, then any character not in the character class is matched. A - between two characters can be used to denote a range.
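
As a rough illustration of these wildcard rules, here is a short sketch using Python's fnmatch module (this is only an illustration of the matching semantics, not anything Hopper itself runs; the file names are made up):

    from fnmatch import fnmatch

    # * matches zero or more characters, ? exactly one, [...] a character class.
    print(fnmatch("input.txt", "*.txt"))          # True
    print(fnmatch("run7", "run?"))                # True
    print(fnmatch("run42", "run?"))               # False: ? matches exactly one character
    print(fnmatch("abc3.root", "abc[0-9].root"))  # True: [0-9] denotes a range
    # Note: Python's fnmatch spells class negation as [!...] rather than [^...].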

Managing Rules

The "Load Set", "Save Set", and "Delete Set" buttons are for managing the rules definitions created in the Aggregates section of the dialog. To store a rule set for future use, click "Save Set". To load a previously saved set, click "Load Set". To remove a rule set definition, click "Delete Set". Note that the most recently used rule is pre-loaded when the Aggregated Copy dialog is displayed. You can switch to a different rule with "Load Set". If you prefer not to have a rule automatically loaded in this way, you can disable the behavior with a user preference at File->Preferences->Operations->Aggregated Copy.

Exclude

The Exclude field allows you to specify a blank-delimited list of glob (Unix wildcard) patterns for matching items that should be excluded from the aggregated copy. For example, if you want to ignore core files and Emacs back-up files, use "core *~" in this field. Note that these exclusions apply to non-aggregated files and to top-level items that would be aggregated. Exclusion does not apply to items in sub-directories of aggregated directories.
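
As a sketch of how such a blank-delimited exclude field behaves (an illustration of the rule described above, not the tool's own code; the item names are hypothetical):

    from fnmatch import fnmatch

    patterns = "core *~".split()   # the blank-delimited Exclude field

    # Hypothetical top-level items being considered for the copy.
    items = ["core", "input.txt", "main.c~", "results"]

    kept = [name for name in items
            if not any(fnmatch(name, pat) for pat in patterns)]
    print(kept)   # -> ['input.txt', 'results']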

Options

Several options allow you to control the behavior of the aggregated copy operation:

  • Synchronize: on by default. Prevents existing files from being re-copied unless the timestamp of the file in the source directory is newer than the one in HPSS (see the sketch after this list).
  • Aggregate Top Level: on by default. Causes the aggregation rules to be applied to the top-level items, i.e., the items you selected and dragged over to the storage window.
  • Dry Run: off by default. Lets you see the result of performing the selected operation without actually performing any copies. You can see the results in the Transfer Manager dialog, just like any other copy. This option is especially useful when you're putting together new rule sets.
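
A minimal sketch of the Synchronize decision described above (the newer-timestamp test comes from that description; the function and its arguments are hypothetical, not part of Hopper):

    import os

    def needs_copy(source_path, hpss_mtime):
        """Re-copy an existing file only if the source is newer.
        hpss_mtime is the modification time recorded for the copy in HPSS."""
        return os.path.getmtime(source_path) > hpss_mtime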

Note that the initial settings of these options can be controlled via user preferences; see File->Preferences->Operations->Aggregated Copy.

Examples

Example 1 - Simulation Output with Cycle Numbers

run1/
        abc00000/  
        abc00000.root
        abc00000_vis/
        abc00001/
        abc00001.root
        abc00001_vis/
        ...
        abc00555/
        abc00555.root
        abc00555_vis/
        input.txt
        errors.txt

Each of the "abc" directories, such as abc00000, contains hundreds or thousands of files. The objective is to gather all results from a single cycle into one HTAR archive. Thus, everything from cycle 0 (the directory abc00000, the directory abc00000_vis, and the file abc00000.root) would go into one HTAR file, everything from cycle 1 into another, and so on. Because the entire run1 directory may contain many TBs of data, the aggregated copy is an ideal way to storage this information.

How to describe this in the aggregated copy dialog:

  Htar archive name:  abc%c.htar   (the %c designates the cycle number and expands to a contiguous string of digits)
  Directory matching: abc%c        (the %c cycle number matches the one used in the archive name)
  Directory matching: abc%c_vis    (the %c cycle number matches the one used in the archive name)
  File matching:      abc%c.root   (the %c cycle number matches the one used in the archive name)
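
To make the effect of this rule set concrete, here is a small Python sketch (an illustration of the grouping only, not Hopper's implementation; the item list is a made-up subset of run1):

    import re
    from collections import defaultdict

    def rule_regex(pattern):
        # Treat %c as a run of digits (the cycle number); the rest is literal.
        parts = [re.escape(p) for p in pattern.split("%c")]
        return re.compile(r"(\d+)".join(parts) + "$")

    # Element patterns from the rule set above.
    rules = [rule_regex(p) for p in ["abc%c", "abc%c_vis", "abc%c.root"]]

    # A made-up subset of the top-level items in run1.
    items = ["abc00000", "abc00000.root", "abc00000_vis",
             "abc00001", "abc00001.root", "abc00001_vis",
             "input.txt", "errors.txt"]

    archives = defaultdict(list)   # archive name -> member items
    plain_copies = []              # items handled as normal copies

    for name in items:
        for rx in rules:
            match = rx.match(name)
            if match:
                archives["abc%c.htar".replace("%c", match.group(1))].append(name)
                break
        else:
            plain_copies.append(name)

    print(dict(archives))
    # {'abc00000.htar': ['abc00000', 'abc00000.root', 'abc00000_vis'],
    #  'abc00001.htar': ['abc00001', 'abc00001.root', 'abc00001_vis']}
    print(plain_copies)            # ['input.txt', 'errors.txt']

Each cycle's directories and .root file end up in their own archive, while input.txt and errors.txt fall through to normal copies.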

Example 2 - Each Sub-Directory into its own HTAR Archive

dir/
        docs/
        input/
        output/  
        sources/
        file1
        file2

In this example we want each sub-directory to be written into its own HTAR archive. This makes use of another special variable, %n, which maps to the name matched by the Unix wildcard.

How to describe this in the aggregated copy dialog:

  Htar archive name:  %n.htar   (the %n designates the name of the directory matched by the rule below, so each HTAR file is named after its directory)
  Directory matching: *         (i.e., match any directory)

For this set of rules, it is important to pay attention to the value of the "Aggregate Top Level" option. If you drag "dir" to storage and have "Aggregate Top Level" checked, then you'll end up with a single HTAR file "dir.htar" containing everything. You should either uncheck the "Aggregate Top Level" option if you want to drag "dir", or else leave the box checked and drag the contents of "dir" instead (i.e., "docs," "input," "output," etc.). This latter operation will result in "docs.htar," "input.htar," "output.htar," and "sources.htar," as well as some individual file copies.
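
As an illustrative sketch of that difference (not the tool's code; the names come from the listing above):

    from fnmatch import fnmatch

    def plan(items, dirs):
        """Map each dragged item to an archive name (via the %n rule above)
        or to a normal copy, following the single 'Directory matching *' rule."""
        mapping = {}
        for name in items:
            if name in dirs and fnmatch(name, "*"):
                mapping[name] = "%n.htar".replace("%n", name)
            else:
                mapping[name] = "normal copy"
        return mapping

    # Dragging the contents of dir with "Aggregate Top Level" checked:
    print(plan(["docs", "input", "output", "sources", "file1", "file2"],
               dirs={"docs", "input", "output", "sources"}))
    # {'docs': 'docs.htar', 'input': 'input.htar', 'output': 'output.htar',
    #  'sources': 'sources.htar', 'file1': 'normal copy', 'file2': 'normal copy'}

    # Dragging "dir" itself with the box checked matches only the one
    # top-level directory, so everything lands in a single dir.htar:
    print(plan(["dir"], dirs={"dir"}))   # {'dir': 'dir.htar'}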

Notes

Note that the output from the aggregated copy contains summary information about the operation, including the number of files written, the number of HTAR archives created, and so on. Importantly, any HTAR archives that are overwritten, e.g., as the result of using the "update" mode of aggregated copy, are reported with a line such as:

(I) Overwriting existing archive: /users/u02/jwlong/junk/test.htar: old size = 11651072, new size = 10468864

If the size of an HTAR archive shrank considerably, this could be the result of files having been purged or intentionally removed from the source directory. Be sure to monitor this output if you are concerned that your files may have been purged. If you do find that an HTAR archive was overwritten with an unintentionally smaller subset, the original archive can be retrieved from the Hopper trash can in HPSS, located at ~/.HopperTrashCan in your storage home directory.
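
If you want to scan an operation's output for suspicious shrinkage automatically, a small filter along these lines could flag such overwrites (this assumes only the message format shown above; the one-half threshold is an arbitrary choice):

    import re
    import sys

    # Matches the overwrite message format shown above.
    OVERWRITE = re.compile(r"\(I\) Overwriting existing archive: (?P<path>.+): "
                           r"old size = (?P<old>\d+), new size = (?P<new>\d+)")

    for line in sys.stdin:
        m = OVERWRITE.search(line)
        if m and int(m.group("new")) < 0.5 * int(m.group("old")):
            print("archive shrank by more than half:", m.group("path"))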