LC earlier notified users about a Lustre file size mismatch problem (aka the 4k block size problem) that resulted in inflated file sizes and Lustre server instability. NOTE: Although the software bug causing the issue has been fixed, fixing files affected by the bug is critical.

To assist in the cleanup from the bug, LC has generated a list of affected files for each user and project directory on the OCF and SCF Lustre file systems. These lists are under each Lustre "home" and group/project directory (e.g. /p/lustre1/johndoe2 or /p/lustre1/<projectname>). The lists include those files that still need to be fixed as of the date in the filename.

There is one list per (user|project + file-owner-username) combination, named "size-mismatch-affected-files.<username>.2025-aug-XX.txt". For example, if Jane Doe is part of "project42", she would have one list for her own Lustre directory and a separate one for her files within the project42 directory, and these lists would be readable only by her:

  • /p/lustre1/janedoe1/size-mismatch-affected-files.janedoe1.2025-aug-17.txt
  • /p/lustre1/project42/size-mismatch-affected-files.janedoe1.2025-aug-17.txt

LC has deployed a new utility, /usr/bin/find-file-size-mismatches, which checks each file on the list, reports the files that haven’t yet been fixed, and optionally fixes them. The option to fix files is "--replace". It does this, essentially, by:

  cp <problem_file> <temp_file>
  rm <problem_file>
  mv <temp_file> <problem_file>

The prior guidance to users, to perform the above steps manually, is still accurate; this tool is intended to make the process easier.
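
For users who prefer the manual route, the three steps above can be wrapped in a small shell function. This is a minimal sketch, not the utility's actual implementation: the name replace_file is hypothetical, and it differs from the steps above in two ways that are assumptions on our part. It verifies the copy with cmp before touching the original, and it lets mv overwrite the original instead of removing it first. Like the utility, it is only safe if nothing else is reading or writing the file.

```shell
#!/bin/bash
# Hypothetical helper (not part of the LC utility): rewrite one file by
# copying it, verifying the copy, and renaming the copy over the original.
# Assumes no other process is using the file while this runs.
replace_file() {
    local src=$1
    local tmp
    # Create the temporary copy in the same directory, so the final mv is a
    # rename within the same file system.
    tmp=$(mktemp "$(dirname -- "$src")/.replace.XXXXXX") || return 1
    # cp -p preserves mode, ownership (where permitted), and timestamps.
    if cp -p -- "$src" "$tmp" && cmp -s -- "$src" "$tmp"; then
        # Rename the verified copy over the original.
        mv -f -- "$tmp" "$src"
    else
        # Copy or verification failed: discard the copy, keep the original.
        rm -f -- "$tmp"
        echo "replace_file: copy of '$src' failed; original left untouched" >&2
        return 1
    fi
}
```

It would be called as: replace_file /p/lustre1/janedoe1/test.dd.36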

WARNING: /usr/bin/find-file-size-mismatches is not safe for concurrent access of files. The user must ensure the files and directories modified by this script are not being modified or used by other processes, including multiple instances of /usr/bin/find-file-size-mismatches.
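
One simple way to avoid running multiple instances at once is to process your lists strictly one after another from a single shell. The following is a sketch under assumptions: the function name fix_lists_in_sequence and the TOOL variable are hypothetical, and only the "--replace" option described above is used. It does not detect other processes using your files; that remains the user's responsibility.

```shell
#!/bin/bash
# Hypothetical wrapper: run the utility on several lists, one at a time.
# TOOL may be overridden (e.g. for testing); the default is the path
# given in this notice.
TOOL=${TOOL:-/usr/bin/find-file-size-mismatches}

fix_lists_in_sequence() {
    local list
    for list in "$@"; do
        if [ ! -r "$list" ]; then
            echo "skipping unreadable list: $list" >&2
            continue
        fi
        echo "processing $list"
        # Each invocation finishes before the next one starts, so two
        # instances never run at the same time from this shell.
        "$TOOL" --replace "$list" || return 1
    done
}
```

For the example user above, this would be called with both of her lists as arguments: the one in her own directory and the one in the project42 directory.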

In our tests, this tool fixed files at a rate of about 100k files per hour or 55 GB per hour. However, performance will vary with your file sizes and how busy the file system is. We recommend using this tool if you have fewer than 500k files to fix. You can stop the tool with "kill" or Ctrl-C and run it again later; it will resume with any files not yet fixed.
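
To get a rough idea of how long a run might take, the two test rates above can be turned into a quick estimate. This is only a back-of-the-envelope sketch: the function name estimate_hours is hypothetical, and taking the larger of the two figures is our assumption, since the tests do not say how file count and data volume combine.

```shell
#!/bin/bash
# Hypothetical estimator: estimate_hours FILE_COUNT TOTAL_GB
# Uses the approximate test rates quoted in this notice and prints the
# larger of the two estimates, in hours, to one decimal place.
estimate_hours() {
    awk -v files="$1" -v gb="$2" 'BEGIN {
        by_count = files / 100000   # about 100k files per hour
        by_size  = gb / 55          # about 55 GB per hour
        hours = (by_count > by_size) ? by_count : by_size
        printf "%.1f\n", hours
    }'
}

estimate_hours 250000 20    # prints 2.5
```

The file count can be taken from the tool's listing output (run without "--replace"), which prints one path per line.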

The new utility logs its actions to the user’s Lustre directory at

/p/<lustre_fs>/<user_or_group>/.size-mismatch-affected-files.<user>.<date>.<uid>.log

An example, for user janedoe1 on CZ file system lustre1:

# if run just to list files that need to be fixed
(oslic8):~$ /usr/bin/find-file-size-mismatches  /p/lustre1/janedoe1/size-mismatch-affected-files.2025-aug-17.txt
/p/lustre1/janedoe1/test.dd.36
/p/lustre1/janedoe1/test.dd.37

# run to fix all files for user janedoe1 on lustre1
(oslic8):~$ /usr/bin/find-file-size-mismatches --replace  /p/lustre1/janedoe1/size-mismatch-affected-files.2025-aug-17.txt
2025-08-19T17-56-44 - find-file-size-mismatches:159 - INFO: starting replacement of '/p/lustre1/janedoe1/test.dd.36'
2025-08-19T17-56-45 - find-file-size-mismatches:221 - INFO: '/p/lustre1/janedoe1/test.dd.36' replaced: usage reduced by approximately 27136 bytes
2025-08-19T17-56-45 - find-file-size-mismatches:159 - INFO: starting replacement of '/p/lustre1/janedoe1/test.dd.37'
2025-08-19T17-56-45 - find-file-size-mismatches:221 - INFO: '/p/lustre1/janedoe1/test.dd.37' replaced: usage reduced by approximately 27136 bytes

# now there should be no more files which need to be fixed
(oslic8):~$ /usr/bin/find-file-size-mismatches /p/lustre1/janedoe1/size-mismatch-affected-files.2025-aug-17.txt

# and the log file shows what was done
(oslic8):~$ head -n1 /p/lustre1/janedoe1/.size-mismatch-affected-files.2025-aug-17.28153.log
2025-08-19T17-56-44 - find-file-size-mismatches:159 - INFO: starting replacement of '/p/lustre1/janedoe1/test.dd.36'
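
# The log can also be summarized. The sketch below is not part of the
# utility: total_bytes_reclaimed is a hypothetical helper that assumes the
# "usage reduced by approximately N bytes" line format shown above and adds
# up N across the whole log.

```shell
#!/bin/bash
# Hypothetical helper: total_bytes_reclaimed LOGFILE
# Sums the byte counts from "usage reduced by approximately N bytes" lines.
total_bytes_reclaimed() {
    awk '/usage reduced by approximately/ {
             total += $(NF - 1)     # the number just before "bytes"
         }
         END { printf "%.0f\n", total }' "$1"
}
```

# For the two replacements logged in the example above, this would report
# 54272 (2 x 27136 bytes).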

This new utility is single-threaded, so it is too slow for users with very large numbers of affected files or very large files (roughly 10 TB or more). We recommend mpiFileUtils for these users; please contact the LC Hotline if you need assistance.

We are asking individual users to run the tool themselves in order to reduce the coordination required and the chance that a misunderstanding results in unsafe concurrent access.


Notes:

We are working on some statistics for number of files per user-or-project and number of files by file-size, so the Hotline (or someone) can reach out to those users for whom the tool is too slow.

We are working on a multithreaded tool, to help a larger user set. Our goal with this release is to let people make forward progress while we improve the tools.

We will gather data on affected files again in the future. This will let us know which files haven’t yet been addressed, so we can update users.