To assist in the cleanup of the Lustre file size mismatch problem (aka the 4k block size problem), we have generated a list of affected files for each user or project directory on the OCF Lustre file systems. These lists are under each user’s “Lustre Home” directory (e.g. /p/lustre1/johndoe2), named “size-mismatch-affected-files.2025-aug-XX.txt”. Each list includes the files that had not been fixed as of the date in its filename.
We have a new utility, /usr/bin/find-file-size-mismatches, which checks each file on the list and reports the files that haven’t yet been fixed.
This utility has an option, “--replace”, to fix the files that need it. It does this, essentially, by:
    cp <problem_file> <temp_file>
    rm <problem_file>
    mv <temp_file> <original_name>
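As an illustration, the three steps above can be wrapped in a small shell function. This is a minimal sketch under stated assumptions, not the utility's actual implementation: the function name replace_file, the temporary-file suffix, the use of cp -p to keep mode and timestamps, and the cmp check before removal are all additions made for this example.

```shell
# Sketch of the copy / remove / rename sequence for a single file.
# Assumptions (not part of the original guidance): the temp-file suffix,
# "cp -p", and the "cmp" verification step.
# Like the utility, this is NOT safe if anything else is using the file.
replace_file() {
    local problem_file=$1
    local temp_file="${problem_file}.size-fix-tmp.$$"   # hypothetical temp name

    # Step 1: copy the data to a new file in the same directory.
    cp -p -- "$problem_file" "$temp_file" || return 1

    # Extra safety check: do not delete anything unless the copy is identical.
    if ! cmp -s -- "$problem_file" "$temp_file"; then
        rm -f -- "$temp_file"
        return 1
    fi

    # Steps 2 and 3: remove the original, then give the copy the original name.
    rm -- "$problem_file" && mv -- "$temp_file" "$problem_file"
}
```

Calling replace_file with the path of one affected file rewrites that file in place; the function returns non-zero and leaves the original untouched if the copy or the comparison fails.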
The prior guidance to users, to perform the above steps manually, is still accurate. This tool is intended to make the process easier.
WARNING: /usr/bin/find-file-size-mismatches is not safe for concurrent access. The user must ensure that the files and directories modified by this tool are not being modified or used by other processes, including other instances of /usr/bin/find-file-size-mismatches.
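One way to avoid accidentally starting two runs against the same list is a simple lock wrapper. The sketch below is an assumption-laden illustration, not a feature of the utility: the function name run_exclusively and the lock-directory name are hypothetical, and it relies only on mkdir either creating the directory or failing. It guards against a second run started through the same wrapper; it does nothing about other programs that are reading or writing the affected files.

```shell
# Sketch: allow only one wrapped run per list file at a time.
# Assumption: the lock directory name below is hypothetical.
# This does NOT protect against other programs touching the same files.
run_exclusively() {
    local list_file=$1
    shift
    local lock_dir="${list_file}.lock"      # hypothetical lock location

    # mkdir either creates the directory or fails, so only one caller wins.
    if ! mkdir -- "$lock_dir" 2>/dev/null; then
        echo "another run appears to hold $lock_dir" >&2
        return 1
    fi

    "$@"                    # run the given command while holding the lock
    local status=$?

    rmdir -- "$lock_dir"    # release the lock
    return "$status"
}
```

The command to run is passed after the list path, for example the utility with “--replace” and the same list. If the shell running the wrapper is killed, the lock directory is left behind and must be removed by hand before the next run.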
In our tests, this tool fixed files at a rate of about 100k files per hour, or 55 GB per hour. However, performance will vary with your file sizes and how busy the file system is. We recommend using this tool if you have fewer than 500k files to fix. You can stop it via “kill” or control-C and run it again later; it will resume with any files not yet fixed.
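To get a rough idea of how long a run might take, the file-count rate quoted above can be applied to the number of entries in a list. This is only a back-of-the-envelope sketch: the function name estimate_hours is hypothetical, it assumes the list holds one path per line, and it ignores file sizes and file system load. Because the list may still contain files that were fixed after it was generated, the result is an upper bound on the file count.

```shell
# Sketch: estimate run time from the number of entries in a list.
# Assumptions: one path per line in the list; the 100k files/hour rate
# is the test figure quoted above and will vary in practice.
estimate_hours() {
    local list_file=$1
    local files_per_hour=100000
    local n

    n=$(wc -l < "$list_file")
    awk -v n="$n" -v r="$files_per_hour" \
        'BEGIN { printf "%d files, about %.1f hours\n", n, n / r }'
}
```

At that rate, the recommended limit of 500k files corresponds to roughly five hours of run time.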
The new utility logs its actions to the user’s Lustre directory at
/p/<lustre_fs>/<user_or_group>/.size-mismatch-affected-files.<date>.<uid>.log
An example, for user janedoe1 on CZ file system lustre1:
# if run just to list files that need to be fixed
(oslic8):~$ /usr/bin/find-file-size-mismatches /p/lustre1/janedoe1/size-mismatch-affected-files.2025-aug-17.txt
/p/lustre1/janedoe1/test.dd.36
/p/lustre1/janedoe1/test.dd.37

# run to fix all files for user janedoe1 on lustre1
(oslic8):~$ /usr/bin/find-file-size-mismatches --replace /p/lustre1/janedoe1/size-mismatch-affected-files.2025-aug-17.txt
2025-08-19T17-56-44 - find-file-size-mismatches:159 - INFO: starting replacement of '/p/lustre1/janedoe1/test.dd.36'
2025-08-19T17-56-45 - find-file-size-mismatches:221 - INFO: '/p/lustre1/janedoe1/test.dd.36' replaced: usage reduced by approximately 27136 bytes
2025-08-19T17-56-45 - find-file-size-mismatches:159 - INFO: starting replacement of '/p/lustre1/janedoe1/test.dd.37'
2025-08-19T17-56-45 - find-file-size-mismatches:221 - INFO: '/p/lustre1/janedoe1/test.dd.37' replaced: usage reduced by approximately 27136 bytes

# now there should be no more files which need to be fixed
(oslic8):~$ /usr/bin/find-file-size-mismatches /p/lustre1/janedoe1/size-mismatch-affected-files.2025-aug-17.txt

# and the log file shows what was done
(oslic8):~$ head -n1 /p/lustre1/janedoe1/.size-mismatch-affected-files.2025-aug-17.28153.log
2025-08-19T17-56-44 - find-file-size-mismatches:159 - INFO: starting replacement of '/p/lustre1/janedoe1/test.dd.36'
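The log file name in the example follows the pattern given earlier: the list's directory, then a leading dot, the list name without “.txt”, the numeric uid, and “.log”. The helper below is a small sketch for working out that path from a list path. The function name log_path_for is hypothetical, and the rule is inferred from the example rather than taken from the utility's code; it also assumes the list is given as a full path.

```shell
# Sketch: derive the log path that corresponds to a list file.
# Assumptions: the naming rule is inferred from the example above, and
# the list path contains at least one "/" (i.e. it is a full path).
log_path_for() {
    local list_file=$1
    local uid=${2:-$(id -u)}        # default to the current user's uid
    local dir=${list_file%/*}       # directory part of the list path
    local base=${list_file##*/}     # file name part
    base=${base%.txt}               # drop the .txt extension

    printf '%s/.%s.%s.log\n' "$dir" "$base" "$uid"
}
```

Given the list path and uid 28153 from the example, this prints /p/lustre1/janedoe1/.size-mismatch-affected-files.2025-aug-17.28153.log.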
This new utility is single-threaded, so it is too slow for users with very large numbers of affected files or very large files (>10TB or so). We recommend that these users use mpiFileUtils instead; please contact the LC Hotline if you need assistance.
The intent of this self-service approach is to reduce the coordination required, reduce the chance of a misunderstanding resulting in unsafe concurrent access, and spread the work among more people (“many hands make light work”) to get it done faster.
More examples of paths to the lists of affected files follow. NOTE: the data for different file systems was collected on different days, so the dates in the filenames differ.
- /p/lustre1/aname1/size-mismatch-affected-files.2025-aug-15.txt
- /p/czlustre1/fname1/size-mismatch-affected-files.2025-aug-17.txt
- /p/czlustre2/fname1/size-mismatch-affected-files.2025-aug-17.txt
- /p/czlustre5/aname2/size-mismatch-affected-files.2025-aug-14.txt
These files are owned by root, with the same group as their directory, and so are readable by the user or project that the list corresponds to.
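Since each file system has its own list with its own date, it can be convenient to locate all of your lists in one pass. The function below is a sketch under stated assumptions: the name find_my_lists is hypothetical, and it assumes every list sits at /p/<lustre_fs>/<user_or_group>/ as in the examples above.

```shell
# Sketch: print every affected-files list for one user or project.
# Assumptions: lists live at <root>/<lustre_fs>/<name>/ as in the examples;
# root defaults to /p and name defaults to the current user.
find_my_lists() {
    local root=${1:-/p}
    local name=${2:-${USER:-$(id -un)}}
    local f

    for f in "$root"/*/"$name"/size-mismatch-affected-files.*.txt; do
        # An unmatched pattern stays literal, so skip anything that
        # does not actually exist.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run with no arguments, it searches under /p for the current user; a project directory name can be passed as the second argument.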
Notes:
We are working on some statistics for number of files per user-or-project and number of files by file-size, so the Hotline (or someone) can reach out to those users for whom the tool is too slow.
We are working on a multithreaded tool to help a larger set of users. Our goal with this release is to let people make forward progress while we improve the tools.
Lists of affected files will be available on the SCF, using the same naming convention, in the next few days.
/usr/bin/find-file-size-mismatches will be available there as well and will work the same way.
We will gather data on affected files again in the future. This will let us know what files haven’t yet been addressed, so we can update users.
Since the size-mismatch issue does not affect data integrity, we do not need to fix every file. We just need to fix most of them to reduce wasted space and unnecessary workload on the Lustre servers.