Bad disk troubleshooting

A failing disk can usually be identified in one of the following ways:

  • The Replica Server logs report IO errors for a particular disk.
  • The latency of one server is significantly higher than that of the other servers. If further investigation shows that the IO wait of one disk on that server is also significantly higher than that of its other disks, the disk is very likely a slow disk.
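
The IO wait comparison in the second check can be sketched as follows. This is only an illustration, not a Pegasus tool: it assumes a Linux host, takes two snapshots of /proc/diskstats, and computes the average time per completed I/O for each device, similar to the await column of iostat. The function names and the threshold factor of 5 are hypothetical choices.

```python
from statistics import median


def parse_diskstats(text):
    """Parse /proc/diskstats text into {device: (ios_completed, ms_spent)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        name = fields[2]
        reads, read_ms = int(fields[3]), int(fields[6])
        writes, write_ms = int(fields[7]), int(fields[10])
        stats[name] = (reads + writes, read_ms + write_ms)
    return stats


def average_wait_ms(before, after):
    """Average milliseconds per completed I/O between two snapshots."""
    waits = {}
    for name, (ios_after, ms_after) in after.items():
        if name not in before:
            continue
        ios_before, ms_before = before[name]
        delta_ios = ios_after - ios_before
        if delta_ios > 0:  # idle devices have no meaningful average
            waits[name] = (ms_after - ms_before) / delta_ios
    return waits


def slow_disks(waits, factor=5.0):
    """Devices whose average wait exceeds `factor` times the median wait.

    The factor is an arbitrary illustrative threshold (an assumption).
    """
    if len(waits) < 2:
        return []
    mid = median(waits.values())
    return sorted(n for n, w in waits.items() if mid > 0 and w > factor * mid)
```

To use it, read /proc/diskstats twice, a few seconds apart, pass both texts through parse_diskstats, and feed the results to average_wait_ms and slow_disks. Partitions also appear in /proc/diskstats, so it may be worth keeping only the whole-disk devices before comparing.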

Bad disk blacklist

Pegasus supports a disk blacklist. To take a bad disk offline, first add it to the disk blacklist file on the Replica Server where the disk is located. The path of this file is set by the following configuration:

    data_dirs_black_list_file = /home/work/.pegasus_data_dirs_black_list

Then log in to that server and edit the file, listing one data directory per line. For example, to disable ssd2 and ssd3:

    /home/work/ssd2
    /home/work/ssd3
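
If the edit needs to be scripted, a minimal sketch could look like the one below. It assumes the blacklist file holds one data directory per line, which matches the log output shown later in this section. The helper name add_to_black_list is hypothetical and not part of Pegasus.

```python
from pathlib import Path


def add_to_black_list(black_list_file, data_dirs):
    """Append data_dirs to the blacklist file, skipping existing entries.

    Assumes one directory per line. Returns the newly added entries.
    """
    path = Path(black_list_file)
    existing = []
    if path.exists():
        existing = [
            line.strip()
            for line in path.read_text().splitlines()
            if line.strip()
        ]
    # Compare without trailing slashes so "/a/b" and "/a/b/" count as equal.
    known = {entry.rstrip("/") for entry in existing}
    added = []
    for data_dir in data_dirs:
        key = data_dir.strip().rstrip("/")
        if key and key not in known:
            known.add(key)
            added.append(key)
    if added:
        path.write_text("\n".join(existing + added) + "\n")
    return added
```

Calling add_to_black_list("/home/work/.pegasus_data_dirs_black_list", ["/home/work/ssd2", "/home/work/ssd3"]) twice leaves the file unchanged the second time, so the script is safe to rerun.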


Restart service

After the bad disks are added to the blacklist, the Replica Server must be restarted for the change to take effect. It is recommended to restart the Replica Server process on that server by following the High availability restart steps.

After the restart, records like the following appear in the server log, indicating that the blacklisted disks have been applied:

    data_dirs_black_list_file[/home/work/.pegasus_data_dirs_black_list] found, apply it
    black_list[1] = [/home/work/ssd2/]
    black_list[2] = [/home/work/ssd3/]
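
Checking these records by hand is error-prone on a busy log, so the check can be sketched as a small script. The regular expressions below are based only on the sample records above. The function names are hypothetical, and real log lines may carry prefixes such as timestamps, which is why the patterns search within each line instead of matching it whole.

```python
import re

FOUND_RE = re.compile(
    r"data_dirs_black_list_file\[(?P<file>[^\]]+)\] found, apply it"
)
ENTRY_RE = re.compile(r"black_list\[(?P<index>\d+)\] = \[(?P<dir>[^\]]+)\]")


def applied_black_list(log_text):
    """Return (black_list_file, [dirs]) from the most recent startup in the log.

    Returns (None, []) if no blacklist record is found.
    """
    black_list_file = None
    dirs = []
    for line in log_text.splitlines():
        found = FOUND_RE.search(line)
        if found:
            # A new "found" record means a new startup: start over.
            black_list_file = found.group("file")
            dirs = []
            continue
        entry = ENTRY_RE.search(line)
        if entry and black_list_file is not None:
            dirs.append(entry.group("dir"))
    return black_list_file, dirs


def missing_from_black_list(log_text, expected_dirs):
    """Return the expected directories that the log does not show as applied."""
    _, dirs = applied_black_list(log_text)
    applied = {d.rstrip("/") for d in dirs}
    return [d for d in expected_dirs if d.rstrip("/") not in applied]
```

An empty result from missing_from_black_list(log_text, ["/home/work/ssd2", "/home/work/ssd3"]) means every disk that was meant to go offline shows up in the applied blacklist.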
