1. A method comprising:
    gathering, by a processing element of a scheduler node, information about a cluster of nodes in a high-performance computing system, wherein the high-performance computing system is in a production state with one or more computational workloads getting executed thereon;
    periodically sending, by the processing element, one or more test-computing jobs for execution on each node, of the cluster of nodes, to measure one or more performance metrics thereof;
    receiving, by the processing element, measured performance metrics from each node in response to the one or more test-computing jobs executed thereon;
    recording, by the processing element, in a database, the measured performance metrics received from each node, wherein recording the measured performance metrics comprises:
        determining, by the processing element, whether to update the database by comparing the measured performance metrics of a current instance with the performance metrics recorded in the database at a previous instance, for each node of the cluster of nodes; and
        in response to determining, based on the comparison, a change in the performance metrics, updating, by the processing element, the database with the measured performance metrics of the current instance;
    receiving a request to run one or more computational jobs on the high-performance computing system;
    selecting, by the processing element based on the received request and the measured performance metrics recorded in the database, a set of nodes from the cluster of nodes for running the requested one or more computational jobs on the high-performance computing system; and
    sorting, by the processing element, the cluster of nodes in the fastest to slowest order of an actual processing speed based on a performance metric selected from the measured performance metrics.
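By way of illustration only, the recording, sorting, and selecting steps recited above could be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `MetricsDatabase`, `sort_nodes`, and `select_nodes`, the metric key `gflops`, and the node names are hypothetical. An in-memory dictionary stands in for the database, a higher metric value is assumed to mean a faster node, and the request is reduced to a node count.

```python
from dataclasses import dataclass, field


@dataclass
class MetricsDatabase:
    """Hypothetical in-memory stand-in for the per-node metrics database."""
    records: dict = field(default_factory=dict)

    def record(self, node: str, measured: dict) -> bool:
        """Compare the current measurement with the previously recorded one
        and update the database only when the metrics changed.
        Returns True if an update was written."""
        if self.records.get(node) == measured:
            return False  # no change since the previous instance
        self.records[node] = dict(measured)
        return True


def sort_nodes(db: MetricsDatabase, metric: str) -> list:
    """Order nodes fastest to slowest by the selected metric.
    Assumption: a larger metric value means a faster node."""
    return sorted(db.records, key=lambda n: db.records[n][metric], reverse=True)


def select_nodes(db: MetricsDatabase, requested_count: int, metric: str) -> list:
    """Pick the fastest nodes for a request (assumed to ask for a node count)."""
    ranked = sort_nodes(db, metric)
    if requested_count > len(ranked):
        raise ValueError("request asks for more nodes than have recorded metrics")
    return ranked[:requested_count]


# Example: metrics returned by test-computing jobs on three nodes.
db = MetricsDatabase()
db.record("node-a", {"gflops": 410.0})
db.record("node-b", {"gflops": 455.0})
db.record("node-c", {"gflops": 380.0})

# A repeated, unchanged measurement does not trigger a database update.
unchanged = db.record("node-a", {"gflops": 410.0})

# A changed measurement replaces the previously recorded value.
changed = db.record("node-c", {"gflops": 470.0})

fastest_two = select_nodes(db, 2, "gflops")
```

In this sketch the database is written only when a node's measured metrics differ from the previously recorded ones, and selection draws from the sorted list so that a request is served by the nodes with the highest recorded speed.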