Optimally determine the number of cores to use when setting up a new cluster, based on:

  1. the number of cores available (see note);

  2. the amount of free memory available on the local machine;

  3. the number of cores requested vs. the number available: if more cores are requested than are available, the number of cores used is reduced so that jobs run in approximately even-sized batches. (E.g., if 16 cores are available but 50 are needed, running 3 batches of 16 plus a single batch of 2 -- i.e., 4 batches total -- takes the same time as running 4 batches of 13, so only 13 cores need be used.)
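The batch-evening arithmetic in point 3 can be sketched in a few lines of R (a simplified illustration, not the package's actual implementation; the function name is hypothetical):

```r
## Hypothetical helper illustrating point 3: when more cores are needed
## than are available, shrink the per-batch core count so all batches
## are approximately even-sized, without increasing the number of batches.
evenBatchCores <- function(needed, available) {
  if (needed <= available) return(needed)
  nBatches <- ceiling(needed / available)  # e.g., ceiling(50 / 16) = 4
  ceiling(needed / nBatches)               # e.g., ceiling(50 / 4)  = 13
}

evenBatchCores(50, 16)  # 13 cores: 4 batches of at most 13 jobs each
```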

  memRequiredMB = 500,
  maxNumClusters = parallel::detectCores(),
  NumCoresAvailable = parallel::detectCores(),
  availMem = pemisc::availableMemory()/1e+06

  memRequiredMB = 500,
  maxNumClusters = parallel::detectCores()



  memRequiredMB      The amount of memory needed, in MB.

  maxNumClusters     The number of nodes (cores) needed (i.e., requested).

  NumCoresAvailable  The number of cores available on the local machine
                     (see note).

  availMem           The amount of free memory (RAM) available to use, in MB.


An integer specifying the number of cores to use.
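Putting the three constraints together (memory, cores available, and even batching), the overall determination might look like the following sketch. The function name and structure are illustrative assumptions, not the package's actual implementation:

```r
## Hypothetical sketch combining the constraints described above.
optimalCores <- function(memRequiredMB, maxNumClusters,
                         NumCoresAvailable = parallel::detectCores(),
                         availMem) {
  byMemory <- floor(availMem / memRequiredMB)  # cores the free RAM can support
  cores <- min(maxNumClusters, NumCoresAvailable, byMemory)
  ## even out batch sizes when more jobs are requested than cores used
  nBatches <- ceiling(maxNumClusters / max(cores, 1))
  max(1L, min(cores, ceiling(maxNumClusters / nBatches)))
}

optimalCores(memRequiredMB = 500, maxNumClusters = 50,
             NumCoresAvailable = 16, availMem = 16000)  # 13
```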


R hardcodes the maximum number of socket connections it can use (currently set to 128 in R 4.1). Three of these are reserved for the main R process, so practically speaking, a user can create at most 125 connections, e.g., when creating a cluster. See https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28.

We limit this a bit further here just in case the user already has open connections.
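Such a cap could be applied as in the following sketch. The constant 125 comes from the limit noted above; the safety margin is an arbitrary illustrative choice, not the package's actual value:

```r
## Cap the cluster size by R's socket-connection limit (128 hardcoded,
## 3 reserved, so at most 125 usable), leaving extra headroom for
## connections the user may already have open.
maxUsableConnections <- 125L
alreadyOpen <- nrow(showConnections())  # user connections currently open
safetyMargin <- 5L                      # illustrative margin
connCap <- maxUsableConnections - alreadyOpen - safetyMargin
connCap  # upper bound on cluster size from the connection limit
```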