Implementation#
Overview#
For the mathematical background of this implementation, see Cellamare et al. (2022). This section does not cover that background; it describes the flow of the algorithm.
The GLM algorithm runs iteratively. After initialization, each node computes partial beta values, which are then aggregated on the central node into overall beta values. With the overall beta values, each node computes its local deviance, and the central node combines the local deviances into the global deviance. If the global deviance differs from that of the previous iteration by less than the tolerance threshold, or if the maximum number of iterations has been reached, the algorithm stops. Otherwise, the central node again requests the nodes to compute partial beta values and the cycle repeats.
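The stopping logic described above can be sketched as a small driver loop. This is only an illustration, not the actual implementation: `run_iterations`, its parameters, and the use of an absolute difference between deviances are assumptions made for this example, and `compute_deviance` stands in for one full round of partial and central computations.

```python
def run_iterations(compute_deviance, tolerance=1e-8, max_iterations=25):
    """Repeat rounds until the deviance stops changing (hypothetical helper).

    compute_deviance(iteration) stands in for one full round: the nodes
    compute partial betas, the central node aggregates them, and the
    global deviance of that round is returned.
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        deviance = compute_deviance(iteration)
        # Assumed criterion: absolute change in deviance below the tolerance.
        if previous is not None and abs(deviance - previous) < tolerance:
            return iteration, deviance, True
        previous = deviance
    # Maximum number of iterations reached without convergence.
    return max_iterations, previous, False
```

For example, a deviance sequence that settles quickly, such as `lambda i: 10 + 0.5 ** i`, ends the loop well before `max_iterations`, while a sequence that keeps changing runs until the iteration limit and reports `False` for convergence.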
Partials#
Partials are the computations that are executed on each node. A partial has access only to the data stored on its own node. The partials run in parallel across the nodes.
compute_local_betas#
This function computes the partial beta values on a node and is called once in every iteration. To achieve this, the following steps are executed:
The data is loaded and the design matrix is computed using the formula.
Several privacy checks are executed - see the privacy guards section for more information.
The eta values are computed using the overall beta values provided by the central function. In the first iteration, when no overall beta values are available yet, the family’s link function is used to compute the eta values instead.
The mu (mean), z (working response) and W (weight) values are computed using the eta values.
These values are used to compute the new partial beta values.
The partial beta values are returned to the central function, together with the following metadata: the dispersion, the number of observations, the number of variables, and the sum of the outcome variable.
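The steps above can be sketched for a single node as follows. This is a minimal sketch under several assumptions, not the actual function: it fixes the family to Poisson with a log link, it assumes that the "partial beta values" are the weighted cross-products X^T W X and X^T W z, and the name `local_betas_sketch`, the starting value `y + 0.1`, and the dictionary keys are all hypothetical. Data loading, privacy checks, and the dispersion are left out to keep it short.

```python
import numpy as np


def local_betas_sketch(X, y, beta=None):
    """One node's contribution for a Poisson GLM with log link (illustrative)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta is None:
        # First iteration: no overall betas yet, so start from the outcome
        # and apply the link function (log) to obtain eta.
        mu = y + 0.1  # assumed offset that keeps log() finite when y == 0
        eta = np.log(mu)
    else:
        eta = X @ np.asarray(beta, dtype=float)
        mu = np.exp(eta)  # inverse of the log link
    # For Poisson with log link: d(eta)/d(mu) = 1 / mu and Var(mu) = mu.
    z = eta + (y - mu) / mu  # working response
    w = mu                   # diagonal of the weight matrix W
    return {
        "XtWX": X.T @ (X * w[:, None]),  # weighted cross-product of the design matrix
        "XtWz": X.T @ (w * z),           # weighted cross-product with the working response
        "num_observations": int(len(y)),
        "num_variables": int(X.shape[1]),
        "y_sum": float(y.sum()),
    }
```

Returning cross-products rather than raw rows means the node shares only aggregates whose size depends on the number of variables, not on the number of observations.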
compute_local_deviance#
The local deviance function computes the deviance on each node. The deviance is computed using the overall beta values provided by the central function. The following steps are executed:
The data is loaded and the design matrix is computed using the formula.
Several privacy checks are executed - see the privacy guards section for more information. Note that these checks should give the same results as those in the compute_local_betas function, unless the data provided to the node has changed in the meantime (for instance, if the node was restarted).
The eta values are computed using the overall beta values provided by the central function. The central function provides the betas from the previous iteration as well as the current iteration. These are used to compute the old and new eta values.
The mu values are computed for both the old and the new eta values.
The local deviance is computed using the mu values and the outcome variable.
The null deviance is computed using the global average of the outcome variable.
The local deviance of the current iteration, the previous iteration, and the local null deviance are returned to the central function.
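As an illustration of the steps above, the following sketch computes the three deviances for a Poisson family with a log link. The family choice, the function names `poisson_deviance` and `local_deviance_sketch`, and the returned dictionary keys are assumptions for this example; data loading and privacy checks are omitted.

```python
import numpy as np


def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu))."""
    # The term y * log(y / mu) is defined as 0 when y == 0.
    ratio = np.where(y > 0, y / mu, 1.0)
    return float(2.0 * np.sum(y * np.log(ratio) - (y - mu)))


def local_deviance_sketch(X, y, beta_old, beta_new, global_mean_y):
    """Deviance of one node for the previous, current, and null model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu_old = np.exp(X @ beta_old)  # inverse link applied to the old eta
    mu_new = np.exp(X @ beta_new)  # inverse link applied to the new eta
    # The null model predicts the global average of the outcome everywhere.
    mu_null = np.full(y.shape, float(global_mean_y))
    return {
        "deviance_old": poisson_deviance(y, mu_old),
        "deviance_new": poisson_deviance(y, mu_new),
        "deviance_null": poisson_deviance(y, mu_null),
    }
```

Note that the null deviance needs the global average rather than the local one, which is why the central function has to supply it.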
get_categorical_levels#
This function is used to collect all the existing categorical levels from the nodes. This is done prior to any computation of betas or deviances. All levels of the categorical variables are used in the computation of the betas and deviances to ensure that the resulting matrices are compatible and can easily be aggregated.
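To see why shared levels matter, consider the sketch below. It merges the levels reported by each node and then encodes a categorical column against the merged list, so every node produces the same number of columns in the same order. The helpers `merge_levels` and `encode_column`, the sorted ordering, and the use of the first level as reference category are assumptions for this example and are not taken from the implementation.

```python
def merge_levels(per_node_levels):
    """Union the categorical levels reported by all nodes (hypothetical helper)."""
    merged = {}
    for node_levels in per_node_levels:
        for column, levels in node_levels.items():
            merged.setdefault(column, set()).update(levels)
    # Sort so that every node uses the same column order.
    return {column: sorted(levels) for column, levels in merged.items()}


def encode_column(values, levels):
    """Dummy-encode values against the shared levels, dropping the first level."""
    return [[1.0 if value == level else 0.0 for level in levels[1:]]
            for value in values]
```

With these helpers, a node that never observed a given level still emits an all-zero column for it, so the matrices from different nodes line up and can be summed directly.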
Central (glm)#
The central part is responsible for the orchestration and aggregation of the algorithm. It executes the following steps:
Collect organizations in collaboration.
Start an iteration, which consists of the following steps:
Create partial tasks to compute the local betas.
Collect the partial beta results.
Compute the overall beta values. Also, compute the overall dispersion, number of observations, number of variables, and the average of the outcome variable.
Create new partial tasks to compute the local deviance.
Collect the partial deviance results.
Compute the overall deviance. This is simply the sum of the local deviances.
If the change in deviance compared to the previous iteration is below the tolerance threshold, the algorithm has converged. If the algorithm has converged or the maximum number of iterations has been reached, the algorithm stops. Otherwise, start a new iteration.
Use the final overall beta values to compute standard errors, Z-values and p-values.
Return the overall beta values together with the standard errors, Z-values, and p-values. Also, return the dispersion, number of observations, number of variables, number of iterations, deviance, null deviance, and whether the algorithm has converged or not.
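The aggregation and the final statistics in the steps above can be sketched as follows. This is an illustration under stated assumptions: each partial result is assumed to carry the cross-products X^T W X and X^T W z under the hypothetical keys `"XtWX"` and `"XtWz"`, the covariance is taken as the dispersion times the inverse of the summed X^T W X, and the p-values are assumed to be two-sided and based on the standard normal distribution. The function names are hypothetical as well.

```python
import math

import numpy as np


def aggregate_betas(partials):
    """Sum the node contributions and solve for the overall betas."""
    xtwx = sum(p["XtWX"] for p in partials)
    xtwz = sum(p["XtWz"] for p in partials)
    beta = np.linalg.solve(xtwx, xtwz)
    return beta, xtwx


def summarize(beta, xtwx, dispersion=1.0):
    """Standard errors, Z-values and two-sided normal p-values."""
    covariance = dispersion * np.linalg.inv(xtwx)
    std_err = np.sqrt(np.diag(covariance))
    z_values = beta / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_values = np.array([math.erfc(abs(z) / math.sqrt(2.0)) for z in z_values])
    return std_err, z_values, p_values
```

Because the cross-products are plain sums over observations, adding the matrices from all nodes gives the same result as computing them on the pooled data, which is what makes this kind of aggregation possible without sharing individual records.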