clophfit.fitting.utils#
Utility functions for fitting modules.
Functions#
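| parse_remove_outliers(spec) | Parse outlier specification "method:threshold:min_keep". |
| identify_outliers_zscore(residuals, threshold=2.0) | Identify outliers using the Z-score method on a 1D array of residuals. |
| reweight_from_residuals(ds, residuals) | Update y_errc in a Dataset from the mean absolute residuals of each label. |
| flag_trend_outliers(x, y, threshold=3.0) | Flag outliers using robust Theil-Sen regression of y on x. |
| fit_trendline(x, y) | Fit a robust Theil-Sen regression line. |
| smoothness(y) | Calculate the smoothness of a curve. |
| roughness(y) | Calculate the roughness of a curve. |
| outlier_scores_extended(x, y) | Compute outlier scores for each point using geometric deviation. |
| apply_outlier_mask(ds, threshold=0.2, min_keep=3) | Mask outlier points iteratively in each DataArray of a Dataset. |
| fit_rel_error_from_residuals(df, sigma_floor) | Estimate proportional error (alpha) per label via moment estimator. |
| fit_noise_model_nnls(df, sigma_floor_fixed=None, rel_error_fixed=None) | Fit heteroscedastic noise model via non-negative least squares. |
| fit_noise_model_from_residuals(df, rel_error=0.003) | Fit per-label noise model parameters from first-pass residuals. |
| fit_gain_and_rel_error_from_residuals(df, sigma_floor) | Fit gain and rel_error per label from residuals with known floor. |
| compute_binding_slope(ph, pka, s0, s1) | Compute |dS/dpH| for the Henderson-Hasselbalch equation. |
| compute_plate_slopes(results) | Compute per-well per-label ∂S/∂pH from pass-1 fit results. |
| fit_ph_slope_noise(df, noise_model, plate_slopes) | Fit global sigma_ph from excess variance after per-label model. |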
Module Contents#
- clophfit.fitting.utils.parse_remove_outliers(spec)#
Parse outlier specification "method:threshold:min_keep".
- Parameters:
spec (str) – The string to parse.
- Returns:
A tuple of method, threshold, min_keep.
- Return type:
tuple[str, float, int]
Examples
"zscore:2.5:5" -> ("zscore", 2.5, 5)
"method" -> ("method", 2.0, 1)
- clophfit.fitting.utils.identify_outliers_zscore(residuals, threshold=2.0)#
Identify outliers using the Z-score method on a 1D array of residuals.
- Parameters:
residuals (np.ndarray) – The residuals to analyze.
threshold (float) – The Z-score threshold beyond which a point is considered an outlier.
- Returns:
A boolean mask where True indicates an outlier.
- Return type:
ArrayMask
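The Z-score rule described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the library's code: the name `zscore_outlier_mask` is hypothetical, and the choices of a two-sided test on |Z|, the plain mean, and the population standard deviation are assumptions.

```python
import numpy as np


def zscore_outlier_mask(residuals, threshold=2.0):
    """Hypothetical sketch: flag residuals whose |Z-score| exceeds threshold."""
    r = np.asarray(residuals, dtype=float)
    std = r.std()  # population standard deviation (ddof=0), an assumption
    if std == 0:
        # Constant residuals: no point deviates, so nothing is flagged.
        return np.zeros(r.shape, dtype=bool)
    z = (r - r.mean()) / std
    return np.abs(z) > threshold


mask = zscore_outlier_mask([0.1, -0.2, 0.05, 0.0, -0.1, 0.15, 5.0])
```

With the population standard deviation, the largest |Z| attainable among n points is sqrt(n - 1), so for short titrations a high threshold can never fire; this may be one reason the default here is as low as 2.0.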
- clophfit.fitting.utils.reweight_from_residuals(ds, residuals)#
Update y_errc in a Dataset from the mean absolute residuals of each label.
- clophfit.fitting.utils.flag_trend_outliers(x, y, threshold=3.0)#
Flag outliers using robust Theil-Sen regression of y on x.
A point is flagged if its residual falls far below the trendline (Z-score < -threshold) OR if its x-value is extremely low compared to the population (Z-score < -threshold).
- Parameters:
x (pd.Series) – The independent variable (e.g., maximum signal, mean).
y (pd.Series) – The dependent variable (e.g., signal span, std, or dynamic range).
threshold (float) – The Z-score threshold for flagging an outlier.
- Returns:
A boolean Series of the same length as x, True for outliers.
- Return type:
pd.Series
- clophfit.fitting.utils.fit_trendline(x, y)#
Fit a robust Theil-Sen regression line.
- Parameters:
x (pd.Series) – The independent variable.
y (pd.Series) – The dependent variable.
- Returns:
Slope and intercept.
- Return type:
tuple[float, float]
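For readers unfamiliar with Theil-Sen regression, the estimator takes the median of all pairwise slopes, which makes it insensitive to a minority of outliers. The sketch below is a plain NumPy illustration under stated assumptions: `theil_sen_line` is a hypothetical name, and the intercept convention (median of y minus slope times median of x) is one common choice that may differ from what `fit_trendline` uses internally.

```python
import numpy as np


def theil_sen_line(x, y):
    """Hypothetical sketch of a Theil-Sen fit: median of pairwise slopes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(x.size, k=1)  # all index pairs with i < j
    dx = x[j] - x[i]
    valid = dx != 0  # skip pairs with identical x (undefined slope)
    slope = float(np.median((y[j] - y[i])[valid] / dx[valid]))
    # Assumed intercept convention: median(y) - slope * median(x).
    intercept = float(np.median(y) - slope * np.median(x))
    return slope, intercept


x = np.arange(10.0)
y = 2.0 * x + 1.0
y[7] = 100.0  # a single gross outlier
slope, intercept = theil_sen_line(x, y)
```

Here only 9 of the 45 pairwise slopes involve the corrupted point, so the median slope and the intercept stay at the values of the underlying line, 2 and 1.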
- clophfit.fitting.utils.smoothness(y)#
Calculate the smoothness of a curve.
Sum of |consecutive diffs| / total span. = 1 for perfectly monotone, > 1 for noisy/non-monotone.
- Parameters:
y (np.ndarray) – The signal array.
- Returns:
The smoothness value.
- Return type:
float
- clophfit.fitting.utils.roughness(y)#
Calculate the roughness of a curve.
Excess path fraction: 0 = perfectly monotone, 1 = all noise, flat-safe. roughness = (consec - span) / consec.
- Parameters:
y (np.ndarray) – The signal array.
- Returns:
The roughness value.
- Return type:
float
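The two curve metrics above share the same ingredients: the summed absolute consecutive differences (`consec`) and the total span. A minimal sketch follows, assuming the span is `max(y) - min(y)` and that a flat curve yields a roughness of 0; both are assumptions, and the function names are hypothetical.

```python
import numpy as np


def path_and_span(y):
    """Return (sum of |consecutive diffs|, max - min) for a 1D signal."""
    y = np.asarray(y, dtype=float)
    consec = float(np.abs(np.diff(y)).sum())
    span = float(y.max() - y.min())  # assumed definition of "total span"
    return consec, span


def smoothness_sketch(y):
    consec, span = path_and_span(y)
    # Undefined for a flat curve; returning NaN there is an assumption.
    return consec / span if span > 0 else float("nan")


def roughness_sketch(y):
    consec, span = path_and_span(y)
    # Flat-safe: a curve with no movement at all gets roughness 0.
    return (consec - span) / consec if consec > 0 else 0.0


monotone = [1.0, 2.0, 4.0, 7.0]
zigzag = [0.0, 2.0, 1.0, 3.0]
flat = [5.0, 5.0, 5.0]
```

For the zigzag curve the path length is 5 and the span is 3, giving a smoothness of 5/3 and a roughness of 0.4. Whenever both are defined, the two metrics are related by roughness = 1 - 1/smoothness.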
- clophfit.fitting.utils.outlier_scores_extended(x, y)#
Compute outlier scores for each point using geometric deviation.
Uses a hybrid approach for edge points:
- If edge_step > 2 * local_step: anomalously large jump → use full projection deviation
- Elif wrong direction (reversal): use projection deviation
- Else (correct direction / plateau approach): score = 0
For internal points: triangle inequality score.
- Parameters:
x (np.ndarray) – x-values (e.g. pH or concentration).
y (np.ndarray) – Observed y-values.
- Returns:
Per-point outlier scores (non-negative; higher = more anomalous).
- Return type:
np.ndarray
Examples
>>> import numpy as np
>>> from clophfit.fitting.utils import outlier_scores_extended
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y = np.array([10.0, 8.0, 15.0, 4.0, 2.0])
>>> scores = outlier_scores_extended(x, y)
>>> bool(scores[2] > 0.4)
True
- clophfit.fitting.utils.apply_outlier_mask(ds, threshold=0.2, min_keep=3)#
Mask outlier points iteratively in each DataArray of a Dataset.
Removes the single worst outlier (if above threshold) and recomputes scores, repeating until no score exceeds the threshold or fewer than min_keep unmasked points remain.
- Parameters:
ds (Dataset) – Dataset to process (deep-copied; input is not modified).
threshold (float, optional) – Outlier score above which a point is masked. Default is 0.2.
min_keep (int, optional) – Minimum number of unmasked points to retain. Default is 3.
- Returns:
A new Dataset with outlier points masked.
- Return type:
Dataset
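The remove-worst-then-rescore loop described above can be sketched on a plain NumPy array. Everything here is illustrative: `triangle_scores` is a hypothetical stand-in for the real scoring function (it scores interior points only, using y alone, and ignores the edge-point rules), and `iterative_mask` operates on one array rather than on a Dataset.

```python
import numpy as np


def triangle_scores(y):
    """Hypothetical score: excess path through each interior point, in [0, 1]."""
    y = np.asarray(y, dtype=float)
    scores = np.zeros(y.size)
    if y.size < 3:
        return scores
    left = np.abs(y[1:-1] - y[:-2])
    right = np.abs(y[2:] - y[1:-1])
    direct = np.abs(y[2:] - y[:-2])
    path = left + right
    inner = np.zeros_like(path)
    # (path - direct) / path, leaving 0 where the path length is 0.
    np.divide(path - direct, path, out=inner, where=path > 0)
    scores[1:-1] = inner
    return scores


def iterative_mask(y, threshold=0.2, min_keep=3, score_fn=triangle_scores):
    """Drop the single worst point, rescore, and repeat; returns a keep-mask."""
    y = np.asarray(y, dtype=float)
    keep = np.ones(y.size, dtype=bool)
    while keep.sum() > min_keep:
        idx = np.flatnonzero(keep)
        scores = score_fn(y[idx])  # scores are recomputed on survivors only
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        keep[idx[worst]] = False
    return keep


keep = iterative_mask([10.0, 8.0, 15.0, 4.0, 2.0])
```

Rescoring after each removal matters: in this example the point at index 1 initially scores about 0.44 only because its neighbour at index 2 is anomalous, and it drops to 0 once that neighbour is masked.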
- clophfit.fitting.utils.fit_rel_error_from_residuals(df, sigma_floor)#
Estimate proportional error (alpha) per label via moment estimator.
Assumes the simplified noise model sigma^2 = floor^2 + alpha^2 * that^2 (no Poisson gain term). With floor known from buffer measurements and using model-predicted values that (written \(\hat{y}\) in the formula) in the denominator to avoid noise-in-variables bias, the closed-form moment estimator is:
\[\hat{\alpha}^2 = \frac{\overline{r^2} - \sigma_{\text{floor}}^2}{\overline{\hat{y}^2}}\]
- Parameters:
df (pd.DataFrame) – DataFrame with columns label (str), resid_raw (float), and predicted (float – the model-predicted signal at each point). Typically from clophfit.fitting.residuals.collect_multi_residuals().
sigma_floor (dict[str, float]) – Known read-noise floor per label, e.g. from tit.bg_noise.
- Returns:
Per-label proportional error estimate alpha (non-negative).
- Return type:
dict[str, float]
Examples
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> y_pred = np.linspace(50, 500, 200)
>>> floor, true_alpha = 5.0, 0.02
>>> sigma = np.sqrt(floor**2 + (true_alpha * y_pred) ** 2)
>>> resid = sigma * rng.standard_normal(200)
>>> df = pd.DataFrame({"label": "1", "resid_raw": resid, "predicted": y_pred})
>>> alpha = fit_rel_error_from_residuals(df, sigma_floor={"1": floor})
>>> round(alpha["1"], 2)  # should be close to true_alpha=0.02
0.02
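The closed-form estimator itself is short enough to write out for a single label. The sketch below is an assumption-laden illustration: `alpha_moment_estimate` is a hypothetical name, and clipping a negative variance estimate to zero is assumed from the documented non-negative return value.

```python
import numpy as np


def alpha_moment_estimate(resid, predicted, floor):
    """Hypothetical single-label sketch of the moment estimator for alpha."""
    resid = np.asarray(resid, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    excess = np.mean(resid**2) - floor**2  # variance not explained by the floor
    alpha_sq = excess / np.mean(predicted**2)
    # Assumed handling: clip at zero when the floor explains all the variance.
    return float(np.sqrt(max(alpha_sq, 0.0)))


# Noise-free check: residuals whose squares equal the model variance exactly.
y_pred = np.linspace(50, 500, 200)
floor, true_alpha = 5.0, 0.02
sigma = np.sqrt(floor**2 + (true_alpha * y_pred) ** 2)
signs = np.where(np.arange(200) % 2 == 0, 1.0, -1.0)
alpha = alpha_moment_estimate(sigma * signs, y_pred, floor)
```

Using deterministic residuals of magnitude sigma removes sampling noise, so the estimator returns the true alpha up to floating-point rounding.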
- clophfit.fitting.utils.fit_noise_model_nnls(df, sigma_floor_fixed=None, rel_error_fixed=None)#
Fit heteroscedastic noise model via non-negative least squares.
Model: \(\sigma^2 = \sigma_\text{floor}^2 + \text{gain} \cdot y + \alpha^2 \cdot y^2\)
Uses scipy.optimize.nnls() to enforce non-negativity on all parameters, which stabilises estimates when \(y\) and \(y^2\) are highly collinear (typical for narrow-range titrations).
- Parameters:
df (pd.DataFrame) – Residual DataFrame with columns label, resid_raw, predicted.
sigma_floor_fixed (dict[str, float] | None) – If given, fix floor per label and only fit gain and alpha.
rel_error_fixed (dict[str, float] | None) – If given, fix alpha per label and only fit floor and gain.
- Returns:
(sigma_floor, gain, alpha) per label, all non-negative.
- Return type:
tuple[dict[str, float], dict[str, float], dict[str, float]]
- Raises:
ValueError – If both sigma_floor_fixed and rel_error_fixed are provided.
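The three-parameter fit can be sketched for a single label as a regression of squared residuals on the columns 1, y, and y^2. This is an illustration only: `fit_noise_nnls_sketch` is a hypothetical name, the column rescaling is a conditioning choice made here rather than something the source describes, and the fixed-parameter variants are omitted.

```python
import numpy as np
from scipy.optimize import nnls


def fit_noise_nnls_sketch(resid, predicted):
    """Hypothetical single-label sketch: r^2 ~ floor^2 + gain*y + alpha^2*y^2."""
    r2 = np.asarray(resid, dtype=float) ** 2
    y = np.asarray(predicted, dtype=float)
    design = np.column_stack([np.ones_like(y), y, y**2])
    # Rescale columns to unit norm; positive scaling keeps the constraints valid.
    scale = np.linalg.norm(design, axis=0)
    coef, _ = nnls(design / scale, r2)
    floor_sq, gain, alpha_sq = coef / scale
    return float(np.sqrt(floor_sq)), float(gain), float(np.sqrt(alpha_sq))


# Noise-free check: squared residuals equal the model variance exactly.
y_pred = np.linspace(50, 500, 200)
var = 5.0**2 + 0.5 * y_pred + (0.02 * y_pred) ** 2
floor, gain, alpha = fit_noise_nnls_sketch(np.sqrt(var), y_pred)
```

Because the coefficients are constrained to be non-negative, taking square roots of the fitted floor^2 and alpha^2 is always safe, which an unconstrained least-squares fit would not guarantee.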
- clophfit.fitting.utils.fit_noise_model_from_residuals(df, rel_error=0.003)#
Fit per-label noise model parameters from first-pass residuals.
With rel_error fixed, the noise equation becomes linear in the two remaining unknowns (the squared floor and the gain), which are estimated via non-negative least squares.
- Parameters:
df (pd.DataFrame) – DataFrame with columns label, resid_raw, predicted.
rel_error (float | dict[str, float], optional) – Fixed proportional error. A single float is broadcast to all labels. Default is 0.003.
- Returns:
(sigma_floor_dict, gain_dict) per label (non-negative).
- Return type:
tuple[dict[str, float], dict[str, float]]
- clophfit.fitting.utils.fit_gain_and_rel_error_from_residuals(df, sigma_floor)#
Fit gain and rel_error per label from residuals with known floor.
Uses non-negative least squares on r^2 - floor^2 = gain * y + alpha^2 * y^2 to handle collinearity between \(y\) and \(y^2\).
- Parameters:
df (pd.DataFrame) – DataFrame with columns label, resid_raw, predicted.
sigma_floor (dict[str, float]) – Known noise floor per label.
- Returns:
(gain_dict, rel_error_dict) per label (non-negative).
- Return type:
tuple[dict[str, float], dict[str, float]]
- clophfit.fitting.utils.compute_binding_slope(ph, pka, s0, s1)#
Compute |dS/dpH| for the Henderson-Hasselbalch equation.
dS/dpH = (s1 - s0) * ln(10) * t / (1 + t)^2 where t = 10^(pka - ph). Returns the absolute value (sign irrelevant for variance).
- Parameters:
ph (numpy.ndarray)
pka (float)
s0 (float)
s1 (float)
- Return type:
numpy.ndarray
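The slope formula above translates directly into vectorised NumPy. The function name `binding_slope_sketch` is hypothetical; the expression is the one stated in the description.

```python
import numpy as np


def binding_slope_sketch(ph, pka, s0, s1):
    """Sketch of |dS/dpH| = |(s1 - s0) * ln(10) * t / (1 + t)^2|, t = 10^(pka - ph)."""
    ph = np.asarray(ph, dtype=float)
    t = 10.0 ** (pka - ph)
    return np.abs((s1 - s0) * np.log(10.0) * t / (1.0 + t) ** 2)


ph = np.array([5.0, 7.0, 9.0])
slopes = binding_slope_sketch(ph, pka=7.0, s0=100.0, s1=500.0)
```

The slope peaks at pH = pKa, where t = 1 and the value is |s1 - s0| * ln(10) / 4, and it is symmetric around that point. Readings near the pKa are therefore the most sensitive to pH error.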
- clophfit.fitting.utils.compute_plate_slopes(results)#
Compute per-well per-label ∂S/∂pH from pass-1 fit results.
- Parameters:
results (dict[str, Any]) – Fit results keyed by well (must have .result and .dataset).
- Returns:
{well: {label: slope_array}}.
- Return type:
dict[str, dict[str, np.ndarray]]
- clophfit.fitting.utils.fit_ph_slope_noise(df, noise_model, plate_slopes)#
Fit global sigma_ph from excess variance after per-label model.
After subtracting the per-label noise model variance, the leftover r^2 - var_model is regressed against (dS/dpH)^2 via NNLS.
- Parameters:
df (pd.DataFrame) – Residual DataFrame with columns label, well, resid_raw, predicted, and raw_i.
noise_model (PlateNoiseModel) – Per-label noise model (floor, gain, alpha) fitted in the same pass.
plate_slopes (dict[str, dict[str, np.ndarray]]) – Per-well per-label derivative |dS/dpH| arrays.
- Returns:
Global sigma_ph estimate (>= 0).
- Return type:
float
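With a single regressor and no intercept, the non-negative least-squares solution has a closed form: the ordinary least-squares coefficient clipped at zero. The sketch below assumes exactly that setup (excess variance = sigma_ph^2 * (dS/dpH)^2) and works on flat arrays; `sigma_ph_sketch` is a hypothetical name, and the handling of the DataFrame, PlateNoiseModel, and per-well slope lookup is left out.

```python
import numpy as np


def sigma_ph_sketch(resid, var_model, slope):
    """Hypothetical sketch: one-column NNLS of (r^2 - var_model) on slope^2."""
    excess = np.asarray(resid, dtype=float) ** 2 - np.asarray(var_model, dtype=float)
    x = np.asarray(slope, dtype=float) ** 2
    denom = float(np.dot(x, x))
    if denom == 0.0:
        return 0.0  # no slope information: pH noise cannot be identified
    # One-column NNLS: least-squares coefficient, clipped at zero.
    sigma_ph_sq = max(float(np.dot(x, excess)) / denom, 0.0)
    return float(np.sqrt(sigma_ph_sq))


# Noise-free check with a known sigma_ph of 0.05 pH units.
slope = np.array([10.0, 50.0, 230.0, 50.0, 10.0])
var_model = np.full(5, 25.0)
resid = np.sqrt(var_model + 0.05**2 * slope**2)
sigma_ph = sigma_ph_sketch(resid, var_model, slope)
```

Because the regressor is the squared slope, points near the pKa (where the slope is largest) dominate the estimate, while plateau points contribute almost nothing.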