Utilities

Assorted helper functions, classes, and methods are collected here.

class rep.utils.Binner(values, bins_number)
bins_number()
get_bins(values)
get_bins_dumb(values)

This is the same as the previous function, but a bit slower and more naive.

set_limits(limits)
split_into_bins(*arrays)

Splits the data of parallel arrays into bins; the first array is the binning variable.
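
To illustrate what splitting parallel arrays by a binning variable can look like, here is a minimal NumPy sketch. It is not the Binner implementation: the helper name split_into_bins_sketch is hypothetical, and the choice of equal-population bins (edges at evenly spaced percentiles) is an assumption made for the example.

```python
import numpy as np


def split_into_bins_sketch(bins_number, *arrays):
    """Hypothetical helper: split parallel arrays by bins of the first array."""
    binning = np.asarray(arrays[0])
    # Assumption: interior bin edges sit at evenly spaced percentiles,
    # so every bin receives roughly the same number of entries.
    percentiles = np.linspace(0, 100, bins_number + 1)[1:-1]
    limits = np.percentile(binning, percentiles)
    # Bin index of every element, from 0 to bins_number - 1.
    indices = np.searchsorted(limits, binning)
    return [
        tuple(np.asarray(a)[indices == i] for a in arrays)
        for i in range(bins_number)
    ]


x = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.6])  # binning variable
w = np.arange(6)                              # parallel array
bins = split_into_bins_sketch(2, x, w)
```

Each element of bins is a tuple holding the slice of every input array that falls into that bin, in the original order of the inputs.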

class rep.utils.Flattener(data, sample_weight=None)

Prepares a normalization function for a set of values: the function transforms those values to a uniform distribution on [0, 1].

Parameters:
  • data (list or numpy.array) – predictions
  • sample_weight (None or list or numpy.array) – weights
Returns:

normalization function

Example of usage:

>>> normalizer = Flattener(signal)
>>> hist(normalizer(background))
>>> hist(normalizer(signal))
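
The idea of flattening can be sketched with a weighted empirical CDF. This is an illustration only, not the Flattener source: the name make_flattener_sketch is hypothetical, and both the midpoint convention for the CDF and the use of linear interpolation are assumptions.

```python
import numpy as np


def make_flattener_sketch(data, sample_weight=None):
    """Hypothetical helper: build a function mapping values to [0, 1]."""
    data = np.asarray(data, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(data))
    order = np.argsort(data)
    sorted_data = data[order]
    sorted_weight = np.asarray(sample_weight, dtype=float)[order]
    # Weighted empirical CDF at each sorted point (midpoint convention).
    cdf = (np.cumsum(sorted_weight) - 0.5 * sorted_weight) / sorted_weight.sum()

    def normalizer(values):
        # Linear interpolation; values outside the data range are clamped.
        return np.interp(values, sorted_data, cdf)

    return normalizer


signal = np.array([3.0, 1.0, 2.0, 4.0])
normalizer = make_flattener_sketch(signal)
flat = normalizer(signal)
```

Applied to the sample it was built from, the resulting function yields values spread evenly over [0, 1]; applied to another sample, it shows how that sample is distributed relative to the first.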
rep.utils.calc_ROC(prediction, signal, sample_weight=None, max_points=10000)

Calculate the ROC curve

Parameters:
  • prediction (array or list) – predictions
  • signal (array or list) – true labels
  • sample_weight (None or array or list) – weights
  • max_points (int) – maximum number of points used on the ROC curve
Returns:

(tpr, tnr), (err_tnr, err_tpr), thresholds
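
As a rough picture of the quantities involved, the sketch below computes weighted true positive and true negative rates at every distinct prediction value. It is a simplified illustration, not calc_ROC itself: the name roc_points_sketch is hypothetical, the error estimates and the max_points limit are left out, and the convention that an event passes when prediction >= threshold is an assumption.

```python
import numpy as np


def roc_points_sketch(prediction, signal, sample_weight=None):
    """Hypothetical helper: weighted TPR and TNR for each threshold."""
    prediction = np.asarray(prediction, dtype=float)
    signal = np.asarray(signal).astype(bool)
    if sample_weight is None:
        sample_weight = np.ones(len(prediction))
    sample_weight = np.asarray(sample_weight, dtype=float)

    thresholds = np.unique(prediction)  # sorted distinct values
    sig_total = sample_weight[signal].sum()
    bkg_total = sample_weight[~signal].sum()
    # Assumption: an event passes the cut when prediction >= threshold.
    tpr = np.array([
        sample_weight[signal & (prediction >= t)].sum() for t in thresholds
    ]) / sig_total
    tnr = np.array([
        sample_weight[~signal & (prediction < t)].sum() for t in thresholds
    ]) / bkg_total
    return tpr, tnr, thresholds


tpr, tnr, thresholds = roc_points_sketch(
    prediction=[0.1, 0.4, 0.35, 0.8],
    signal=[0, 0, 1, 1],
)
```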

rep.utils.calc_feature_correlation_matrix(df)

Calculate correlation matrix

Parameters: df (pandas.DataFrame) – data
Returns: correlation matrix for the data frame
Return type: numpy.ndarray
rep.utils.calc_hist_with_errors(x, weight=None, bins=60, normed=True, x_range=None, ignored_sideband=0.0)

Calculate data for an error-bar plot (for plotting a PDF with errors)

Parameters:
  • x (list or numpy.array) – data
  • weight (None or list or numpy.array) – weights
Returns:

tuple (x points (list), y points (list), y-point errors (list), x-point errors (list))
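
The four returned sequences map directly onto the arguments of an error-bar plot. The sketch below shows one way to compute such values with NumPy. It is not the library's code: the name hist_with_errors_sketch is hypothetical, normalization and ignored_sideband are omitted, and taking the bin error as the square root of the sum of squared weights is an assumption.

```python
import numpy as np


def hist_with_errors_sketch(x, weight=None, bins=60, x_range=None):
    """Hypothetical helper: histogram contents with per-bin errors."""
    x = np.asarray(x, dtype=float)
    weight = np.ones(len(x)) if weight is None else np.asarray(weight, dtype=float)
    counts, edges = np.histogram(x, bins=bins, range=x_range, weights=weight)
    # Assumption: bin error is sqrt(sum of squared weights),
    # which reduces to sqrt(N) for unit weights.
    sum_w2, _ = np.histogram(x, bins=edges, weights=weight ** 2)
    y_err = np.sqrt(sum_w2)
    x_points = 0.5 * (edges[1:] + edges[:-1])  # bin centres
    x_err = 0.5 * (edges[1:] - edges[:-1])     # half bin widths
    return x_points, counts, y_err, x_err


x_points, y_points, y_err, x_err = hist_with_errors_sketch(
    [0.1, 0.2, 0.3, 0.9], bins=2, x_range=(0, 1)
)
```

The result can be passed to a plotting call such as matplotlib's errorbar(x_points, y_points, yerr=y_err, xerr=x_err).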

rep.utils.check_sample_weight(y_true, sample_weight)

Checks the weights and returns a normalized version

rep.utils.get_columns_dict(columns)

Get (new column: old column) dict expressions

Parameters: columns (list[str]) – column names
Return type: dict
rep.utils.get_columns_in_df(df, columns)

Get columns in data frame using numexpr evaluation

Parameters:
  • df (pandas.DataFrame) – data
  • columns (None or list[str]) – necessary columns
Returns:

data frame with the specified columns
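
Selecting columns through expression evaluation can be pictured with pandas' own DataFrame.eval. The sketch below is an assumption-laden illustration rather than the behaviour of get_columns_in_df: the name columns_from_expressions_sketch is hypothetical, and naming each output column after its expression is a choice made for the example.

```python
import pandas as pd


def columns_from_expressions_sketch(df, columns):
    """Hypothetical helper: evaluate each entry of `columns` against df."""
    result = pd.DataFrame(index=df.index)
    for column in columns:
        # A plain column name evaluates to that column;
        # an expression such as "a + b" is computed from existing columns.
        result[column] = df.eval(column)
    return result


df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
selected = columns_from_expressions_sketch(df, ["a", "a + b"])
```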

rep.utils.get_efficiencies(prediction, spectator, sample_weight=None, bins_number=20, thresholds=None, errors=False, ignored_sideband=0.0)

Construct efficiency function dependent on spectator for each threshold

Different score functions are available: Efficiency, Precision, Recall, F1Score, and others from sklearn.metrics

Parameters:
  • prediction – list of probabilities
  • spectator – list of spectator’s values
  • bins_number – int, number of bins for the plot
  • thresholds – list of prediction thresholds (default: the prediction cuts for which the efficiency will be [0.2, 0.4, 0.5, 0.6, 0.8])

Returns:

if errors=False OrderedDict threshold -> (x_values, y_values)

if errors=True OrderedDict threshold -> (x_values, y_values, y_err, x_err)

All the parts: x_values, y_values, y_err, x_err are numpy.arrays of the same length.
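
The shape of the errors=False result can be sketched as follows. This is an illustration under stated assumptions, not the get_efficiencies implementation: the name efficiencies_sketch is hypothetical, weights and errors are ignored, bins are taken as equal-width over the spectator range, and an event is counted as passing when prediction > threshold.

```python
from collections import OrderedDict

import numpy as np


def efficiencies_sketch(prediction, spectator, thresholds, bins_number=20):
    """Hypothetical helper: per-bin passing fraction for each threshold."""
    prediction = np.asarray(prediction, dtype=float)
    spectator = np.asarray(spectator, dtype=float)
    # Assumption: equal-width bins spanning the spectator range.
    edges = np.linspace(spectator.min(), spectator.max(), bins_number + 1)
    # Bin index per event; clip so the maximum falls into the last bin.
    bin_index = np.clip(
        np.searchsorted(edges, spectator, side="right") - 1, 0, bins_number - 1
    )
    x_values = 0.5 * (edges[1:] + edges[:-1])

    result = OrderedDict()
    for threshold in thresholds:
        passed = prediction > threshold
        y_values = np.array([
            passed[bin_index == i].mean() if np.any(bin_index == i) else np.nan
            for i in range(bins_number)
        ])
        result[threshold] = (x_values, y_values)
    return result


effs = efficiencies_sketch(
    prediction=[0.2, 0.8, 0.6, 0.9],
    spectator=[0.0, 1.0, 2.0, 3.0],
    thresholds=[0.5],
    bins_number=2,
)
x_values, y_values = effs[0.5]
```

Empty bins are reported as NaN in this sketch so that x_values and y_values keep the same length.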

rep.utils.reorder_by_first(*arrays)

Applies the same permutation to all passed arrays; the permutation is the one that sorts the first passed array.
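
The described behaviour corresponds to a single argsort applied to every array. A minimal NumPy sketch follows; the name reorder_by_first_sketch is hypothetical and the exact return type of the library function is not assumed.

```python
import numpy as np


def reorder_by_first_sketch(*arrays):
    """Hypothetical helper: sort all arrays by the order of the first."""
    arrays = [np.asarray(a) for a in arrays]
    order = np.argsort(arrays[0])  # permutation that sorts the first array
    return [a[order] for a in arrays]


keys, labels = reorder_by_first_sketch([3, 1, 2], ["c", "a", "b"])
```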

rep.utils.train_test_split(*arrays, **kw_args)

Does the same thing as scikit-learn's train_test_split, but preserves columns in DataFrames. It uses the same parameters (test_size, train_size, random_state) and has the same interface.

Parameters: arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split
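
What "preserves columns" means in practice can be shown with a shared random permutation and positional indexing. The sketch below is a simplified stand-in, not the library function: the name split_preserving_columns_sketch is hypothetical, train_size is not handled, and rounding the test size up is an assumption.

```python
import numpy as np
import pandas as pd


def split_preserving_columns_sketch(*arrays, test_size=0.25, random_state=None):
    """Hypothetical helper: split parallel inputs, keeping DataFrames intact."""
    n = len(arrays[0])
    rng = np.random.RandomState(random_state)
    permutation = rng.permutation(n)  # one permutation shared by all inputs
    n_test = int(np.ceil(test_size * n))
    test_idx, train_idx = permutation[:n_test], permutation[n_test:]

    result = []
    for a in arrays:
        if isinstance(a, pd.DataFrame):
            # iloc keeps column names and dtypes.
            result += [a.iloc[train_idx], a.iloc[test_idx]]
        else:
            a = np.asarray(a)
            result += [a[train_idx], a[test_idx]]
    return result


df = pd.DataFrame({"x": np.arange(8), "y": np.arange(8) * 2.0})
labels = np.arange(8)
train_df, test_df, train_labels, test_labels = split_preserving_columns_sketch(
    df, labels, test_size=0.25, random_state=0
)
```

The outputs follow the same ordering as scikit-learn's function: a train part and a test part for each input, in the order the inputs were passed.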
rep.utils.weighted_percentile(array, percentiles, sample_weight=None, array_sorted=False, old_style=False)
