Utilities¶

Helpful functions, classes, and methods are collected here.

class rep.utils.Binner(values, bins_number)¶

bins_number()¶

get_bins(values)¶

get_bins_dumb(values)¶: This is the same as the previous function, but slower and more naive

set_limits(limits)¶

split_into_bins(*arrays)¶: Splits the data of parallel arrays into bins; the first array is the binning variable
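The idea of splitting parallel arrays by the bin of the first array can be sketched with plain numpy. This is only an illustration, not the Binner implementation: the helper name `split_into_bins_sketch` and the explicit `bin_edges` argument are assumptions made for the example.

```python
import numpy as np

def split_into_bins_sketch(bin_edges, *arrays):
    """Hypothetical helper: group parallel arrays by the bin of the first array."""
    binning_values = np.asarray(arrays[0])
    # Index of the bin each value falls into, in 0 .. len(bin_edges)
    bin_indices = np.searchsorted(bin_edges, binning_values)
    result = []
    for bin_index in range(len(bin_edges) + 1):
        mask = bin_indices == bin_index
        # The same mask is applied to every array, so rows stay aligned
        result.append([np.asarray(a)[mask] for a in arrays])
    return result

x = np.array([0.1, 0.7, 0.4, 0.9])   # binning variable
w = np.array([1.0, 2.0, 3.0, 4.0])   # parallel array, e.g. weights
bins = split_into_bins_sketch([0.5], x, w)
```

With a single edge at 0.5 the data falls into two bins, and each bin holds the matching slices of both `x` and `w`.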

class rep.utils.Flattener(data, sample_weight=None)¶

Prepares a normalization function for a set of values; the function transforms them to a uniform distribution on [0, 1].

Parameters:	
data (list or numpy.array) – predictions
sample_weight (None or list or numpy.array) – weights

Return func:	normalization function

Example of usage:

>>> normalizer = Flattener(signal)
>>> hist(normalizer(background))
>>> hist(normalizer(signal))
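One common way to obtain such a normalization function is to interpolate the weighted empirical CDF of the reference sample. The sketch below assumes that approach; `make_flattener` is a hypothetical stand-in written for illustration, and the actual Flattener may differ in its details.

```python
import numpy as np

def make_flattener(data, sample_weight=None):
    """Hypothetical stand-in: returns a function mapping values to the weighted empirical CDF of data."""
    data = np.asarray(data, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(data))
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(data)
    sorted_data = data[order]
    cdf = np.cumsum(sample_weight[order])
    cdf /= cdf[-1]

    def normalizer(values):
        # Linear interpolation of the CDF; values below the sample map to 0, above it to 1
        return np.interp(values, sorted_data, cdf, left=0.0, right=1.0)

    return normalizer

rng = np.random.RandomState(0)
signal = rng.normal(size=10000)
flat = make_flattener(signal)(signal)
counts, _ = np.histogram(flat, bins=10, range=(0, 1))
```

Applied to the sample it was built from, the function yields values spread evenly over [0, 1], so `counts` is close to 1000 in every bin.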

rep.utils.calc_ROC(prediction, signal, sample_weight=None, max_points=10000)¶

Calculates the ROC curve.

Parameters:	
prediction (array or list) – predictions
signal (array or list) – true labels
sample_weight (None or array or list) – weights
max_points (int) – maximum number of points used on the ROC curve
Returns:	(tpr, tnr), (err_tnr, err_tpr), thresholds
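The (tpr, tnr) part of the result can be illustrated with a small numpy sketch that computes the weighted true positive rate and true negative rate at each distinct cut. The helper `roc_points_sketch` is hypothetical; it does not compute the error terms and ignores max_points, so it is not a replacement for calc_ROC.

```python
import numpy as np

def roc_points_sketch(prediction, signal, sample_weight=None):
    """Hypothetical helper: weighted TPR and TNR for every distinct cut on prediction."""
    prediction = np.asarray(prediction, dtype=float)
    signal = np.asarray(signal).astype(bool)
    if sample_weight is None:
        sample_weight = np.ones(len(prediction))
    sample_weight = np.asarray(sample_weight, dtype=float)

    thresholds = np.unique(prediction)
    tpr = np.empty(len(thresholds))
    tnr = np.empty(len(thresholds))
    for i, cut in enumerate(thresholds):
        passed = prediction >= cut
        # Fraction of signal weight that passes the cut
        tpr[i] = sample_weight[passed & signal].sum() / sample_weight[signal].sum()
        # Fraction of background weight that is rejected by the cut
        tnr[i] = sample_weight[~passed & ~signal].sum() / sample_weight[~signal].sum()
    return tpr, tnr, thresholds

tpr, tnr, thresholds = roc_points_sketch([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

As the cut rises, `tpr` can only decrease and `tnr` can only increase, which traces the curve from one corner to the other.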

rep.utils.calc_feature_correlation_matrix(df)¶

Calculate correlation matrix

Parameters:	df (pandas.DataFrame) – data
Returns:	correlation matrix for the DataFrame
Return type:	numpy.ndarray
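For reference, a correlation matrix of the same shape can be computed directly with numpy. The sketch assumes the numeric columns of the DataFrame are available as a 2-D array (one column per feature); the column values below are made up for the example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
# Three features as columns: a, a perfectly correlated copy, and an anti-correlated copy
data = np.column_stack([a, 2.0 * a, -a])

# rowvar=False treats each column as one variable
corr = np.corrcoef(data, rowvar=False)
```

The result is a symmetric 3 x 3 numpy.ndarray with ones on the diagonal.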

rep.utils.calc_hist_with_errors(x, weight=None, bins=60, normed=True, x_range=None, ignored_sideband=0.0)¶

Calculates data for an error-bar plot (for plotting a PDF with errors).

Parameters:	
x (list or numpy.array) – data
weight (None or list or numpy.array) – weights

Returns:	tuple (x points (list), y points (list), y-point errors (list), x-point errors (list))
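A normalized histogram with per-bin errors can be sketched as follows. The error model here is an assumption (square root of the summed squared weights for y, half the bin width for x), the helper name `hist_with_errors_sketch` is hypothetical, and ignored_sideband is not handled; the sketch also returns numpy arrays rather than lists.

```python
import numpy as np

def hist_with_errors_sketch(x, weight=None, bins=60, x_range=None):
    """Hypothetical helper: normalized histogram points with assumed error estimates."""
    x = np.asarray(x, dtype=float)
    if weight is None:
        weight = np.ones(len(x))
    weight = np.asarray(weight, dtype=float)

    sums, edges = np.histogram(x, bins=bins, range=x_range, weights=weight)
    sums_sq, _ = np.histogram(x, bins=bins, range=x_range, weights=weight ** 2)
    widths = np.diff(edges)
    # Normalize so that the histogram integrates to one over the binned range
    norm = sums.sum() * widths

    x_points = 0.5 * (edges[:-1] + edges[1:])
    y_points = sums / norm
    y_err = np.sqrt(sums_sq) / norm
    x_err = 0.5 * widths
    return x_points, y_points, y_err, x_err

x_points, y_points, y_err, x_err = hist_with_errors_sketch(
    [0.1, 0.2, 0.6, 0.9], bins=2, x_range=(0, 1))
```

The four arrays can be passed to an error-bar plotting routine such as matplotlib's `errorbar(x_points, y_points, yerr=y_err, xerr=x_err)`.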

rep.utils.check_sample_weight(y_true, sample_weight)¶: Checks the weights, returns normalized version

rep.utils.get_columns_dict(columns)¶

Gets a dict of (new column: old column) expressions

Parameters:	columns (list[str]) – column names
Return type:	dict

rep.utils.get_columns_in_df(df, columns)¶

Gets columns from a data frame using numexpr evaluation

Parameters:	
df (pandas.DataFrame) – data
columns (None or list[str]) – necessary columns

Returns:	data frame with the specified columns

rep.utils.get_efficiencies(prediction, spectator, sample_weight=None, bins_number=20, thresholds=None, errors=False, ignored_sideband=0.0)¶

Constructs the efficiency as a function of the spectator for each threshold

Different score functions are available: Efficiency, Precision, Recall, F1Score, and others from sklearn.metrics

Parameters:

prediction – list of probabilities
spectator – list of spectator values
bins_number – int, number of bins for the plot
thresholds – list of prediction thresholds (default: the prediction cuts for which the efficiency will be [0.2, 0.4, 0.5, 0.6, 0.8])

Returns:

if errors=False OrderedDict threshold -> (x_values, y_values)

if errors=True OrderedDict threshold -> (x_values, y_values, y_err, x_err)

All the parts: x_values, y_values, y_err, x_err are numpy.arrays of the same length.
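For a single threshold, the (x_values, y_values) pair can be sketched as the fraction of events passing the cut in each spectator bin. The sketch assumes equal-width bins and unweighted events, and `efficiency_vs_spectator_sketch` is a hypothetical name; the library function may choose bins, weights, and errors differently.

```python
import numpy as np

def efficiency_vs_spectator_sketch(prediction, spectator, threshold, bins_number=20):
    """Hypothetical helper: efficiency of the cut prediction > threshold in each spectator bin."""
    prediction = np.asarray(prediction, dtype=float)
    spectator = np.asarray(spectator, dtype=float)

    # Assumption: equal-width bins spanning the spectator range
    edges = np.linspace(spectator.min(), spectator.max(), bins_number + 1)
    bin_indices = np.searchsorted(edges, spectator, side="right") - 1
    # The maximum value lands on the last edge, so it is moved into the last bin
    bin_indices = np.clip(bin_indices, 0, bins_number - 1)

    passed = (prediction > threshold).astype(float)
    total = np.bincount(bin_indices, minlength=bins_number)
    n_passed = np.bincount(bin_indices, weights=passed, minlength=bins_number)

    x_values = 0.5 * (edges[:-1] + edges[1:])
    # Empty bins get NaN instead of a division by zero
    y_values = np.where(total > 0, n_passed / np.maximum(total, 1), np.nan)
    return x_values, y_values

x_values, y_values = efficiency_vs_spectator_sketch(
    prediction=[0.9, 0.2, 0.8, 0.7], spectator=[0, 1, 2, 3],
    threshold=0.5, bins_number=2)
```

Repeating the call for several thresholds and collecting the results in an OrderedDict keyed by threshold gives a structure shaped like the documented return value for errors=False.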

rep.utils.reorder_by_first(*arrays)¶: Applies the same permutation to all passed arrays; the permutation is the one that sorts the first array
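The behaviour described above amounts to an argsort on the first array applied to every array. A minimal numpy sketch, with `reorder_by_first_sketch` as a hypothetical name:

```python
import numpy as np

def reorder_by_first_sketch(*arrays):
    """Hypothetical helper: sort all arrays by the order that sorts the first one."""
    arrays = [np.asarray(a) for a in arrays]
    # Permutation that sorts the first array
    order = np.argsort(arrays[0])
    return [a[order] for a in arrays]

values, labels = reorder_by_first_sketch([3, 1, 2], ["c", "a", "b"])
```

Here `values` comes back sorted and `labels` follows the same permutation, so corresponding entries stay paired.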

rep.utils.train_test_split(*arrays, **kw_args)¶

Does the same thing as sklearn's train_test_split, but preserves columns in DataFrames. Uses the same parameters (test_size, train_size, random_state) and has the same interface

Parameters:	arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split

rep.utils.weighted_percentile(array, percentiles, sample_weight=None, array_sorted=False, old_style=False)¶
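The signature above is listed without a description. As a rough illustration of what a weighted percentile computation can look like, the sketch below interpolates over the cumulative weights. Everything in it is an assumption: percentiles are taken to lie in [0, 1], the midpoint convention for positions is one choice among several, the old_style flag is not modelled, and `weighted_percentile_sketch` is a hypothetical name.

```python
import numpy as np

def weighted_percentile_sketch(array, percentiles, sample_weight=None, array_sorted=False):
    """Hypothetical helper: weighted percentiles by interpolating cumulative weights."""
    array = np.asarray(array, dtype=float)
    percentiles = np.asarray(percentiles, dtype=float)  # assumed to lie in [0, 1]
    if sample_weight is None:
        sample_weight = np.ones(len(array))
    sample_weight = np.asarray(sample_weight, dtype=float)

    if not array_sorted:
        order = np.argsort(array)
        array, sample_weight = array[order], sample_weight[order]

    # Each value sits at the midpoint of its own weight interval
    positions = (np.cumsum(sample_weight) - 0.5 * sample_weight) / sample_weight.sum()
    return np.interp(percentiles, positions, array)

median = weighted_percentile_sketch([5, 1, 3, 2, 4], [0.5])
```

With equal weights the 0.5 percentile of the five values is their median, 3.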
