Estimators (classification and regression)

This module contains wrappers with sklearn interface for different machine learning libraries (TMVA, sklearn, XGBoost).

At first we defined some interface for classification and regressors wrappers, so you can add your own wrappers for another libraries following this interface.

Sklearn wrapper is the same sklearn model, but it operates with pandas.DataFrame data (also supports numpy.ndarray) and can choose just those features, that user pointed in the constructor.

Estimators interfaces (for classification and regression)

There are interfaces for classification and regression wrappers.

Sklearn classifier and regressor

Sklearn wrapper for users is the same as sklearn model, just has one additional parameter features to choose necessary columns for training. If data has numpy.array type then behaviour will be the same as in sklear.

TMVA classifier and regressor

These classes are wrappers for physics machine learning library TMVA used .root format files (c++ library). Now you can simply use it in python. TMVA contains classification and regression algorithms, including neural networks.

TMVA Guide: http://mirror.yandex.ru/gentoo-distfiles/distfiles/TMVAUsersGuide-v4.03.pdf .. automodule:: rep.estimators.tmva

members:
show-inheritance:
 
undoc-members:

XGBoost classifier and regressor

Examples

Classification

  • Prepare dataset
    >>> from sklearn import datasets
    >>> import pandas, numpy
    >>> from rep.utils import train_test_split
    >>> from sklearn.metrics import roc_auc_score
    >>> # iris data
    >>> iris = datasets.load_iris()
    >>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
    >>> labels = iris.target
    >>> # Take just two classes instead of three
    >>> data = data[labels != 2]
    >>> labels = labels[labels != 2]
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Sklearn classification
    >>> from rep.estimators import SklearnClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> # Using gradient boosting with default settings
    >>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
    >>> # Training classifier
    >>> sk.fit(train_data, train_labels)
    >>> pred = sk.predict_proba(test_data)
    >>> print pred
    [[  9.99842983e-01   1.57016893e-04]
     [  1.45163843e-04   9.99854836e-01]
     [  9.99842983e-01   1.57016893e-04]
     [  9.99827693e-01   1.72306607e-04], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99768518518518523
    
  • TMVA classification
    >>> from rep.estimators import TMVAClassifier
    >>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
    >>> tmva.fit(train_data, train_labels)
    >>> pred = tmva.predict_proba(test_data)
    >>> print pred
    [[  9.99991025e-01   8.97546346e-06]
     [  1.14084636e-04   9.99885915e-01]
     [  9.99991009e-01   8.99060302e-06]
     [  9.99798700e-01   2.01300452e-04], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99999999999999989
    
  • XGBoost classification
    >>> from rep.estimators import XGBoostClassifier
    >>> # XGBoost with default parameters
    >>> xgb = XGBoostClassifier(features=['a', 'b'])
    >>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
    >>> pred = xgb.predict_proba(test_data)
    >>> print pred
    [[ 0.9983651   0.00163494]
     [ 0.00170585  0.99829417]
     [ 0.99845636  0.00154361]
     [ 0.96618336  0.03381656], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99768518518518512
    

Regerssion

  • Prepare dataset
    >>> from sklearn import datasets
    >>> from sklearn.metrics import mean_squared_error
    >>> from rep.utils import train_test_split
    >>> import pandas, numpy
    >>> # diabetes data
    >>> diabetes = datasets.load_diabetes()
    >>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
    >>> data = pandas.DataFrame(diabetes.data, columns=features)
    >>> labels = diabetes.target
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Sklearn regression
    >>> from rep.estimators import SklearnRegressor
    >>> from sklearn.ensemble import GradientBoostingRegressor
    >>> # Using gradient boosting with default settings
    >>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
    >>> # Training classifier
    >>> sk.fit(train_data, train_labels)
    >>> pred = sk.predict(train_data)
    >>> numpy.sqrt(mean_squared_error(train_labels, pred))
    60.666009962879265
    
  • TMVA regression
    >>> from rep.estimators import TMVARegressor
    >>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
    >>> tmva.fit(train_data, train_labels)
    >>> pred = tmva.predict(test_data)
    >>> numpy.sqrt(mean_squared_error(test_labels, pred))
    73.74191838418254
    
  • XGBoost regression
    >>> from rep.estimators import XGBoostRegressor
    >>> # XGBoost with default parameters
    >>> xgb = XGBoostRegressor(features=features[:8])
    >>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
    >>> pred = xgb.predict(test_data)
    >>> numpy.sqrt(mean_squared_error(test_labels, pred))
    65.557743652940133