This module contains wrappers with a scikit-learn interface for several machine learning libraries (TMVA, sklearn, XGBoost).
The module first defines common interfaces for classification and regression wrappers, so you can add your own wrapper for another library by following these interfaces.
The sklearn wrapper behaves like the wrapped sklearn model, but it operates on pandas.DataFrame data (numpy.ndarray is also supported) and can use only the features that the user lists in the constructor.
There are interfaces for classification and regression wrappers.
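A wrapper for another library could follow the same pattern. The sketch below is only an illustration and does not use REP's actual base classes: `ClassifierSketch`, `MajorityClassifier` and their attributes are hypothetical names chosen here, and the real interface in `rep.estimators` may require additional methods.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ClassifierSketch(ABC):
    """Hypothetical minimal interface: sklearn-style fit / predict_proba."""

    def __init__(self, features=None):
        # names of the columns the estimator is allowed to use
        self.features = list(features) if features is not None else None

    @abstractmethod
    def fit(self, X, y, sample_weight=None):
        """Train on X, y and return self."""

    @abstractmethod
    def predict_proba(self, X):
        """Return one list of class probabilities per row of X."""

    def predict(self, X):
        # pick the most probable class for every row
        labels = []
        for probabilities in self.predict_proba(X):
            best = max(range(len(probabilities)), key=probabilities.__getitem__)
            labels.append(self.classes_[best])
        return labels


class MajorityClassifier(ClassifierSketch):
    """Toy wrapper: ignores the features and predicts class frequencies."""

    def fit(self, X, y, sample_weight=None):
        counts = Counter(y)
        self.classes_ = sorted(counts)
        total = float(len(y))
        self.probabilities_ = [counts[c] / total for c in self.classes_]
        return self

    def predict_proba(self, X):
        return [list(self.probabilities_) for _ in X]


clf = MajorityClassifier(features=['a', 'b']).fit([[0.1], [0.2], [0.3]], [0, 1, 1])
print(clf.predict([[0.5]]))   # [1]
```

Keeping `predict` in the base class means a new wrapper only has to provide `fit` and `predict_proba`, which is the design assumed in this sketch.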
For users, the sklearn wrapper works the same as the sklearn model; it just has one additional parameter, features, which selects the columns used for training. If the data is a numpy.ndarray, the behaviour is the same as in sklearn.
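The effect of the features parameter can be pictured with a small stand-alone sketch. It does not call REP itself: `select_features` is a hypothetical helper written for this illustration, and the pass-through behaviour for arrays is an assumption based on the sentence above.

```python
import numpy
import pandas


def select_features(X, features=None):
    """Hypothetical helper: keep only the requested columns of a DataFrame.

    numpy arrays have no column names, so they are passed through unchanged.
    """
    if isinstance(X, pandas.DataFrame) and features is not None:
        return X.loc[:, list(features)]
    return X


data = pandas.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [5.0, 6.0]})

# DataFrame input: only columns 'a' and 'b' would reach the underlying model
selected = select_features(data, features=['a', 'b'])
print(list(selected.columns))   # ['a', 'b']

# ndarray input: all columns are kept, as plain sklearn would use them
array = numpy.asarray(data)
print(select_features(array, features=['a', 'b']).shape)   # (2, 3)
```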
These classes are wrappers for TMVA, a machine learning library used in physics (a C++ library that works with files in the .root format). With these wrappers you can use it directly from Python. TMVA contains classification and regression algorithms, including neural networks.
TMVA Guide: http://mirror.yandex.ru/gentoo-distfiles/distfiles/TMVAUsersGuide-v4.03.pdf

.. automodule:: rep.estimators.tmva
    :members:
    :show-inheritance:
    :undoc-members:

>>> from sklearn import datasets
>>> import pandas, numpy
>>> from rep.utils import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> # iris data
>>> iris = datasets.load_iris()
>>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
>>> labels = iris.target
>>> # Take just two classes instead of three
>>> data = data[labels != 2]
>>> labels = labels[labels != 2]
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
>>> from rep.estimators import SklearnClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> # Using gradient boosting with default settings
>>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
>>> # Training classifier
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict_proba(test_data)
>>> print(pred)
[[ 9.99842983e-01 1.57016893e-04]
[ 1.45163843e-04 9.99854836e-01]
[ 9.99842983e-01 1.57016893e-04]
[ 9.99827693e-01 1.72306607e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518523
>>> from rep.estimators import TMVAClassifier
>>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict_proba(test_data)
>>> print(pred)
[[ 9.99991025e-01 8.97546346e-06]
[ 1.14084636e-04 9.99885915e-01]
[ 9.99991009e-01 8.99060302e-06]
[ 9.99798700e-01 2.01300452e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99999999999999989
>>> from rep.estimators import XGBoostClassifier
>>> # XGBoost with default parameters
>>> xgb = XGBoostClassifier(features=['a', 'b'])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict_proba(test_data)
>>> print(pred)
[[ 0.9983651 0.00163494]
[ 0.00170585 0.99829417]
[ 0.99845636 0.00154361]
[ 0.96618336 0.03381656], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518512
>>> from sklearn import datasets
>>> from sklearn.metrics import mean_squared_error
>>> from rep.utils import train_test_split
>>> import pandas, numpy
>>> # diabetes data
>>> diabetes = datasets.load_diabetes()
>>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
>>> data = pandas.DataFrame(diabetes.data, columns=features)
>>> labels = diabetes.target
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
>>> from rep.estimators import SklearnRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> # Using gradient boosting with default settings
>>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
>>> # Training regressor
>>> sk.fit(train_data, train_labels)
>>> # Note: this example computes the error on the training data
>>> pred = sk.predict(train_data)
>>> numpy.sqrt(mean_squared_error(train_labels, pred))
60.666009962879265
>>> from rep.estimators import TMVARegressor
>>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
73.74191838418254
>>> from rep.estimators import XGBoostRegressor
>>> # XGBoost with default parameters
>>> xgb = XGBoostRegressor(features=features[:8])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
65.557743652940133
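
The number printed in the regression examples is the root mean squared error (RMSE), obtained by taking the square root of sklearn's mean squared error. A minimal numpy-only sketch of the same quantity is given here; the helper name `rmse` and the toy numbers are made up for illustration.

```python
import numpy


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = numpy.asarray(y_true, dtype=float)
    y_pred = numpy.asarray(y_pred, dtype=float)
    return numpy.sqrt(numpy.mean((y_true - y_pred) ** 2))


# every prediction is off by exactly 2, so the RMSE is 2
print(rmse([10, 20, 30, 40], [12, 18, 32, 38]))   # 2.0
```

RMSE is expressed in the same units as the target, so lower values mean predictions closer to the true labels.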