AutoML: mljar-supervised

2020-12-21 02:41

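mljar-supervised is an Automated Machine Learning (AutoML) Python package. It is designed to save data scientists time. It abstracts the common approach to preprocessing data, constructing machine learning models, and tuning hyperparameters to find the best model. It is not a black box: you can see exactly how each ML pipeline is constructed, with a detailed Markdown report for every ML model.
mljar-supervised will help you:
explain and understand your data,
try many different machine learning models,
create Markdown reports from the analysis with details about all models,
save, re-run, and load the analyzed ML models.
It has three built-in modes of work:
Explain mode, ideal for explaining and understanding data, with many data explanations such as decision tree visualization, display of linear model coefficients, permutation importance, and SHAP explanations of the data,
Perform mode, for building ML pipelines to use in production,
Compete mode, for training highly tuned ML models with ensembling and stacking, intended for use in ML competitions.
Of course, you can further customize the details of each mode to meet your requirements.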
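What's good in it?
It uses many algorithms: Baseline, Linear, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Networks, and Nearest Neighbors.
It can do feature preprocessing, such as missing value imputation and conversion of categorical features. It can also preprocess the target values (you won't believe how often that is needed!), for example by converting a categorical target into numeric values.
It can tune hyperparameters with a not-so-random-search algorithm (random search over a defined set of values) and with hill climbing to fine-tune the final models.
It can compute a Baseline for your data, so you will know whether you need machine learning at all, and how much better your ML models are than the Baseline. The Baseline is computed from the prior class distribution for classification and as a simple mean for regression.
It trains Decision Trees with max_depth <= 5 and visualizes them with dtreeviz[1], so you can understand your data better.
It uses simple linear regression and includes its coefficients in the summary report, so you can check which features the linear model relies on most.
It can compute an Ensemble based on the greedy algorithm from the Caruana paper[2].
It can stack models to build a level 2 ensemble (available in Compete mode or after setting the stack_models parameter).
It cares about model explainability: for every algorithm, feature importance is computed based on permutation. In addition, SHAP explanations are computed for every algorithm: feature importance, dependence plots, and decision plots (explanations can be switched off with the explain_level parameter).
It creates Markdown reports from AutoML training, full of ML details and charts.
A golden features[3] algorithm and feature selection[4] are available, and both can be used with any ML algorithm.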
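To make the greedy Ensemble idea more concrete, here is a minimal sketch of forward ensemble selection with replacement, in the spirit of the Caruana paper. It is not the code used by mljar-supervised: the function names, the toy predictions, and the choice of MSE as the validation metric are all assumptions made for illustration.

```python
# Sketch of greedy forward ensemble selection (with replacement).
# All names and data here are illustrative, not taken from mljar-supervised.

def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def greedy_ensemble(model_preds, y_true, max_steps=10):
    """Repeatedly add the model whose inclusion most lowers validation error.

    model_preds maps a model name to its validation predictions.
    Returns (selected_names, best_score); a name may be selected more than once.
    """
    selected = []
    total = [0.0] * len(y_true)  # running sum of the selected predictions
    best_score = float("inf")
    for _ in range(max_steps):
        best_name = None
        for name, preds in model_preds.items():
            k = len(selected) + 1
            candidate = [(s + p) / k for s, p in zip(total, preds)]
            score = mse(y_true, candidate)
            if score < best_score:
                best_score, best_name = score, name
        if best_name is None:  # no candidate improves the score: stop
            break
        selected.append(best_name)
        total = [s + p for s, p in zip(total, model_preds[best_name])]
    return selected, best_score


y_val = [1.0, 2.0, 3.0, 4.0]
preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # biased high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # biased low
    "model_c": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with the target
}
members, score = greedy_ensemble(preds, y_val)
print(members, score)
```

In this toy case the two models with opposite biases are picked because their average cancels the errors, while the anti-correlated model is never added.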
Available modes
In the AutoML modes[5] documentation you can find a table with details about each AutoML mode.
Explain
automl = AutoML(mode="Explain")
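It is intended for cases where the user wants to explain and understand the data.
It splits the dataset 75%/25% into training and test sets.
It uses: Baseline, Linear, Decision Tree, Random Forest, Xgboost, and Neural Network algorithms, plus an ensemble.
It has full explanations: learning curves, importance plots, and SHAP plots.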
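The importance plots rely on permutation-based feature importance. The sketch below shows the general technique in plain Python: shuffle one column at a time and measure how much the error grows. The function names, the toy model, and the data are hypothetical, and the package's own implementation may differ in its details.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Average increase in error when a single column is shuffled."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(metric(y, [predict(row) for row in X_perm]) - base)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model that only looks at the first feature."""
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]
imp = permutation_importance(toy_model, X, y, mse)
print(imp)
```

The second feature gets an importance of exactly zero because the toy model never reads it, while shuffling the first feature raises the error.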
Perform
automl = AutoML(mode="Perform")
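It should be used when the user wants to train a model that will be used in real-life use cases.
It uses 5-fold CV.
It uses: Linear, Random Forest, LightGBM, Xgboost, CatBoost, and Neural Network. It uses ensembling.
Its reports include learning curves and importance plots.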
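For readers unfamiliar with k-fold cross-validation, the following is a minimal sketch of how 5-fold CV partitions the rows. The helper name is hypothetical, and the sketch does no shuffling or stratification, which real implementations usually add.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


folds = list(k_fold_indices(23, k=5))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```

Each model is trained k times, once per fold, and its score is computed on the held-out rows, so every row contributes to validation exactly once.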
Compete
automl = AutoML(mode="Compete")
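It should be used for machine learning competitions.
It uses 10-fold CV.
It uses: Linear, Decision Tree, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Network, and Nearest Neighbors. It uses ensembling and stacking.
Its reports include only learning curves.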
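The three modes can be compared side by side with a small lookup table. The dictionary below only transcribes the descriptions given in this article; it is not the package's internal configuration, and the key names and the helper function are invented for this sketch.

```python
# Summary of the three modes as described in the text above.
# Key names are illustrative; this is not mljar-supervised's internal config.
MODES = {
    "Explain": {
        "validation": "75%/25% train/test split",
        "algorithms": ["Baseline", "Linear", "Decision Tree", "Random Forest",
                       "Xgboost", "Neural Network"],
        "ensemble": True,
        "stacking": False,
        "report": ["learning curves", "importance plots", "SHAP plots"],
    },
    "Perform": {
        "validation": "5-fold CV",
        "algorithms": ["Linear", "Random Forest", "LightGBM", "Xgboost",
                       "CatBoost", "Neural Network"],
        "ensemble": True,
        "stacking": False,
        "report": ["learning curves", "importance plots"],
    },
    "Compete": {
        "validation": "10-fold CV",
        "algorithms": ["Linear", "Decision Tree", "Random Forest", "Extra Trees",
                       "LightGBM", "Xgboost", "CatBoost", "Neural Network",
                       "Nearest Neighbors"],
        "ensemble": True,
        "stacking": True,
        "report": ["learning curves"],
    },
}


def modes_using(algorithm):
    """Return the names of the modes whose default list includes the algorithm."""
    return [name for name, cfg in MODES.items() if algorithm in cfg["algorithms"]]


print(modes_using("CatBoost"))
print(modes_using("Baseline"))
```

The table makes the trade-off visible: explanations shrink and validation gets heavier as you move from Explain to Compete.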
Binary classification example
The fit and predict methods provide a simple interface.
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
    skipinitialspace=True,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[:-1]], df["income"], test_size=0.25
)

automl = AutoML()
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)
Output
Create directory AutoML_1
AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will optimize for metric: logloss
1_Baseline final logloss 0.5519845471086654 time 0.08 seconds
2_DecisionTree final logloss 0.3655910192804364 time 10.28 seconds
3_Linear final logloss 0.38139916864708445 time 3.19 seconds
4_Default_RandomForest final logloss 0.2975204390214936 time 79.19 seconds
5_Default_Xgboost final logloss 0.2731086827200411 time 5.17 seconds
6_Default_NeuralNetwork final logloss 0.319812276905242 time 21.19 seconds
Ensemble final logloss 0.2731086821194617 time 1.43 seconds
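The logloss metric in this output, and the Baseline score in particular, can be reproduced by hand. The sketch below computes binary logloss in plain Python and evaluates a prior-based baseline that predicts the positive-class frequency for every row. The 24% positive rate is an assumed figure chosen for illustration, not a value read from the dataset, and the function name is hypothetical.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logloss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels: 24 positives out of 100 (an assumption, see above).
y = [1] * 24 + [0] * 76
prior = sum(y) / len(y)

# A prior-based baseline predicts the same probability for every row.
baseline = log_loss(y, [prior] * len(y))
print(round(baseline, 3))
```

Under that assumption the baseline lands at roughly 0.55, the same ballpark as the 1_Baseline line above, which is why any useful model should score clearly below that value.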
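AutoML results in the Markdown report[6]
Xgboost Markdown report[7]; take a look at the dependence plots produced by the SHAP package
Decision Tree Markdown report[8]; take a look at the tree visualization
Logistic Regression Markdown report[9]; take a look at the coefficients table, and compare the SHAP plots across Xgboost, Decision Tree, and Logistic Regression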
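Multi-class classification example
Example code for classifying the Optical Recognition of Handwritten Digits dataset. Running this code for less than 30 minutes should give a test accuracy of about 98%.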
import pandas as pd

# scikit-learn utilities
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# mljar-supervised package
from supervised.automl import AutoML

# load the data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(digits.data), digits.target, stratify=digits.target, test_size=0.25,
    random_state=123
)

# train models with AutoML
automl = AutoML(mode="Perform")
automl.fit(X_train, y_train)

# compute the accuracy on test data
predictions = automl.predict_all(X_test)
print(predictions.head())
print("Test accuracy:", accuracy_score(y_test, predictions["label"].astype(int)))
Regression example
Regression example on the Boston house prices data. On the test data it scores a mean squared error (MSE) of about 10.85.
import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2,
# so this example needs an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML  # mljar-supervised

# Load the data
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(housing.data, columns=housing.feature_names),
    housing.target,
    test_size=0.25,
    random_state=123,
)

# train models with AutoML
automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

# compute the MSE on test data
predictions = automl.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))
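The example prints the mean squared error, which is easy to confuse with its square root (RMSE). As a quick illustration with made-up numbers, the sketch below computes both in plain Python; the function names and the toy values are assumptions for this sketch only.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Toy values, not taken from the housing data.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

print("MSE:", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))

# An MSE of about 10.85 corresponds to an RMSE of about 3.29.
print(round(math.sqrt(10.85), 2))
```

Because RMSE is in the same units as the target, an MSE of about 10.85 means a typical error of roughly 3.3 target units.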
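More examples
Income classification[10]: a binary classification task on census data
Iris classification[11]: a multi-class classification on the Iris flowers data
House price regression[12]: a regression task on the Boston houses data
For details, please check the mljar-supervised docs[13].
If you need help, please open an issue or join our Slack channel[14].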
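AutoML report
The report from running AutoML contains a table with the score of each model and the time needed to train it. Each model has a link you can click to see its details. The performance of all ML models is also shown as a scatter plot and a box plot, so you can visually check which algorithms perform best.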

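AutoML leaderboard
Decision Tree report
An example of a Decision Tree summary with the tree visualization. For classification tasks, additional metrics are provided:
confusion matrix
threshold (optimized for binary classification tasks)
F1 score
Accuracy
Precision, Recall, MCC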
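The metrics listed above all derive from the four cells of the binary confusion matrix. The following sketch computes them in plain Python from hard 0/1 predictions; the function names and the toy labels are illustrative, and the package's report may compute or round these values differently.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and MCC from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Toy labels: 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred))
print(metrics)
```

For probabilistic models, the hard predictions would first be obtained by comparing each predicted probability with the chosen threshold.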

Decision Tree summary
LightGBM report
An example of a LightGBM summary:

LightGBM summary
Install from the PyPI repository:
pip install mljar-supervised
Install from source code:
git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
python setup.py install
Installation for development:
git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
virtualenv venv --python=python3.6
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements_dev.txt
Run in Docker:
FROM python:3.7-slim-buster
RUN apt-get update && apt-get -y update
RUN apt-get install -y build-essential python3-pip python3-dev
RUN pip3 -q install pip --upgrade
RUN pip3 install mljar-supervised jupyter
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]
References
[1] dtreeviz: https://github.com/parrt/dtreeviz
[2] Caruana paper: http://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf
[3] Golden features: https://supervised.mljar.com/features/golden_features/
[4] Feature selection: https://supervised.mljar.com/features/features_selection/
[5] AutoML modes: https://supervised.mljar.com/features/modes/
[6] Markdown report (AutoML leaderboard): https://github.com/mljar/mljar-examples/tree/master/Income_classification/AutoML_1#automl-leaderboard
[7] Xgboost Markdown report: https://github.com/mljar/mljar-examples/blob/master/Income_classification/AutoML_1/5_Default_Xgboost/README.md
[8] Decision Tree Markdown report: https://github.com/mljar/mljar-examples/blob/master/Income_classification/AutoML_1/2_DecisionTree/README.md
[9] Logistic Regression Markdown report: https://github.com/mljar/mljar-examples/blob/master/Income_classification/AutoML_1/3_Linear/README.md
[10] Income classification: https://github.com/mljar/mljar-examples/tree/master/Income_classification
[11] Iris classification: https://github.com/mljar/mljar-examples/tree/master/Iris_classification
[12] House price regression: https://github.com/mljar/mljar-examples/tree/master/House_price_regression
[13] mljar-supervised docs: https://supervised.mljar.com/
[14] Slack channel: https://mljar-supervised.slack.com/join/shared_invite/zt-gkhfsvhw-H6LMKxxV5adeTmn9V7nbZw#/