Python sklearn: a detailed guide to GridSearchCV() and fit(), with an introduction, a worked example, and usage

2022-02-10



Contents

How to use GridSearchCV() and fit()

GridSearchCV(): introduction and worked example


How to use GridSearchCV() and fit()

A function that wraps sklearn's grid-search hyperparameter tuning for ML models

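# Wrapper that tunes an ML model with sklearn's grid search
from sklearn.model_selection import GridSearchCV


def ModelC_GSCV(estimator, data_X, data_y, param_grid, **fit_params):
    # Grid search over param_grid; this step can be commented out once the best parameters are chosen
    print("search best params:")
    GSCV_model = GridSearchCV(estimator, param_grid, cv=10, scoring="f1", verbose=1)

    # Training: without GSCV_model, a freshly constructed XGBClassifier() could be fitted directly.
    # fit_params are forwarded to estimator.fit() on every fold. Which keywords are accepted
    # depends on the estimator and its version, for example with XGBoost:
    #   eval_set=[(data_X, data_y)]   validation set
    #   eval_metric="logloss"         auc or logloss for binary, mlogloss for multi-class
    #   early_stopping_rounds=10      stop after N rounds without improvement
    # Constructor settings such as objective='binary:logistic' or scale_pos_weight=49
    # are not fit() arguments; set them on the estimator itself.
    ModelC = GSCV_model.fit(data_X, data_y, **fit_params)

    # Report the best mean CV score and the parameters that produced it
    print("Best score: %f using params: %s" % (ModelC.best_score_, ModelC.best_params_))
    return ModelC.best_estimator_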
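
The forwarding of extra keyword arguments from GridSearchCV.fit() to the estimator's own fit() can be sketched as follows. This is only an illustration under stated assumptions: a plain sklearn LogisticRegression stands in for the XGBoost model, the data is synthetic (make_classification), and the weight of 9.0 for the minority class is an arbitrary choice. Array-like fit parameters whose length equals the number of samples, such as sample_weight, are split per fold together with X and y.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary data (assumption: roughly 10% positives)
X, y = make_classification(n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0)

# Up-weight the minority class; one weight per sample
sample_weight = np.where(y == 1, 9.0, 1.0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
# The extra keyword is passed on to LogisticRegression.fit() for every fold
search.fit(X, y, sample_weight=sample_weight)

best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Settings that belong to the estimator's constructor stay on the estimator; only arguments that the estimator's fit() accepts can be passed this way.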



GridSearchCV(): introduction and worked example


class GridSearchCV Found at: sklearn.model_selection._search

class GridSearchCV(BaseSearchCV):
    """Exhaustive search over specified parameter values for an estimator.
    Important members are fit, predict.
    GridSearchCV implements a "fit" and a "score" method. It also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used. The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
    Read more in the :ref:`User Guide <grid_search>`.




    Parameters
    ----------
    estimator : estimator object. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a ``score`` function, or ``scoring`` must be passed.
    
    param_grid : dict or list of dictionaries. Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
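
A small sketch of the list-of-dictionaries form of param_grid, assuming the iris data and an SVC as in the docstring example further down (the concrete values of C and gamma are illustrative). Each dictionary spans its own sub-grid, so gamma is only combined with the rbf kernel.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Two sub-grids: gamma is only searched together with the rbf kernel
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(iris.data, iris.target)

# One entry per evaluated combination: 2 linear + 4 rbf candidates
candidates = search.cv_results_["params"]
print(len(candidates))
for params in candidates:
    print(params)
```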
    
    scoring : str, callable, list/tuple or dict, default=None. A single str (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set. 
     For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
    NOTE that when using custom scorers, each scorer should return a  single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
    See :ref:`multimetric_grid_search` for an example.
    If None, the estimator's score method is used.
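
Multi-metric scoring can be sketched like this, assuming a decision tree on the iris data and the two built-in scorer names "accuracy" and "f1_macro" (all illustrative choices). With several metrics, refit has to name the one that decides best_params_.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    scoring=["accuracy", "f1_macro"],  # two built-in scorer names
    refit="f1_macro",                  # metric used to pick best_params_
    cv=5,
)
search.fit(X, y)

# Each metric gets its own columns, suffixed with the scorer name
print(search.cv_results_["mean_test_accuracy"])
print(search.cv_results_["mean_test_f1_macro"])
print(search.best_params_)
```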
    
    n_jobs : int, default=None.  Number of jobs to run in parallel.  ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.  ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.
    .. versionchanged:: v0.20.  `n_jobs` default changed from 1 to None
    
    pre_dispatch : int, or str, default=n_jobs. Controls the number of jobs that get dispatched during parallel  execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately  created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
    - An int, giving the exact number of total jobs that are spawned
    - A str, giving an expression as a function of n_jobs, as in '2*n_jobs'
    
    iid : bool, default=False.  If True, return the average score across folds, weighted by the number  of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is  the total loss per sample, and not the mean loss across the folds.
    .. deprecated:: 0.22. Parameter ``iid`` is deprecated in 0.22 and will be removed in 0.24
    
    cv : int, cross-validation generator or an iterable, default=None. Determines the cross-validation splitting strategy.  Possible inputs for cv are:
    - None, to use the default 5-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.
    For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.
    Refer :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.
    .. versionchanged:: 0.22. ``cv`` default value if None changed from 3-fold to 5-fold.
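
The two most common ways of setting cv, an integer and an explicit splitter, might look as follows. The classifier, the grid and the fold counts are assumptions made for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [3, 5]}

# Integer: for a classifier this means an unshuffled StratifiedKFold with 3 folds
by_int = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3).fit(X, y)

# Explicit splitter: shuffled, stratified folds with a fixed seed
splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
by_splitter = GridSearchCV(KNeighborsClassifier(), param_grid, cv=splitter).fit(X, y)

print(by_int.n_splits_, by_splitter.n_splits_)
```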
    
    refit : bool, str, or callable, default=True. Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a `str` denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
    Where there are considerations other than maximum score in choosing a best estimator, ``refit`` can be set to a function which returns the selected ``best_index_`` given ``cv_results_``. In that case, the ``best_estimator_`` and ``best_params_`` will be set according to the returned ``best_index_`` while the ``best_score_`` attribute will not be available.
    The refitted estimator is made available at the ``best_estimator_`` attribute and permits using ``predict`` directly on this ``GridSearchCV`` instance.
    Also for multiple metric evaluation, the attributes ``best_index_``, ``best_score_`` and ``best_params_`` will only be available if ``refit`` is set and all of them will be determined w.r.t this specific scorer.
    See ``scoring`` parameter to know more about multiple metric evaluation.
    .. versionchanged:: 0.20. Support for callable added.
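
A callable refit could be sketched as below. The selection rule (the shallowest tree whose mean score is within one standard deviation of the best) and the helper name shallowest_within_one_std are hypothetical, chosen only to show a consideration other than the maximum score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def shallowest_within_one_std(cv_results):
    """Return the index of the shallowest tree scoring within one std of the best."""
    means = np.asarray(cv_results["mean_test_score"])
    stds = np.asarray(cv_results["std_test_score"])
    best = int(np.argmax(means))
    threshold = means[best] - stds[best]
    depths = np.array([p["max_depth"] for p in cv_results["params"]])
    eligible = np.flatnonzero(means >= threshold)
    # Among the eligible candidates, prefer the smallest max_depth
    return int(eligible[np.argmin(depths[eligible])])


X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
    refit=shallowest_within_one_std,
)
search.fit(X, y)

# best_index_ and best_params_ follow the callable, not the raw maximum
print(search.best_index_, search.best_params_)
```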
    
    verbose : integer. Controls the verbosity: the higher, the more messages.
    
    error_score : 'raise' or numeric, default=np.nan. Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit  step, which will always raise the error.
    
    return_train_score : bool, default=False.  If ``False``, the ``cv_results_`` attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that  yield the best generalization performance.
    .. versionadded:: 0.19
    .. versionchanged:: 0.21. Default value was changed from ``True`` to ``False``
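
A short sketch of return_train_score=True, assuming a decision tree on the iris data (illustrative choices). Comparing the mean training score with the mean test score per candidate gives a rough view of over- or underfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 3, None]},
    cv=5,
    return_train_score=True,  # adds the *_train_score columns to cv_results_
)
search.fit(X, y)

results = search.cv_results_
for params, train, test in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    # A large gap between train and test score hints at overfitting
    print(params, round(train, 3), round(test, 3), "gap:", round(train - test, 3))
```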





    Examples
    --------
    >>> from sklearn import svm, datasets
    >>> from sklearn.model_selection import GridSearchCV
    >>> iris = datasets.load_iris()
    >>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
    >>> svc = svm.SVC()
    >>> clf = GridSearchCV(svc, parameters)
    >>> clf.fit(iris.data, iris.target)
    GridSearchCV(estimator=SVC(),
                 param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})
    >>> sorted(clf.cv_results_.keys())
    ['mean_fit_time', 'mean_score_time', 'mean_test_score',...
     'param_C', 'param_kernel', 'params',...
     'rank_test_score', 'split0_test_score',...
     'split2_test_score', ...
     'std_fit_time', 'std_score_time', 'std_test_score']
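
Continuing the docstring example, the fitted search object can be used like an ordinary estimator. A minimal sketch, assuming the default refit=True so that predict and score are delegated to best_estimator_:

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

# With refit=True the search object delegates to the refitted best estimator
print(clf.best_params_)
print(clf.predict(iris.data[:5]))
print(clf.score(iris.data, iris.target))
```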



    Attributes
    ----------
    cv_results_ : dict of numpy (masked) ndarrays. A dict with keys as column headers and values as columns, that can be imported into a pandas ``DataFrame``.
    
    For instance the below given table
    +------------+-----------+------------+-----------------+---+---------+
    |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...|
    +============+===========+============+=================+===+=========+
    |  'poly'    |     --    |      2     |       0.80      |...|    2    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'poly'    |     --    |      3     |       0.70      |...|    4    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.1   |     --     |       0.80      |...|    3    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.2   |     --     |       0.93      |...|    1    |
    +------------+-----------+------------+-----------------+---+---------+
        will be represented by a ``cv_results_`` dict of:: 
    {
    'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
    mask = [False False False False]...),
    'param_gamma': masked_array(data = [-- -- 0.1 0.2],
    mask = [ True  True False False]...),
    'param_degree': masked_array(data = [2.0 3.0 -- --],
    mask = [False False  True  True]...),
    'split0_test_score'  : [0.80, 0.70, 0.80, 0.93],
    'split1_test_score'  : [0.82, 0.50, 0.70, 0.78],
    'mean_test_score'    : [0.81, 0.60, 0.75, 0.85],
    'std_test_score'     : [0.01, 0.10, 0.05, 0.08],
    'rank_test_score'    : [2, 4, 3, 1],
    'split0_train_score' : [0.80, 0.92, 0.70, 0.93],
    'split1_train_score' : [0.82, 0.55, 0.70, 0.87],
    'mean_train_score'   : [0.81, 0.74, 0.70, 0.90],
    'std_train_score'    : [0.01, 0.19, 0.00, 0.03],
    'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
    'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
    'mean_score_time'    : [0.01, 0.06, 0.04, 0.04],
    'std_score_time'     : [0.00, 0.00, 0.00, 0.01],
    'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
    }
    
    NOTE
    
    The key ``'params'`` is used to store a list of parameter settings dicts for all the parameter candidates.
    The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and  ``std_score_time`` are all in seconds.
    For multi-metric evaluation, the scores for all the scorers are available in the ``cv_results_`` dict at the keys ending with that scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown above. ('split0_test_precision', 'mean_train_precision' etc.)
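
To read cv_results_ without any extra dependency, the columns can be walked in rank order. A sketch under the same assumptions as the docstring example (SVC on iris; the grid values are illustrative):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
search = GridSearchCV(svm.SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=3)
search.fit(iris.data, iris.target)

results = search.cv_results_
order = results["rank_test_score"].argsort()  # best-ranked candidate first
for i in order:
    print(
        results["rank_test_score"][i],
        results["params"][i],
        round(results["mean_test_score"][i], 3),
        round(results["std_test_score"][i], 3),
    )

# The param_* columns are masked arrays, one entry per candidate
print(results["param_kernel"])
```

The same dict can also be handed to a pandas DataFrame when pandas is available.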
    
    best_estimator_ : estimator. Estimator that was chosen by the search, i.e. estimator  which gave highest score (or smallest loss if specified) on the left out data. Not available if ``refit=False``.
    See ``refit`` parameter for more information on allowed values.
    
    best_score_ : float. Mean cross-validated score of the best_estimator. For multi-metric evaluation, this is present only if ``refit`` is specified. This attribute is not available if ``refit`` is a function.
    
    best_params_ : dict. Parameter setting that gave the best results on the hold out data. For multi-metric evaluation, this is present only if ``refit`` is specified.
    
    best_index_ : int. The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting. The dict at ``search.cv_results_['params'][search.best_index_]`` gives the parameter setting for the best model, that gives the highest mean score (``search.best_score_``).
    For multi-metric evaluation, this is present only if ``refit`` is specified.
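
How best_index_, best_params_ and best_score_ relate to cv_results_ can be checked directly. A minimal sketch, assuming a k-nearest-neighbours classifier on iris and single-metric scoring with the default refit=True:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}, cv=5)
search.fit(X, y)

i = search.best_index_
# Both lookups point at the same candidate
same_params = search.cv_results_["params"][i] == search.best_params_
same_score = search.cv_results_["mean_test_score"][i] == search.best_score_
print(same_params, same_score)
print(search.best_estimator_.get_params()["n_neighbors"])
```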
    
    scorer_ : function or a dict.  Scorer function used on the held out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated ``scoring`` dict which maps the scorer key to the scorer callable.
    
    n_splits_ : int. The number of cross-validation splits (folds/iterations).
    
    refit_time_ : float. Seconds used for refitting the best model on the whole dataset. This is present only if ``refit`` is not False.
       .. versionadded:: 0.20
    
    Notes
    -----
    The parameters selected are those that maximize the score of the left  out data, unless an explicit score is passed in which case it is used instead.
    If `n_jobs` was set to a value higher than one, the data is copied for  each point in the grid (and not `n_jobs` times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available.  A  workaround in this case is to set `pre_dispatch`. Then, the memory is copied only  `pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 * n_jobs`.
    
    See Also
    ---------
    :class:`ParameterGrid`:
    generates all the combinations of a hyperparameter grid.
    
    :func:`sklearn.model_selection.train_test_split`:
    utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for  its final evaluation.
    
    :func:`sklearn.metrics.make_scorer`:
    Make a scorer from a performance metric or loss function.
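
The two helpers listed above can be combined in one workflow: split off an evaluation set, search on the development set with a custom scorer, then score once on the held-out data. This is a sketch under assumptions: the breast cancer dataset, a decision tree, and an F2 score built with make_scorer(fbeta_score, beta=2) are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Development set for the search, evaluation set for the final check
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# F2 weights recall more heavily than precision
f2_scorer = make_scorer(fbeta_score, beta=2)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    scoring=f2_scorer,
    cv=5,
)
search.fit(X_dev, y_dev)

cv_f2 = search.best_score_
eval_f2 = search.score(X_eval, y_eval)  # scored with the same F2 scorer
print(search.best_params_, round(cv_f2, 3), round(eval_f2, 3))
```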
    
    """





    _required_parameters = ["estimator", "param_grid"]

    @_deprecate_positional_args
    def __init__(self, estimator, param_grid, *, scoring=None,
                 n_jobs=None, iid='deprecated', refit=True, cv=None,
                 verbose=0, pre_dispatch='2*n_jobs',
                 error_score=np.nan, return_train_score=False):
        super().__init__(estimator=estimator, scoring=scoring,
                         n_jobs=n_jobs, iid=iid, refit=refit, cv=cv, verbose=verbose,
                         pre_dispatch=pre_dispatch, error_score=error_score,
                         return_train_score=return_train_score)
        self.param_grid = param_grid
        _check_param_grid(param_grid)

    def _run_search(self, evaluate_candidates):
        """Search all candidates in param_grid"""
        evaluate_candidates(ParameterGrid(self.param_grid))
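
_run_search simply hands every combination produced by ParameterGrid to evaluate_candidates. What that iterable contains can be seen with a small sketch (the grid values are illustrative):

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({"kernel": ["linear", "rbf"], "C": [1, 10]})

# One dict per combination: 2 kernels x 2 values of C
print(len(grid))
for candidate in grid:
    print(candidate)
```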


