CodeWalk

Building ML Workflows with Scikit-learn Pipeline and ColumnTransformer

Author: 我是大山 · 2026-05-30 12:55

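Please explain the usage and advantages of Pipeline and ColumnTransformer in sklearn. How can numeric and categorical columns be handled together in one pipeline (standardizing numeric columns and one-hot encoding categorical columns)? Explain the difference between make_pipeline() and Pipeline(), and how to use the memory parameter of Pipeline to speed things up through caching.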

Answer

我是大山

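Pipeline: chains several steps into a single estimator, so that during cross-validation every transformer is fitted on the training folds only and no information leaks from the validation fold.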

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# make_pipeline names the steps automatically (lowercased class names)
pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

# Pipeline takes explicit (name, estimator) pairs
pipe = Pipeline([('scaler', StandardScaler()), ('clf', RandomForestClassifier())])
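To make the naming difference concrete, here is a small sketch that builds the same two pipelines and prints their step names. The variable names (auto_pipe, named_pipe) are only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the lowercased class name
auto_pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

# Pipeline uses whatever names you pass in
named_pipe = Pipeline([("scaler", StandardScaler()), ("clf", RandomForestClassifier())])

auto_names = [name for name, _ in auto_pipe.steps]
named_names = [name for name, _ in named_pipe.steps]

print(auto_names)   # ['standardscaler', 'randomforestclassifier']
print(named_names)  # ['scaler', 'clf']
```

The names matter later for parameter tuning: with make_pipeline the key would be randomforestclassifier__n_estimators, with the explicit version it is clf__n_estimators. Short explicit names are usually easier to read in a parameter grid.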

ColumnTransformer: applies a different preprocessing step to each group of columns:

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'salary']),
        ('cat', OneHotEncoder(drop='first'), ['sex', 'city']),
        ('bio', 'passthrough', ['name_len'])  # the string 'passthrough' forwards the column unchanged
    ],
    remainder='drop'  # columns not listed above are dropped
)

full_pipe = Pipeline([('prep', preprocessor), ('clf', RandomForestClassifier())])
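As a rough end-to-end check, the sketch below runs the same preprocessor on a tiny made-up DataFrame. The data values and the extra user_id column are invented purely for illustration; only the column names age, salary, sex, city and name_len come from the answer above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values); user_id is not listed in any transformer
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [3000.0, 4500.0, 8000.0, 7200.0, 5100.0, 3900.0],
    "sex": ["F", "M", "M", "F", "F", "M"],
    "city": ["Beijing", "Shanghai", "Shenzhen", "Beijing", "Shanghai", "Shenzhen"],
    "name_len": [2, 3, 2, 3, 2, 3],
    "user_id": [101, 102, 103, 104, 105, 106],
})
y = [0, 1, 1, 0, 0, 1]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "salary"]),
        ("cat", OneHotEncoder(drop="first"), ["sex", "city"]),
        ("bio", "passthrough", ["name_len"]),
    ],
    remainder="drop",
)

# 2 scaled columns + (2 - 1) + (3 - 1) one-hot columns + 1 passthrough column = 6
features = preprocessor.fit_transform(df)
print(features.shape)

full_pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
full_pipe.fit(df, y)
predictions = full_pipe.predict(df)
```

Because the columns are selected by name, the input has to be a pandas DataFrame here. With a plain NumPy array you would pass integer column indices instead.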

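memory caching: avoids refitting the transformer steps when their parameters and input data are unchanged (the final estimator is always refitted):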

from tempfile import mkdtemp
cached_pipe = Pipeline([('prep', preprocessor), ('clf', RandomForestClassifier())],
                        memory=mkdtemp())
cached_pipe.fit(X, y)  # the first fit fills the cache; later fits with the same transformer parameters and data read from it
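Here is a minimal sketch of the caching behaviour on synthetic data, under the assumption that only the classifier's parameters change between fits. The data, the scaler-only preprocessing and the n_estimators values are made up for the example. One detail worth knowing: when memory is set, the pipeline clones each transformer before fitting, so the fitted object has to be read from named_steps rather than from the instance you passed in.

```python
import shutil
from tempfile import mkdtemp

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

cache_dir = mkdtemp()
scaler = StandardScaler()
cached_pipe = Pipeline(
    [("scaler", scaler), ("clf", RandomForestClassifier(n_estimators=10, random_state=0))],
    memory=cache_dir,
)

try:
    cached_pipe.fit(X, y)  # the scaler is fitted and its result is written to the cache

    # Only the final estimator changes, so the scaler step can be served from the cache
    cached_pipe.set_params(clf__n_estimators=20)
    cached_pipe.fit(X, y)

    # The instance passed in was cloned and is itself never fitted
    original_is_fitted = hasattr(scaler, "mean_")
    fitted_in_pipe = hasattr(cached_pipe.named_steps["scaler"], "mean_")
    print(original_is_fitted, fitted_in_pipe)
finally:
    shutil.rmtree(cache_dir)  # remove the temporary cache directory
```

The gain is largest when the transformers are expensive and a search only varies parameters of later steps. For a cheap scaler like the one in this sketch, the cost of hashing the inputs and reading from disk can outweigh the saving.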

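Advantages

  • Prevents data leakage: the transformers are refitted on the training portion of each fold
  • Simplifies code: cross_val_score(full_pipe, X, y)
  • Supports unified hyperparameter tuning with GridSearchCV (parameter names use the format stepname__param)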
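The last two points can be sketched as follows. The synthetic data and the particular grid values are assumptions chosen only to keep the example small and fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The whole pipeline is refitted inside each fold, scaler included
scores = cross_val_score(pipe, X, y, cv=3)
print(scores.mean())

# Keys follow the stepname__param convention and can target any step
param_grid = {
    "scaler__with_mean": [True, False],
    "clf__n_estimators": [10, 30],
    "clf__max_depth": [2, None],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the preprocessing parameters sit in the same grid as the model parameters, a single search covers both, and each candidate is evaluated without the scaler ever seeing the validation fold.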