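Building ML Workflows with Scikit-learn Pipeline and ColumnTransformer
Explain the usage and advantages of Pipeline and ColumnTransformer in sklearn. How can numeric and categorical columns be handled together in one pipeline (standardizing numeric columns and one-hot encoding categorical columns)? Explain the difference between make_pipeline() and Pipeline(), and how to use the memory parameter of Pipeline for caching to speed things up.
Answer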
我是大山
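Pipeline: chains several steps into a single estimator. During cross-validation every transformer is refit on the training portion of each fold, so no information leaks from the validation data into preprocessing.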
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
# make_pipeline names each step automatically (the lowercased class name)
pipe = make_pipeline(StandardScaler(), RandomForestClassifier())
# Pipeline takes explicit (name, estimator) pairs
pipe = Pipeline([('scaler', StandardScaler()), ('clf', RandomForestClassifier())])
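To make the naming difference concrete, here is a minimal sketch that builds the same two steps both ways and prints the resulting step names. The variable names `auto_pipe` and `named_pipe` are only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps, built with each constructor.
auto_pipe = make_pipeline(StandardScaler(), RandomForestClassifier())
named_pipe = Pipeline([("scaler", StandardScaler()), ("clf", RandomForestClassifier())])

auto_names = [name for name, _ in auto_pipe.steps]
named_names = [name for name, _ in named_pipe.steps]
print(auto_names)   # ['standardscaler', 'randomforestclassifier']
print(named_names)  # ['scaler', 'clf']

# The step name is the prefix used when setting nested parameters.
auto_pipe.set_params(randomforestclassifier__n_estimators=50)
named_pipe.set_params(clf__n_estimators=50)
```

Explicit names are usually worth the extra typing once a parameter grid is involved, since `clf__n_estimators` stays valid even if the classifier class is swapped.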
ColumnTransformer: applies different preprocessing to different columns:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'salary']),
        ('cat', OneHotEncoder(drop='first'), ['sex', 'city']),
        ('bio', 'passthrough', ['name_len'])  # the string 'passthrough' passes the column through unchanged
    ],
    remainder='drop'  # columns not listed above are dropped
)
full_pipe = Pipeline([('prep', preprocessor), ('clf', RandomForestClassifier())])
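The preprocessing above can be tried end to end on a small table. The following is a sketch only: the DataFrame values, the extra `id` column, and the labels `y` are made up for illustration, and pandas is assumed to be installed because the columns are selected by name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; "id" is not listed in any transformer.
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [3000.0, 4500.0, 8000.0, 7200.0, 5100.0, 3900.0],
    "sex": ["F", "M", "M", "F", "F", "M"],
    "city": ["A", "B", "C", "A", "B", "C"],
    "name_len": [5, 7, 4, 6, 8, 5],
    "id": [1, 2, 3, 4, 5, 6],
})
y = [0, 0, 1, 1, 1, 0]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "salary"]),           # 2 output columns
        ("cat", OneHotEncoder(drop="first"), ["sex", "city"]),  # 1 + 2 output columns
        ("bio", "passthrough", ["name_len"]),                   # 1 output column
    ],
    remainder="drop",  # "id" is discarded
)
full_pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

full_pipe.fit(X, y)
preds = full_pipe.predict(X)

# Inspect what the fitted preprocessing step hands to the classifier.
Xt = full_pipe.named_steps["prep"].transform(X)
print(Xt.shape)  # (6, 6)
```

With `drop='first'`, `sex` contributes one indicator column and `city` two, which is why six columns reach the classifier.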
memory caching: avoids refitting the transformers:
from tempfile import mkdtemp
cached_pipe = Pipeline([('prep', preprocessor), ('clf', RandomForestClassifier())],
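                       memory=mkdtemp())
# First fit computes everything. Later fits with the same transformer parameters and
# data load the fitted transformers from the cache; the final estimator is always refit.
cached_pipe.fit(X, y)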
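A small self-contained sketch of the caching behaviour follows. The synthetic data from `make_classification` and the names `cache_dir` and `scaler` are assumptions made for the example. It also shows one side effect worth knowing: with `memory` set, the pipeline fits a clone of each transformer, so the fitted object has to be read from `named_steps` rather than from the instance that was passed in.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cache_dir = mkdtemp()
scaler = StandardScaler()
cached_pipe = Pipeline(
    [("scaler", scaler), ("clf", RandomForestClassifier(n_estimators=10, random_state=0))],
    memory=cache_dir,
)

cached_pipe.fit(X, y)  # fits the scaler and stores the result under cache_dir

# Only the final estimator changes, so the scaler is loaded from the cache
# on the next fit; the classifier itself is refit.
cached_pipe.set_params(clf__n_estimators=20)
cached_pipe.fit(X, y)

original_is_fitted = hasattr(scaler, "mean_")                           # clone was fitted, not this object
step_is_fitted = hasattr(cached_pipe.named_steps["scaler"], "mean_")   # fitted clone lives here
score = cached_pipe.score(X, y)

rmtree(cache_dir)  # remove the temporary cache directory when done
```

Caching pays off mainly when the transformers are expensive to fit and a search revisits the same transformer settings many times; for a cheap step like scaling the benefit is small.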
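Advantages:
- Prevents data leakage (transformers are refit on the training portion of each fold)
- Simplifies code: cross_val_score(full_pipe, X, y)
- Supports unified hyperparameter tuning with GridSearchCV (parameter names take the form stepname__param)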
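The last two points can be sketched together as below. The data is synthetic and the grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# The whole pipeline is cloned and refit inside each fold.
scores = cross_val_score(pipe, X, y, cv=3)

# Keys follow the stepname__param convention, so preprocessing and model
# hyperparameters are searched in one grid.
param_grid = {
    "scaler__with_mean": [True, False],
    "clf__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
print(best_params)
```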