Feature Engineering Recipes


Summary: "Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models"

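[If you are interested in algorithms, math, or computer science, feel free to follow me and read more original articles]
My website: 潮汐朝夕的生活实验室
My WeChat official account: 算法题刷刷
My Zhihu: 潮汐朝夕
My GitHub: FennelDumplings
My LeetCode: FennelDumplings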


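A cookbook on feature engineering, published in 2020. The content is all familiar ground: missing values, handling categorical variables, text feature extraction, feature extraction from transactional and time series data, feature combination, feature transformation, and working with dates and times. Its code repository is the more valuable part: the code for the 70 recipes can be used as a manual to look things up in.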

The book mainly covers the following:

  • Simplify your feature engineering pipelines with powerful Python packages
  • Get to grips with imputing missing values
  • Encode categorical variables with a wide set of techniques
  • Extract insights from text quickly and effortlessly
  • Develop features from transactional data and time series data
  • Derive new features by combining existing variables
  • Understand how to transform, discretize, and scale your variables
  • Create informative variables from date and time

Foreseeing Variable Problems When Building ML Models

indetifying-variables-types
Quantifying-missing-data
Determining-cardinality
Pinpointing-rare-categories
Identifying-a-linear-relationship
Identifying-a-normal-distribution
Distinguishing-variable-distribution
Highlighting-outliers
Comparing-feature-magnitude
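
As a quick taste of what these checks look like, here is a minimal sketch of three of them (missing data, cardinality, rare categories). It uses only the standard library rather than the pandas code in the book's repository; the toy column and the 0.2 rarity threshold are my own assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy column; None marks a missing value.
colours = ["red", "blue", "red", None, "green", "red", "blue", None, "red", "red"]

# Quantifying missing data: fraction of entries that are missing.
missing_fraction = sum(v is None for v in colours) / len(colours)

# Determining cardinality: number of distinct non-missing categories.
observed = [v for v in colours if v is not None]
cardinality = len(set(observed))

# Pinpointing rare categories: relative frequency below an arbitrary threshold.
RARE_THRESHOLD = 0.2  # assumption chosen for this example
frequencies = {cat: n / len(observed) for cat, n in Counter(observed).items()}
rare_categories = sorted(cat for cat, f in frequencies.items() if f < RARE_THRESHOLD)

print(missing_fraction, cardinality, rare_categories)
```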

Imputing Missing Data

Removing-observations-with-missing-data
Performing-mean-or-median-imputation
Implementing-mode-or-frequent-category-imputation
Replacing-missing-values-by-an-arbitrary-number
Capturing-missing-values-in-a-bespoke-category
Replacing-missing-values-by-a-value-at-the-end-of-the-distribution
Implementing-random-sample-imputation
Adding-a-missing-value-indicator-variable
Performing-multivariate-imputation-by-chained-equations-MICE
Assembling-an-imputation-pipeline-with-Scikit-learn
Assembling-an-imputation-pipeline-with-Feature-Engine
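
Two of these recipes combine naturally: median imputation plus a missing-value indicator. The sketch below is a library-free illustration, not the book's scikit-learn or Feature-engine code; `fit_median` and `impute_with_indicator` are hypothetical helper names. The point to notice is that the median is learned from the training data only and then reused on new data.

```python
from statistics import median

def fit_median(values):
    """Learn the median from the non-missing training values."""
    return median(v for v in values if v is not None)

def impute_with_indicator(values, fill_value):
    """Replace None with fill_value and return a 0/1 missing indicator alongside."""
    imputed = [fill_value if v is None else v for v in values]
    indicator = [int(v is None) for v in values]
    return imputed, indicator

# Hypothetical toy data.
train_age = [22, None, 35, 58, None, 41]
test_age = [None, 30]

age_median = fit_median(train_age)  # learned on the training set only
train_imputed, train_missing = impute_with_indicator(train_age, age_median)
test_imputed, test_missing = impute_with_indicator(test_age, age_median)
```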

Encoding Categorical Variables

One-hot-encoding
One-hot-encoding-top-categories
Replacing-categories-by-ordinal-numbers
replacing-categories-by-counts-frequency
ordered-ordinal-encoding
target-mean-encoding
weight-of-evidence
grouping-rare-categories
Binary-Encoding
Feature-Hashing
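
To make three of these encodings concrete (one-hot, count, and target mean), here is a small standard-library sketch. The function names and the toy `city` / `bought` data are assumptions of mine; in practice the mappings would be fitted on a training set and applied to unseen data, and target mean encoding usually needs some care against overfitting.

```python
from collections import Counter, defaultdict

def one_hot(value, levels):
    """Return a 0/1 vector with a 1 at the position of `value` in `levels`."""
    return [int(value == level) for level in levels]

def fit_count_encoding(categories):
    """Map each category to the number of times it appears."""
    return dict(Counter(categories))

def fit_target_mean_encoding(categories, target):
    """Map each category to the mean of the target within that category."""
    totals = defaultdict(float)
    counts = Counter(categories)
    for category, y in zip(categories, target):
        totals[category] += y
    return {category: totals[category] / counts[category] for category in counts}

# Hypothetical toy data: a categorical variable and a binary target.
city = ["London", "Paris", "London", "Rome", "Paris", "London"]
bought = [1, 0, 1, 0, 1, 0]

levels = sorted(set(city))
count_map = fit_count_encoding(city)
mean_map = fit_target_mean_encoding(city, bought)

city_one_hot = [one_hot(c, levels) for c in city]
city_count = [count_map[c] for c in city]
city_target_mean = [mean_map[c] for c in city]
```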

Transforming Numerical Variables

logarithmic-transformation
reciprocal-transformation
square-cube-root
power-transformation
Box-Cox-transformation
Yeo-Johnson-transformation
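
For reference, the Box-Cox and Yeo-Johnson transformations can be written out directly from their standard definitions. This is only a sketch with a fixed, hand-picked lambda; real implementations estimate lambda from the data (typically by maximum likelihood), which I leave out here.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson transform; unlike Box-Cox it also accepts zero and negatives."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)

# Hypothetical skewed values; lambda = 0.5 is an arbitrary choice for the example.
values = [1.0, 4.0, 9.0, 100.0]
log_values = [math.log(v) for v in values]
reciprocal_values = [1 / v for v in values]
sqrt_values = [math.sqrt(v) for v in values]
box_cox_values = [box_cox(v, 0.5) for v in values]
```

With lambda = 1 Yeo-Johnson leaves the value unchanged, and with lambda = 0 Box-Cox reduces to the plain logarithm.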

Performing Variable Discretization

Equal-width-discretisation
Equal-frequency-discretisation
Discretisation-plus-categorical-encoding
Arbitrary-interval-discretisation
Discretisation-Kmeans
Discretisation-with-decision-trees
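
The difference between equal-width and equal-frequency binning is easiest to see on a skewed variable. A rough standard-library sketch follows; the helper names are hypothetical, the intervals are treated as right-closed, and the quantile method ("inclusive") is simply one reasonable choice.

```python
from bisect import bisect_left
from statistics import quantiles

def equal_width_edges(values, n_bins):
    """Inner bin edges that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def equal_frequency_edges(values, n_bins):
    """Inner bin edges at the quantiles, so bins hold roughly equal counts."""
    return quantiles(values, n=n_bins, method="inclusive")

def assign_bins(values, inner_edges):
    """Bin index per value; a value equal to an edge falls in the lower bin."""
    return [bisect_left(inner_edges, v) for v in values]

# Hypothetical skewed variable with one large value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

width_bins = assign_bins(values, equal_width_edges(values, 3))
frequency_bins = assign_bins(values, equal_frequency_edges(values, 3))
```

On this data the equal-width bins put nine of the ten values into the first bin, while the equal-frequency bins spread them 4 / 3 / 3.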

Working with Outliers

Outlier-Trimming
Winsorisation
Capping
Zero-coding
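
Trimming and capping differ only in what happens to the values beyond the limits: trimming drops them, capping pulls them back to the limit. Below is a minimal sketch using the common IQR proximity rule (1.5 times the IQR) to set the limits; the rule, the helper name `iqr_limits`, and the toy data are my assumptions, and the book's recipes may use other boundaries such as percentiles.

```python
from statistics import quantiles

def iqr_limits(values, fold=1.5):
    """Lower and upper limits from the inter-quartile range proximity rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

# Hypothetical data with one obvious outlier.
data = [10, 12, 11, 13, 12, 11, 14, 95]
lower, upper = iqr_limits(data)

# Trimming: remove observations outside the limits.
trimmed = [v for v in data if lower <= v <= upper]

# Capping: keep every observation but clip it to the limits.
capped = [min(max(v, lower), upper) for v in data]
```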

Deriving Features from Dates and Time Variables

Extracting-date-and-time-part
Deriving-year-month-semester-quarter
Creating-representations-of-week-day
Extracting-time-parts
Capturing-elapsed-time-between-2-variables
different-time-zones
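
Most of these date and time features are one-liners. Here is a small sketch with the standard `datetime` module instead of pandas; the timestamps, the helper `date_parts`, and the fixed UTC offsets are made up for the example.

```python
from datetime import datetime, timedelta, timezone

def date_parts(ts):
    """Derive calendar and clock features from a single timestamp."""
    return {
        "year": ts.year,
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        "semester": 1 if ts.month <= 6 else 2,
        "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
        "is_weekend": int(ts.weekday() >= 5),
        "hour": ts.hour,
    }

# Hypothetical order and delivery timestamps.
ordered = datetime(2020, 1, 25, 14, 30)
delivered = datetime(2020, 1, 28, 9, 0)

parts = date_parts(ordered)

# Elapsed time between two variables, expressed in hours.
elapsed_hours = (delivered - ordered).total_seconds() / 3600

# Different time zones: convert aware timestamps to UTC before comparing.
in_cet = datetime(2020, 1, 25, 14, 30, tzinfo=timezone(timedelta(hours=1)))
in_est = datetime(2020, 1, 25, 8, 30, tzinfo=timezone(timedelta(hours=-5)))
cet_as_utc = in_cet.astimezone(timezone.utc)
est_as_utc = in_est.astimezone(timezone.utc)
```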

Performing Feature Scaling

Standardization
Mean-normalization
MinMaxScaling
Maximum-Absolute-Scaling
Robust-Scaling
Scaling-to-unit-length
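
The scaling methods listed here differ only in what is subtracted and what is divided by. A compact sketch of several of them, written out by hand with the standard library (toy values are mine; I use the population standard deviation for standardization, which is an assumption about the convention):

```python
import math
from statistics import mean, pstdev

# Hypothetical single feature.
values = [2.0, 4.0, 6.0, 8.0]
mu, sigma = mean(values), pstdev(values)
lo, hi = min(values), max(values)

# Standardization: zero mean, unit variance.
standardized = [(v - mu) / sigma for v in values]

# Mean normalization: centre on the mean, divide by the value range.
mean_normalized = [(v - mu) / (hi - lo) for v in values]

# Min-max scaling: squeeze into [0, 1].
min_max = [(v - lo) / (hi - lo) for v in values]

# Maximum absolute scaling: divide by the largest absolute value.
max_abs = [v / max(abs(v) for v in values) for v in values]

# Scaling to unit length works per observation (row), not per feature.
row = [3.0, 4.0]
norm = math.sqrt(sum(x * x for x in row))
unit_row = [x / norm for x in row]
```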

Applying Mathematical Computations to Features

Add-Multiply-Features
Substraction-Quotient-Features
PolynomialExpansion
Combining-features-with-trees
PCA
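
Combining features arithmetically and expanding them polynomially can both be sketched in a few lines. The helper `polynomial_expansion` below is hypothetical and leaves out the constant (bias) term; the feature values are invented.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_expansion(row, degree=2):
    """All products of the features up to the given degree, without a bias term."""
    features = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(row, d):
            features.append(prod(combo))
    return features

# Hypothetical observation with two numerical features.
row = [2, 3]

# Simple combinations: sum, product, difference, quotient.
total = sum(row)
product = prod(row)
difference = row[0] - row[1]
quotient = row[0] / row[1]

# Degree-2 expansion gives x1, x2, x1^2, x1*x2, x2^2.
expanded = polynomial_expansion(row, degree=2)
```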

Creating Features with Transactional and Time Series Data

Aggregating-transactional-data-with-math-operations
aggregate-transactional-data-in-time-windows
Identifying-and-counting-local-maxima-and-minima
Calculating-distance-between-events
Creating-features-with-featuretools
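
The idea behind these recipes is to collapse many rows per entity into one row of summary features. Here is a hand-rolled sketch of aggregation and local-maxima counting with the standard library; it does not use Featuretools, and the helper names, the customer IDs, and the series are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_customer(transactions):
    """Summarise (customer, amount) rows into one feature dict per customer."""
    grouped = defaultdict(list)
    for customer, amount in transactions:
        grouped[customer].append(amount)
    return {
        customer: {
            "count": len(amounts),
            "sum": sum(amounts),
            "mean": mean(amounts),
            "max": max(amounts),
        }
        for customer, amounts in grouped.items()
    }

def local_maxima(series):
    """Indices of points strictly greater than both neighbours."""
    return [
        i for i in range(1, len(series) - 1)
        if series[i - 1] < series[i] > series[i + 1]
    ]

# Hypothetical transactional data and time series.
transactions = [("A", 10.0), ("B", 5.0), ("A", 30.0), ("A", 20.0), ("B", 15.0)]
features = aggregate_by_customer(transactions)

series = [1, 3, 2, 5, 4, 4, 6, 1]
peaks = local_maxima(series)

# Distance between events: gaps (in time steps) between consecutive peaks.
gaps = [b - a for a, b in zip(peaks, peaks[1:])]
```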

Extracting Features from Text Variables

Capturing-text-complexity-in-features
Sentence-tokenization
bag-of-words
TFIDF
cleaning-text
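
As a rough illustration of the simpler text recipes (cleaning, complexity features, bag-of-words), here is a standard-library sketch. The regex-based tokenization and sentence splitting are deliberately naive assumptions of mine; a proper tokenizer handles far more cases.

```python
import re
from collections import Counter

# Hypothetical text sample.
text = "The cat sat. The cat ran away!"

# Cleaning: lowercase and keep alphabetic tokens only (drops punctuation).
tokens = re.findall(r"[a-z]+", text.lower())

# Text complexity features.
n_chars = len(text)
n_words = len(tokens)
n_unique_words = len(set(tokens))
unique_word_ratio = n_unique_words / n_words
avg_word_length = sum(len(t) for t in tokens) / n_words

# Naive sentence count: runs of sentence-ending punctuation.
n_sentences = len(re.findall(r"[.!?]+", text))

# Bag-of-words: token counts for this document.
bag_of_words = Counter(tokens)
```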
