Feature Engineering Made Easy (《特征工程入门与实践》)


2021-03-04

Abstract: Feature Engineering Made Easy (《特征工程入门与实践》)

My website: 潮汐朝夕的生活实验室
My WeChat official account: 算法题刷刷
My Zhihu: 潮汐朝夕
My GitHub: FennelDumplings
My LeetCode: FennelDumplings

Chapter 1: Introduction to Feature Engineering

The structure of data
Quantitative versus qualitative data
The four levels of data
- The nominal level
- The ordinal level
- The interval level
- The ratio level
Recap of the levels of data

Identifying missing values in data
Dealing with missing values in a dataset
- Removing harmful rows of data
- Imputing the missing values in data
- Imputing values in a machine learning pipeline
Standardization and normalization
- Z-score standardization
- The min-max scaling method
- The row normalization method
- Putting it all together
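
The three rescaling methods listed above can be sketched in plain Python. This is only an illustration of the formulas, assuming a toy `heights` column that is not from the book; in practice scikit-learn's `StandardScaler`, `MinMaxScaler` and `Normalizer` would normally be used instead.

```python
import math

def z_score(column):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    # Assumes the column is not constant (std > 0).
    return [(x - mean) / std for x in column]

def min_max(column):
    """Rescale a column linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(column), max(column)
    # Assumes the column is not constant (hi > lo).
    return [(x - lo) / (hi - lo) for x in column]

def row_normalize(row):
    """Rescale one row (one observation) to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in row))
    # Assumes the row is not all zeros.
    return [x / norm for x in row]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical toy column

print(z_score(heights))            # centred on 0, unit variance
print(min_max(heights))            # values between 0 and 1
print(row_normalize([3.0, 4.0]))   # a row with length 1
```

Note that z-score and min-max scaling work column by column, while row normalization works on each observation separately.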

Examining our dataset
Imputing categorical features
- Custom imputers
- Custom category imputer
- Custom quantitative imputer
Encoding categorical variables
- Encoding at the nominal level
- Encoding at the ordinal level
- Bucketing continuous features into categories
Extending numerical features
- Activity recognition from the Single Chest-Mounted Accelerometer dataset
- Polynomial features
Text-specific feature construction
- Bag of words representation
- CountVectorizer
- The Tf-idf vectorizer
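
As a rough illustration of the bag-of-words and tf-idf ideas listed above, here is a minimal sketch in plain Python. The three-document corpus is hypothetical, and the weighting is the textbook `tf * log(N / df)` form; scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the book.
docs = ["the cat sat", "the dog sat", "the cat ate the fish"]

tokenized = [d.split() for d in docs]
# The vocabulary fixes the order of the feature columns.
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Weight term frequency by the log inverse document frequency."""
    n_docs = len(tokenized)
    counts = Counter(tokens)
    vector = []
    for w in vocab:
        df = sum(1 for doc in tokenized if w in doc)  # documents containing w
        idf = math.log(n_docs / df)
        vector.append(counts[w] / len(tokens) * idf)
    return vector

print(vocab)
print(bag_of_words(tokenized[2]))
print(tf_idf(tokenized[2]))
```

A word such as "the", which occurs in every document, gets an idf of `log(1) = 0` and so carries no weight, while rarer words such as "fish" are emphasized.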

Dimension reduction: feature transformations versus feature selection versus feature construction
Principal Component Analysis
- How centering and scaling data affects PCA
- A deeper look into the principal components
Linear Discriminant Analysis

Parametric assumptions of data
- Non-parametric fallacy
Restricted Boltzmann Machines
- Not necessarily dimension reduction
- The graph of a Restricted Boltzmann Machine
- The restriction of a Boltzmann Machine
- Reconstructing the data
The BernoulliRBM
Learning text features: word vectorizations
- Word embeddings
- Two approaches to word embeddings: Word2vec and GloVe
- Word2vec: another shallow neural network
- The gensim package for creating Word2vec embeddings
- Application of word embeddings: information retrieval
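
To make the information-retrieval application of word embeddings a little more concrete, the sketch below ranks documents by cosine similarity between averaged word vectors. Everything in it is an assumption for illustration: the 2-dimensional `embeddings` table is made up (real vectors would come from a trained Word2vec or GloVe model, for example via gensim), and `embed` and `search` are hypothetical helper names, not the book's code.

```python
import math

# Hypothetical 2-d "embeddings"; real ones have hundreds of dimensions
# and are learned from a large corpus.
embeddings = {
    "soccer": [0.9, 0.1],
    "goal": [0.8, 0.2],
    "stock": [0.1, 0.9],
    "market": [0.2, 0.8],
}

def embed(text):
    """Represent a text as the average of its known word vectors."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    # Assumes at least one word of the text is in the embedding table.
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

documents = ["soccer goal", "stock market"]  # hypothetical document titles

def search(query):
    """Return the document whose averaged vector is closest to the query."""
    q = embed(query)
    return max(documents, key=lambda d: cosine(q, embed(d)))

print(search("goal"))
print(search("market"))
```

Because similarity is computed in the embedding space, a query can match a document that shares related words rather than only identical ones, which is the main attraction over a plain bag-of-words lookup.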