Encoding a discrete feature falls into two cases:
1. The values have no inherent order, e.g. color: [red, blue]. Use one-hot encoding.
2. The values have an inherent order, e.g. size: [X, XL, XXL]. Map them to integers, e.g. {X: 1, XL: 2, XXL: 3}.
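The two cases can be sketched in plain Python, without any library. The helper names `ordinal_encode` and `one_hot_encode` are illustrative assumptions, not functions from pandas or scikit-learn.

```python
def ordinal_encode(values, order):
    """Map ordered categories to 1, 2, 3, ... following `order`."""
    mapping = {category: rank for rank, category in enumerate(order, start=1)}
    return [mapping[v] for v in values]


def one_hot_encode(values):
    """Return (categories, rows): one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Ordered feature: the integer codes preserve X < XL < XXL.
sizes = ordinal_encode(['XL', 'X', 'XXL'], order=['X', 'XL', 'XXL'])

# Unordered feature: each color gets its own indicator column.
color_names, color_rows = one_hot_encode(['red', 'blue', 'red'])
```

One-hot encoding avoids implying that, say, red < blue, which an integer mapping would suggest to most models.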
In [90]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
np.set_printoptions(precision=4)
  
In [91]:
            
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']
df
Out[91]:
 
| | color | size | prize | class label |
| 0 | green | M | 10.1 | class1 |
| 1 | red | L | 13.5 | class2 |
| 2 | blue | XL | 15.3 | class1 |
 
    
  
In [92]:
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
df
         
         
   
Out[92]:
 
      
| | color | size | prize | class label |
| 0 | green | 1 | 10.1 | class1 |
| 1 | red | 2 | 13.5 | class2 |
| 2 | blue | 3 | 15.3 | class1 |
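Two details of `Series.map` with a dict are worth a quick sketch: values missing from the mapping become NaN, and the mapping can be inverted to recover the original labels. The names `sizes`, `encoded`, `inv_size_mapping` and the extra value 'S' are assumptions added for illustration.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
sizes = pd.Series(['M', 'L', 'XL', 'S'])  # 'S' is deliberately not in the mapping

encoded = sizes.map(size_mapping)  # the unmapped 'S' becomes NaN
missing = encoded.isna()           # flags rows that need attention

# Reverse lookup: integer code back to the original size label.
inv_size_mapping = {code: label for label, code in size_mapping.items()}
decoded = encoded.dropna().astype(int).map(inv_size_mapping)
```

Checking `missing.any()` after mapping is a cheap way to catch categories that the mapping does not cover.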
 
    
            
In [93]:
# -----------------------------------------------
# Use pd.get_dummies() for one-hot encoding
pd.get_dummies(df)
         
         
   
Out[93]:
| | size | prize | color_blue | color_green | color_red | class label_class1 | class label_class2 |
| 0 | 1 | 10.1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 2 | 13.5 | 0 | 0 | 1 | 0 | 1 |
| 2 | 3 | 15.3 | 1 | 0 | 0 | 1 | 0 |
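`pd.get_dummies(df)` encodes every string column, including the class label. If only the features should be encoded, the `columns` argument can restrict it. This is a minimal sketch; `df_demo` is a stand-in for the frame above, and `dtype=int` is passed on the assumption that a newer pandas version is in use, where the indicator columns would otherwise be booleans.

```python
import pandas as pd

df_demo = pd.DataFrame(
    [['green', 1, 10.1, 'class1'],
     ['red', 2, 13.5, 'class2'],
     ['blue', 3, 15.3, 'class1']],
    columns=['color', 'size', 'prize', 'class label'])

# Encode only 'color'; 'class label' stays a single column.
# dtype=int keeps the 0/1 output shown in the table above.
encoded = pd.get_dummies(df_demo, columns=['color'], dtype=int)
encoded_columns = list(encoded.columns)
```

The untouched columns come first, followed by the new `color_*` indicator columns.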
 
    
  
In [94]:
 
     
 
            
df
   
Out[94]:
 
      
| | color | size | prize | class label |
| 0 | green | 1 | 10.1 | class1 |
| 1 | red | 2 | 13.5 | class2 |
| 2 | blue | 3 | 15.3 | class1 |
 
    
  
In [95]:
        
       
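# -----------------------------------------------
# Use sklearn.feature_extraction.DictVectorizer
# (df.ix was removed in pandas 1.0, so df.iloc is used here)
feature_list = []
label_list = []
for row in range(len(df)):
    # The last column holds the class label
    label_list.append(df.iloc[row, -1])
    # Every other column goes into one feature dict per row
    rowDict = {}
    for i in range(len(df.columns) - 1):
        rowDict[df.columns[i]] = df.iloc[row, i]
    feature_list.append(rowDict)
feature_list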
Out[95]:
 
      
[{'color': 'green', 'prize': 10.1, 'size': 1},
 {'color': 'red', 'prize': 13.5, 'size': 2},
 {'color': 'blue', 'prize': 15.300000000000001, 'size': 3}]    
  
In [96]:
 
            
label_list 
Out[96]:
 
      
['class1', 'class2', 'class1']    
  
In [97]:
            
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
# DictVectorizer.fit_transform() takes a list of dicts
dummy_x = vec.fit_transform(feature_list).toarray()
dummy_x
   
Out[97]:
 
array([[  0. ,   1. ,   0. ,  10.1,   1. ],
       [  0. ,   0. ,   1. ,  13.5,   2. ],
       [  1. ,   0. ,   0. ,  15.3,   3. ]])
    
  
In [98]:
 
from sklearn import preprocessing
label_bin = preprocessing.LabelBinarizer()
# preprocessing.LabelBinarizer.fit_transform() takes a list of labels
dummy_y = label_bin.fit_transform(label_list)
dummy_y
         
   
Out[98]:
 
array([[0],
       [1],
       [0]])
  
In [99]:
 
            
# Test what happens when there are more than two label classes.
# .loc assigns in place; the chained form df['class label'][2] = 'class3'
# triggers pandas' SettingWithCopyWarning.
df.loc[2, 'class label'] = 'class3'
df
Out[99]:
 
      
| | color | size | prize | class label |
| 0 | green | 1 | 10.1 | class1 |
| 1 | red | 2 | 13.5 | class2 |
| 2 | blue | 3 | 15.3 | class3 |
 
    
  
In [100]:
# Rebuild the feature dicts and labels from the updated frame
# (df.ix was removed in pandas 1.0, so df.iloc is used here)
feature_list = []
label_list = []
for row in range(len(df)):
    label_list.append(df.iloc[row, -1])
    rowDict = {}
    for i in range(len(df.columns) - 1):
        rowDict[df.columns[i]] = df.iloc[row, i]
    feature_list.append(rowDict)
dummy_y = label_bin.fit_transform(label_list)
dummy_y
 
Out[100]:
 
      
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]]) 
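The difference between the two outputs (one column for two classes, one column per class for three) can be reproduced in isolation. The sketch below also uses `classes_` and `inverse_transform` to map the indicator rows back to labels; `label_bin_demo` and the other names are stand-ins chosen for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

label_bin_demo = LabelBinarizer()

# Two classes: a single 0/1 column is enough.
binary_y = label_bin_demo.fit_transform(['class1', 'class2', 'class1'])
binary_shape = binary_y.shape

# Three classes: one indicator column per class.
multi_y = label_bin_demo.fit_transform(['class1', 'class2', 'class3'])
class_order = list(label_bin_demo.classes_)

# inverse_transform maps indicator rows back to the original labels.
recovered = list(label_bin_demo.inverse_transform(multi_y))
```

`classes_` gives the label that each output column stands for, in column order.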
  
In [ ]:
 
 
            
# Conclusion: the two approaches give essentially the same result, but pd.get_dummies is more convenient to use.
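As a rough check of that conclusion, the sketch below encodes the same three feature rows both ways and compares the resulting matrices. The frame `features` and the explicit column reordering are assumptions made for this comparison; `dtype=float` is passed so both outputs are numeric.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

features = pd.DataFrame({'color': ['green', 'red', 'blue'],
                         'size': [1, 2, 3],
                         'prize': [10.1, 13.5, 15.3]})

# pandas route: indicator columns are named '<column>_<value>'.
via_pandas = pd.get_dummies(features, dtype=float)

# scikit-learn route: indicator columns are named '<column>=<value>'.
vec = DictVectorizer(sparse=False)
via_sklearn = vec.fit_transform(features.to_dict(orient='records'))

# Align the pandas columns to DictVectorizer's alphabetical column order.
aligned = via_pandas[['color_blue', 'color_green', 'color_red', 'prize', 'size']]
same = np.allclose(aligned.to_numpy(), via_sklearn)
```

Only the column order and naming differ; the encoded values match.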









