Encoder
Example
To use the Encoder class, simply call it as follows:
import pandas as pd

from hivecore.preprocess import Encoder

# Define the data to encode. Polars or PySpark DataFrames are also accepted.
pandas_data = pd.DataFrame({
    'binary_col': pd.Series([0, 1], dtype='int64').sample(n=1000, replace=True).reset_index(drop=True),
    'ordinal_col': pd.Series(range(0, 10), dtype='int64').sample(n=1000, replace=True).reset_index(drop=True),
    'one_hot_col': pd.Series(range(0, 5), dtype='int64').sample(n=1000, replace=True).reset_index(drop=True),
    'label_col': pd.Series(range(0, 100), dtype='int64').sample(n=1000, replace=True).reset_index(drop=True),
    'color_col': pd.Series(['red', 'green', 'blue', 'yellow']).sample(n=1000, replace=True).reset_index(drop=True),
    'target_col': pd.Series([0, 1], dtype='int64').sample(n=1000, replace=True).reset_index(drop=True),
})
# If no columns are specified, the encoder tries to apply the method to every column.
# If no method is provided, it tries to select the best method for each column's data.
print(Encoder(pandas_data).encode(method='frequency', columns=['color_col']))
print('-' * 100)
# If target_col is not given, the last column of the dataframe is used as the target by default.
print(Encoder(pandas_data).encode(method='target', columns=['color_col'], target_col='target_col'))
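Because the columns above are sampled at random, every run produces different data. If a repeatable example is needed, one option is to pass pandas' `random_state` argument to `sample`. The sketch below uses only pandas; the helper name `sample_column` and the seed value are hypothetical, chosen for illustration, and are not part of hivecore.

```python
import pandas as pd

SEED = 42  # arbitrary seed chosen for illustration


def sample_column(values, n=1000, seed=SEED):
    """Sample n values with replacement, reproducibly (hypothetical helper)."""
    return (
        pd.Series(values)
        .sample(n=n, replace=True, random_state=seed)
        .reset_index(drop=True)
    )


def build_frame():
    # Use a different seed per column so the columns are not sampled identically.
    return pd.DataFrame({
        'color_col': sample_column(['red', 'green', 'blue', 'yellow'], seed=SEED),
        'target_col': sample_column([0, 1], seed=SEED + 1),
    })


reproducible_data = build_frame()
# Rebuilding the frame with the same seeds yields identical data.
print(reproducible_data.equals(build_frame()))
```

A frame built this way can be passed to `Encoder` in the same way as `pandas_data`.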
Expected Output
The output should look similar to the following. Exact values will vary because the example uses randomly sampled synthetic data.
Frequency encoding (method='frequency'):
binary_col | ordinal_col | one_hot_col | label_col | color_col | target_col | color_col_encoded |
1 | 5 | 1 | 49 | yellow | 0 | 0.270 |
1 | 4 | 2 | 66 | red | 1 | 0.238 |
1 | 9 | 3 | 19 | yellow | 0 | 0.270 |
1 | 7 | 2 | 7 | red | 1 | 0.238 |
0 | 9 | 2 | 25 | red | 0 | 0.238 |
… | … | … | … | … | … | … |
1 | 8 | 2 | 85 | yellow | 1 | 0.270 |
1 | 8 | 1 | 57 | green | 0 | 0.255 |
0 | 4 | 4 | 0 | blue | 0 | 0.237 |
1 | 6 | 2 | 78 | blue | 0 | 0.237 |
0 | 5 | 4 | 9 | blue | 1 | 0.237 |

Target encoding (method='target'):
binary_col | ordinal_col | one_hot_col | label_col | color_col | target_col | color_col_encoded |
1 | 5 | 1 | 49 | yellow | 0 | 0.518519 |
1 | 4 | 2 | 66 | red | 1 | 0.424370 |
1 | 9 | 3 | 19 | yellow | 0 | 0.518519 |
1 | 7 | 2 | 7 | red | 1 | 0.424370 |
0 | 9 | 2 | 25 | red | 0 | 0.424370 |
… | … | … | … | … | … | … |
1 | 8 | 2 | 85 | yellow | 1 | 0.518519 |
1 | 8 | 1 | 57 | green | 0 | 0.513725 |
0 | 4 | 4 | 0 | blue | 0 | 0.459916 |
1 | 6 | 2 | 78 | blue | 0 | 0.459916 |
0 | 5 | 4 | 9 | blue | 1 | 0.459916 |
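To help interpret the `color_col_encoded` values, the following is a minimal pandas-only sketch of how frequency and target encoding are commonly computed. It is an assumption that hivecore works this way, inferred from the sample values (for example, 0.270, 0.255, 0.237 and 0.238 for the four colors sum to 1, and the target-encoded values sit near 0.5, as expected for the mean of a random binary target). The actual implementation may differ, for instance by applying smoothing. The column names `color_col_freq` and `color_col_target` are hypothetical.

```python
import pandas as pd

# Tiny hand-written data set so the results are easy to verify.
df = pd.DataFrame({
    'color_col': ['red', 'green', 'red', 'blue', 'red', 'green'],
    'target_col': [1, 0, 0, 1, 1, 1],
})

# Frequency encoding: replace each category with its share of the rows.
freq_map = df['color_col'].value_counts(normalize=True)
df['color_col_freq'] = df['color_col'].map(freq_map)

# Target encoding (plain, unsmoothed): replace each category with the
# mean of the target over the rows belonging to that category.
target_map = df.groupby('color_col')['target_col'].mean()
df['color_col_target'] = df['color_col'].map(target_map)

print(df)
```

Here `red` appears in 3 of 6 rows, so its frequency encoding is 0.5, and two of its three targets are 1, so its target encoding is about 0.667.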