Encoder

Example

To use the Encoder class, instantiate it with a dataframe and call encode():

import pandas as pd
from hivecore.preprocess import Encoder

# Define the input data. Polars and PySpark dataframes can be used as well.
pandas_data = pd.DataFrame({
    'binary_col': (pd.Series([0, 1], dtype='int64').sample(n=1000, replace=True)).reset_index(drop=True),
    'ordinal_col': (pd.Series(range(0, 10), dtype='int64').sample(n=1000, replace=True)).reset_index(drop=True),
    'one_hot_col': (pd.Series(range(0, 5), dtype='int64').sample(n=1000, replace=True)).reset_index(drop=True),
    'label_col': (pd.Series(range(0, 100), dtype='int64').sample(n=1000, replace=True)).reset_index(drop=True),
    'color_col': (pd.Series(['red', 'green', 'blue', 'yellow']).sample(n=1000, replace=True)).reset_index(drop=True),
    'target_col': (pd.Series([0, 1], dtype='int64').sample(n=1000, replace=True)).reset_index(drop=True),
})

# If no columns are specified, the encoder tries to apply the method to every column.
# If no method is provided, it tries to select the best method for each column's data.
print(Encoder(pandas_data).encode(method='frequency', columns=['color_col']))

print('-'*100)

# If target_col is not defined, the last column of the dataframe is used as the target by default.
print(Encoder(pandas_data).encode(method='target', columns=['color_col'], target_col='target_col'))

Expected Output

The output should look similar to the following. Exact values will vary because the example uses randomly sampled synthetic data.

binary_col  ordinal_col  one_hot_col  label_col  color_col  target_col  color_col_encoded
1           5            1            49         yellow     0           0.518519
1           4            2            66         red        1           0.424370
1           9            3            19         yellow     0           0.518519
1           7            2            7          red        1           0.424370
0           9            2            25         red        0           0.424370
1           5            1            49         yellow     0           0.270
1           4            2            66         red        1           0.238
1           9            3            19         yellow     0           0.270
1           7            2            7          red        1           0.238
0           9            2            25         red        0           0.238


binary_col  ordinal_col  one_hot_col  label_col  color_col  target_col  color_col_encoded
1           8            2            85         yellow     1           0.270
1           8            1            57         green      0           0.255
0           4            4            0          blue       0           0.237
1           6            2            78         blue       0           0.237
0           5            4            9          blue       1           0.237
1           8            2            85         yellow     1           0.518519
1           8            1            57         green      0           0.513725
0           4            4            0          blue       0           0.459916
1           6            2            78         blue       0           0.459916
0           5            4            9          blue       1           0.459916
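
As a rough illustration of what the 'frequency' and 'target' methods used in the example compute, here is a minimal sketch in plain pandas. It is not hivecore's implementation. It assumes that 'frequency' maps each category to its share of rows and that 'target' maps each category to the mean of the target column within that category. That reading is consistent with the sample values in the expected output, where the four frequency-style values 0.270, 0.255, 0.237 and 0.238 sum to 1. The helper names frequency_encode and target_encode are hypothetical and exist only for this sketch.

```python
import pandas as pd


def frequency_encode(df: pd.DataFrame, column: str) -> pd.Series:
    """Map each category to the fraction of rows it occupies (assumed behaviour)."""
    shares = df[column].value_counts(normalize=True)
    return df[column].map(shares)


def target_encode(df: pd.DataFrame, column: str, target: str) -> pd.Series:
    """Map each category to the mean of `target` within that category (assumed behaviour)."""
    means = df.groupby(column)[target].mean()
    return df[column].map(means)


# Tiny hand-made dataset so the results can be checked by eye.
toy = pd.DataFrame({
    "color_col": ["red", "red", "blue", "yellow"],
    "target_col": [1, 0, 1, 0],
})

toy["color_col_freq"] = frequency_encode(toy, "color_col")
toy["color_col_target"] = target_encode(toy, "color_col", "target_col")
print(toy)
```

In this toy data, 'red' covers two of four rows, so its frequency encoding is 0.5, and its target values are 1 and 0, so its target encoding is also 0.5. 'blue' and 'yellow' each get a frequency of 0.25 and target encodings of 1.0 and 0.0 respectively.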