We are going to use the housing dataset from this.
python
housing.info()
text
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 12655 to 19773
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   longitude                 16512 non-null  float64
 1   latitude                  16512 non-null  float64
 2   housing_median_age        16512 non-null  float64
 3   total_rooms               16512 non-null  float64
 4   total_bedrooms            16512 non-null  float64
 5   population                16512 non-null  float64
 6   households                16512 non-null  float64
 7   median_income             16512 non-null  float64
 8   ocean_proximity           16512 non-null  object
 9   income_cat                16512 non-null  category
 10  rooms_per_household       16512 non-null  float64
 11  bedrooms_per_room         16354 non-null  float64
 12  population_per_household  16512 non-null  float64
dtypes: category(1), float64(11), object(1)
memory usage: 1.7+ MB
Now, we use BaseEstimator and TransformerMixin from sklearn.base to make a custom transformer:
BaseEstimator gives us get_params() and set_params() for free, as long as the constructor lists its hyperparameters as explicit keyword arguments instead of *args and **kwargs.
TransformerMixin gives us the .fit_transform() function for free, since we already have to define .fit() and .transform() on our own. This was mentioned in the Sklearn Design Consistency section.
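
As a minimal sketch of what the two base classes contribute, the toy transformer below defines only the constructor, fit() and transform(). The class name ScaleBy and its factor parameter are made up for illustration and are not part of the housing example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ScaleBy(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: multiplies every value by `factor`."""

    def __init__(self, factor=2.0):  # explicit keyword argument, no *args/**kwargs
        self.factor = factor

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return np.asarray(X) * self.factor


X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
scaler = ScaleBy(factor=10.0)

# Inherited from TransformerMixin: calls fit() and then transform().
scaled = scaler.fit_transform(X_demo)

# Inherited from BaseEstimator: reads the constructor arguments by name.
params = scaler.get_params()

# set_params() updates the hyperparameter in place and returns the estimator.
scaler.set_params(factor=0.5)
halved = scaler.transform(X_demo)
```

Neither fit_transform(), get_params() nor set_params() is written in the class body; all three are inherited.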
Here is what the class looks like. We want to add three derived attributes to our data (rooms_per_household, population_per_household and, optionally, bedrooms_per_room), so we write a transformer for it:
python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions in the NumPy array passed to transform()
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
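
To see what the add_bedrooms_per_room flag changes, here is a small sketch that runs the transformer on a made-up array instead of the real housing data. The values in X_toy are invented; the only requirement is that columns 3 to 6 hold rooms, bedrooms, population and households. The class is repeated so the block runs on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    # Same class as in the notes, repeated so this block is self-contained.
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]


# Invented rows: longitude, latitude, age, rooms, bedrooms, population, households
X_toy = np.array([
    [-122.0, 37.0, 30.0, 1000.0, 200.0, 500.0, 250.0],
    [-118.0, 34.0, 15.0, 3000.0, 600.0, 1500.0, 500.0],
])

# Flag on (the default): three columns are appended.
with_bedrooms = CombinedAttributesAdder().fit_transform(X_toy)

# Flag off: only the two per-household ratios are appended.
without_bedrooms = CombinedAttributesAdder(add_bedrooms_per_room=False).fit_transform(X_toy)

print(with_bedrooms.shape, without_bedrooms.shape)
```

Because the flag is a constructor argument, it behaves like any other hyperparameter and can be switched on or off when comparing preprocessing variants.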
NOTE: np.c_ concatenates arrays as columns. np.c_[X, rooms_per_household, population_per_household] concatenates the already 2D X with the two 1D arrays rooms_per_household and population_per_household, each treated as one column. If X has shape (n, k), the final output has shape (n, k + 2): the original columns followed by the two new ones.
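
A small sketch of np.c_ on made-up arrays (the names X_small, col_a and col_b are only for illustration) shows the columns being appended on the right:

```python
import numpy as np

X_small = np.array([[1, 2],
                    [3, 4],
                    [5, 6]])                  # shape (3, 2)
col_a = np.array([10, 20, 30])               # shape (3,)
col_b = np.array([0.5, 0.25, 0.125])         # shape (3,)

# Each 1D array becomes one extra column next to the existing 2D array.
combined = np.c_[X_small, col_a, col_b]      # shape (3, 4)

# np.column_stack builds the same result from a tuple of arrays.
same = np.column_stack((X_small, col_a, col_b))

print(combined)
```

The row count has to match across all inputs; only the number of columns grows.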
It is worth practicing this pattern several times until writing a custom transformer feels natural.