Package Notes
pandas
tricks¶
pd.cut()
bins the values of a series (or any 1-D array-like) into discrete intervals. There are two ways to do so:
- By specifying how many equal-width categories we need, like so
pd.cut(series, NO_OF_CATEGORIES)
- By specifying the exact bin edges of the categories, like so
pd.cut(series, bins=[0., 1.5, 3.0, 4.5, np.inf], labels=[1, 2, 3, 4])
- Labelling is done to give a name (in this case ordinal) to each of the categories. There must be exactly one label per bin, i.e. one fewer than the number of bin edges.
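A minimal sketch of both ways of calling pd.cut(). The series name `income` and its values are made up for illustration. Five bin edges define four bins, so the sketch passes four labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data, only for illustration.
income = pd.Series([0.5, 1.2, 2.0, 3.5, 5.0, 9.0])

# Way 1: ask for a number of equal-width bins.
equal_width = pd.cut(income, 3)

# Way 2: give explicit bin edges and one label per bin.
# Bins are right-inclusive by default: (0, 1.5], (1.5, 3], (3, 4.5], (4.5, inf]
labelled = pd.cut(
    income,
    bins=[0., 1.5, 3.0, 4.5, np.inf],
    labels=[1, 2, 3, 4],
)

print(equal_width.cat.categories)
print(labelled.tolist())
```

Both results are categorical series, so the bins keep their order when sorted or grouped.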
pd.DataFrame
¶
- #concept axis refers to the row, column or a higher-dimensional direction of the data: axis=0 refers to rows and axis=1 refers to columns. A pd.DataFrame has only these two axes; in a higher-dimensional array, axis=2 can refer to time, if we have data across years and each row corresponds to the same entity.
- loc and iloc have this commonly known distinction: the first is used to index by label, the latter by integer position.
- But also know that when we take a subset of the dataframe (say, main), for instance while test-train splitting the data, the original index values of main are preserved as the labels of the subsets.
- e.g. if the labels of main are [1,2,3,4,5,6,7], then upon splitting it we may get subsets A with labels [1,3,4,5] and B with labels [2,6,7].
- So now note that we cannot do A.loc[range(1,5)], since we don't have label 2 in our A (it raises a KeyError). But A.iloc[range(0,4)] will definitely work, since positions always run in sequence from 0 to len(A)-1.
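The label-versus-position distinction above can be sketched as follows. The frame `main`, its `value` column and the hand-picked split are illustrative assumptions, not a real train-test split, and the KeyError behaviour assumes a reasonably recent pandas (1.0 or later).

```python
import pandas as pd

# Hypothetical frame whose labels are 1..7, as in the example above.
main = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50, 60, 70]},
    index=[1, 2, 3, 4, 5, 6, 7],
)

# A hand-made "split": the subsets keep the labels of main.
A = main.loc[[1, 3, 4, 5]]
B = main.loc[[2, 6, 7]]

# Label-based lookup fails because label 2 is not in A.
try:
    A.loc[list(range(1, 5))]
    loc_failed = False
except KeyError:
    loc_failed = True

# Position-based lookup works: A has 4 rows, so positions 0..3 exist.
first_four = A.iloc[list(range(0, 4))]

print(loc_failed)
print(first_four)
```

If positional labels are wanted after a split, `A.reset_index(drop=True)` gives the subset a fresh 0-based index.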
Tricks¶
list(df)
gives us a plain Python list of all the column names of the pd.DataFrame, as opposed to df.columns, which gives us a pd.Index of column names.
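A small sketch of the difference, using a made-up two-column frame:

```python
import pandas as pd

# Hypothetical frame, only for illustration.
df = pd.DataFrame({"longitude": [1.0, 2.0], "latitude": [3.0, 4.0]})

as_list = list(df)      # plain Python list of column names
as_index = df.columns   # pd.Index of column names

print(as_list)
print(type(as_index))
```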
pd.Series
¶
- pd.Series.apply() applies a function to each element of the series and returns another series with the output of each of those function calls, in the same order.
- If you want the "label" of each of the items in a pandas series, it's actually called the index.
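A minimal sketch of both points, with a made-up series `prices`:

```python
import pandas as pd

# Hypothetical series with explicit labels.
prices = pd.Series([1.0, 2.5, 4.0], index=["a", "b", "c"])

# apply() calls the function once per element and keeps order and labels.
doubled = prices.apply(lambda x: x * 2)

# The "labels" of the items live in the index.
labels = list(prices.index)

print(doubled)
print(labels)
```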
pd.plotting
¶
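import matplotlib.pyplot as plt

# housing is a pd.DataFrame with longitude, latitude, population
# and median_house_value columns
housing.plot(
    kind="scatter", x="longitude", y="latitude",
    alpha=.4,                      #(2)
    s=housing["population"] / 100, #(1)
    label="population",
    c="median_house_value",        #(3)
    colorbar=True,
    cmap=plt.get_cmap("jet"),      #(4)
)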
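- (1) s is the marker size (an area in points squared, not a radius).
- (2) alpha is the opacity.
- (3) c is the color. Always give the name of the column (e.g. "population" instead of housing['population']) whenever possible; for s we had to pass the series itself because the size needed an adjustment.
- (4) cmap gives us the color map.
scatter_matrix
(from pandas.plotting) plots every numeric column against every other one, which is a quick way to spot correlations visually.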
numpy
tricks¶
np.ndarray
¶
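- N-dimensional array
- An np.ndarray has no labels, so it is indexed by position with square brackets, e.g. X[i, j]. iloc and loc exist only on pandas objects.
- On a pd.DataFrame or pd.Series, use iloc to reference by position and loc to reference by label. If no labels were set, the default labels are the integers 0, 1, 2, ..., so loc also accepts those integers.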
Intuition¶
Indices¶
- What does X[:,:] mean?
- How did you say all rows? How did you say all columns?
- So, what is X[a:b,:] (where a and b are integers)?
  - Rows/Records are the first dimension, so the first index (here a:b) always refers to rows. This is what we call axis=0
  - Columns/Headers/Attributes are the second dimension, so the second index (here :) always refers to columns. This is what we call axis=1
  - Thus, X[a:b,:] means "select all columns from the records whose indices start from a and are strictly less than b"
  - Similarly, X[:,:c] means "select those columns with index \(<\) c from all records"
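The slicing rules above can be checked on a small array. The array `X` and the values of `a`, `b` and `c` are arbitrary choices for illustration.

```python
import numpy as np

# A 4x5 array: 4 rows (axis=0), 5 columns (axis=1).
X = np.arange(20).reshape(4, 5)
a, b, c = 1, 3, 2

everything = X[:, :]   # all rows, all columns
rows = X[a:b, :]       # rows a..b-1, all columns
cols = X[:, :c]        # all rows, columns 0..c-1

print(rows.shape)      # (2, 5)
print(cols.shape)      # (4, 2)
```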
General Tricks¶
Typecasting¶
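pd.Series.astype(np.int8)
converts a series of integer strings to integers (or floats to integers, dropping the fractional part). np.int8 only holds values from -128 to 127, so use a wider type such as np.int64 for larger values.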
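A short sketch of both conversions. The two series are made up, and all values are kept small so that they fit into np.int8.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs, only for illustration.
as_strings = pd.Series(["1", "2", "3"])
as_floats = pd.Series([1.9, -2.7, 3.0])

# Strings that look like integers are parsed.
ints_from_strings = as_strings.astype(np.int8)

# Floats are truncated toward zero, not rounded.
ints_from_floats = as_floats.astype(np.int8)

print(ints_from_strings.tolist())
print(ints_from_floats.tolist())
```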