Package Notes
pandas tricks¶
`pd.cut()` bins the values of a series (or any 1-D array) into categories. There are two ways to do so:
- By specifying how many categories we need, like so: `pd.cut(series, NO_OF_CATEGORIES)`. This produces equal-width bins.
- By specifying the exact bin edges, like so: `pd.cut(series, bins=[0., 1.5, 3.0, 4.5, np.inf], labels=[1, 2, 3, 4])`
  - Labelling gives a name (in this case ordinal) to each of the categories. There must be one label per bin, i.e. one fewer than the number of bin edges, so five edges take four labels.
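A minimal sketch of both ways of calling `pd.cut()`. The `income` series and its values are made up for illustration; note that five bin edges define four intervals, so four labels are passed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, made up for this example.
income = pd.Series([0.5, 1.2, 2.0, 3.5, 4.0, 5.5, 9.0])

# Way 1: ask for a number of equal-width categories.
by_count = pd.cut(income, 3)

# Way 2: give explicit bin edges; 5 edges define 4 intervals, so 4 labels.
by_bins = pd.cut(
    income,
    bins=[0.0, 1.5, 3.0, 4.5, np.inf],
    labels=[1, 2, 3, 4],
)

print(by_count.value_counts().sort_index())
print(by_bins.tolist())  # [1, 1, 2, 3, 3, 4, 4]
```

Each interval is closed on the right by default, so a value of exactly `1.5` would fall into the first bin.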
pd.DataFrame¶
- #concept `axis` refers to the rows, the columns or a higher dimension of the data. `axis=0` refers to rows and `axis=1` refers to columns. In an array with more than two dimensions, `axis=2` can refer to, e.g., time, if we have data across years and each row corresponds to the same entity.
- `loc` and `iloc` have this commonly known distinction: the first indexes by label, the latter by integer position.
- But also know that when we take a subset of a dataframe (say, `main`), for example while test-train splitting the data, the original index values of `main` are preserved as the labels of the subsets.
  - e.g. say `main` has the labels `[1,2,3,4,5,6,7]`, and upon splitting it we get subsets with the labels `A = [1,3,4,5]` and `B = [2,6,7]`.
  - So now note that we cannot do `A.loc[range(1,5)]`, since we don't have label 2 in our `A`. But `A.iloc[range(0,4)]` will work, since positions always run in sequence from 0 to `len(A) - 1`.
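The split described above can be sketched as follows. The frame `main`, its `value` column and the manual split are assumptions made for illustration; a real train-test split would pick the rows differently but preserve labels in the same way.

```python
import pandas as pd

# Hypothetical frame carrying the labels 1..7 used in the notes above.
main = pd.DataFrame({"value": list("abcdefg")}, index=range(1, 8))

# A manual split standing in for a train/test split.
A = main.loc[[1, 3, 4, 5]]
B = main.loc[[2, 6, 7]]

print(list(A.index))  # [1, 3, 4, 5]  (labels carried over from main)

# Label-based lookup fails: label 2 lives in B, not in A.
try:
    A.loc[range(1, 5)]
except KeyError as err:
    print("KeyError:", err)

# Position-based lookup ignores labels; valid positions are 0..len(A)-1.
print(A.iloc[range(0, len(A))])

# reset_index(drop=True) gives the subset fresh 0-based labels if needed.
A_fresh = A.reset_index(drop=True)
print(list(A_fresh.index))  # [0, 1, 2, 3]
```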
Tricks¶
`list(df)` gives us a plain Python list of all the column names of the `pd.DataFrame`, whereas `df.columns` gives us a `pd.Index` of column names.
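A quick check of the difference, using a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical frame

cols_list = list(df)     # plain Python list of column names
cols_index = df.columns  # pd.Index of column names

print(cols_list)                  # ['a', 'b']
print(type(cols_index).__name__)  # Index
```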
pd.Series¶
- `pd.Series.apply()` applies a function to each element of the series and returns another series with the output of each of those function calls, in the same order.
- If you want the "label" of each of the items in a pandas series, it's actually called `index`.
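A small sketch of both points; the series and its labels are invented for the example.

```python
import pandas as pd

# Hypothetical series with explicit labels.
s = pd.Series([1, 2, 3], index=["x", "y", "z"])

# apply() calls the function once per element and keeps the order and labels.
squared = s.apply(lambda v: v ** 2)

print(squared.tolist())     # [1, 4, 9]
print(list(s.index))        # ['x', 'y', 'z']
print(list(squared.index))  # ['x', 'y', 'z']
```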
pd.plotting¶
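import matplotlib.pyplot as plt

# housing is a DataFrame with longitude, latitude, population
# and median_house_value columns.
housing.plot(
    kind="scatter", x="longitude", y="latitude",
    alpha=0.4,                      #(2)
    s=housing["population"] / 100,  #(1)
    label="population",
    c="median_house_value",         #(3)
    colorbar=True,
    cmap=plt.get_cmap("jet"),       #(4)
)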
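- (1) `s` is the marker size, i.e. the area of each circle in points squared (not the radius).
- (2) `alpha` is the opacity.
- (3) `c` is the color. Always give the name of the column (e.g. `"population"` instead of `housing['population']`) whenever possible; for `s` we passed the series itself because we needed to adjust the size.
- (4) `cmap` gives us the color map.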
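`scatter_matrix` (from `pandas.plotting`) draws a grid of pairwise scatter plots between numeric columns, which helps to spot correlations visually.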
numpy tricks¶
np.ndarray¶
- N-dimensional array
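- `iloc` and `loc` are pandas indexers, not `np.ndarray` attributes; an `ndarray` is indexed by integer position directly, e.g. `X[i, j]`.
- On a pandas object, use `iloc` to reference by integer position.
- Use `loc` to reference by label. If no labels were set for the data, the default integer index acts as the labels, so `loc` accepts those integers.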
Intuition¶
Indices¶
- What does `X[:,:]` mean?
  - How did you say all rows? How did you say all columns?
- So, what is `X[a:b,:]` (where `a` and `b` are integers)?
  - Rows/Records are the first dimension, so the first index (here `a:b`) always refers to rows. This is what we call `axis=0`.
  - Columns/Headers/Attributes are the second dimension, so the second index (here `:`) always refers to columns. This is what we call `axis=1`.
  - Thus, `X[a:b,:]` means "select all columns from the records whose indices start from `a` and do not equal or exceed `b`".
  - Similarly, `X[:,:c]` means "select those columns with index \(<\) `c` from all records".
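The slices above can be tried on a small array. The array `X` below is a made-up 3-by-4 example, and the `sum` calls are added only to show which direction each axis runs in.

```python
import numpy as np

# Hypothetical 3x4 array: 3 rows (records), 4 columns (attributes).
X = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

everything = X[:, :]     # all rows, all columns
first_rows = X[0:2, :]   # rows 0 and 1, all columns
first_cols = X[:, :3]    # all rows, columns 0, 1 and 2

print(first_rows.shape)  # (2, 4)
print(first_cols.shape)  # (3, 3)

# axis=0 runs down the rows, so summing over it leaves one value per column.
print(X.sum(axis=0).tolist())  # [12, 15, 18, 21]
# axis=1 runs across the columns, so summing over it leaves one value per row.
print(X.sum(axis=1).tolist())  # [6, 22, 38]
```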
General Tricks¶
Typecasting¶
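`pd.Series.astype(np.int8)` converts numeric strings to integers (or floats to integers, dropping the fractional part). Note that `np.int8` only holds values from -128 to 127, so pick a wider type such as `np.int64` for larger numbers.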
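A short sketch of both conversions, with made-up values that fit inside the `np.int8` range:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: numeric strings and floats.
as_text = pd.Series(["1", "2", "3"])
as_float = pd.Series([1.9, -2.7, 3.0])

text_to_int = as_text.astype(np.int8)
float_to_int = as_float.astype(np.int8)  # fractional part is dropped

print(text_to_int.tolist())   # [1, 2, 3]
print(float_to_int.tolist())  # [1, -2, 3]
print(text_to_int.dtype)      # int8
```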