When to use a custom index instead of regular columns in pandas

In pandas, you can replace the default integer-based index with an index built from any number of columns using set_index().

What confuses me is when you would want to do this. Whether a series is a column or part of an index, you can filter on its values, using boolean indexing for columns or xs() for index levels. You can also sort by column or by index, using sort_values() or sort_index().
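To make the comparison concrete, here is a minimal sketch of both routes. The DataFrame and its column names (city, year, sales) are made up for illustration.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Bergen", "Oslo", "Tromso"],
    "year": [2020, 2020, 2021, 2021],
    "sales": [10, 7, 12, 3],
})

# Column route: boolean mask, then sort by a column.
by_column = df[df["city"] == "Oslo"].sort_values("year")

# Index route: move both columns into a MultiIndex,
# then select by label with xs() and sort by the remaining index level.
indexed = df.set_index(["city", "year"])
by_index = indexed.xs("Oslo", level="city").sort_index()

# Both routes reach the same sales figures.
print(by_column["sales"].tolist())
print(by_index["sales"].tolist())
```

Note that xs() drops the selected level by default, so by_index is indexed by year only.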

The only real difference I've come across is that indexes have problems with duplicate values, so it seems like using the index is more restrictive if anything.

Why, then, would I want to convert my columns to an index in pandas?

+3




2 answers


In my opinion, custom indexes are good for quickly selecting data.

They are also useful for aligning data when mapping, for arithmetic operations where the index is used to align data, for joining data, and for getting the rows with the minimum or maximum value in each group.
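As a rough sketch of two of these uses (mapping through an index, and fetching the maximum row per group), assuming a made-up lookup Series named country_of and a made-up sales table:

```python
import pandas as pd

# Hypothetical lookup table: the index holds the keys, the values hold the labels.
country_of = pd.Series({"Oslo": "NO", "Paris": "FR", "Lyon": "FR"})

df = pd.DataFrame({
    "city": ["Paris", "Oslo", "Lyon", "Paris"],
    "sales": [5, 10, 8, 9],
})

# Series.map looks each city up in the index of country_of.
df["country"] = df["city"].map(country_of)

# idxmax returns index labels, so .loc can fetch the whole row
# holding the largest sales value in each country.
top_rows = df.loc[df.groupby("country")["sales"].idxmax()]
print(top_rows)
```

The second step works because idxmax() hands back index labels rather than positions, which is exactly what .loc expects.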

A DatetimeIndex is convenient for partial string indexing of rows and for resampling.
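A small sketch of both features, using a made-up daily series for January and February 2021:

```python
import pandas as pd

# Hypothetical daily series covering January and February 2021 (59 days).
idx = pd.date_range("2021-01-01", "2021-02-28", freq="D")
ts = pd.Series(range(len(idx)), index=idx)

# Partial string indexing: a month string selects every row in that month.
january = ts.loc["2021-01"]

# Resampling: aggregate the daily values into weekly sums.
weekly = ts.resample("W").sum()

print(len(january))
print(weekly.head())
```

Neither call would work if the dates were kept in an ordinary column; they rely on the dates being the index.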

But you're right, a duplicated index is problematic, especially for reindexing.
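A minimal sketch of the problem, with a made-up Series whose label "b" appears twice. Dropping the later duplicate is just one possible workaround and assumes the first occurrence is the one you want:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "b"])  # label "b" appears twice

# Selecting a duplicated label returns several rows instead of one value.
both_b = s.loc["b"]

# Reindexing cannot decide which "b" to use, so pandas raises ValueError.
try:
    s.reindex(["a", "b", "c"])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Possible workaround (an assumption about intent): keep the first occurrence of each label.
deduplicated = s[~s.index.duplicated(keep="first")]
reindexed = deduplicated.reindex(["a", "b", "c"])

print(reindex_failed)
print(reindexed)
```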



From the docs:

  • Identifies data (i.e., provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
  • Enables automatic and explicit data alignment.
  • Allows intuitive getting and setting of subsets of the data set.

You can also check Modern Pandas - Indexes.

+2




As of pandas 0.20.2, some methods, such as .unstack(), only work with indices.
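For instance, a sketch with made-up city/year/sales data: the keys have to be moved into the index before .unstack() can pivot one of them into columns.

```python
import pandas as pd

# Hypothetical long-format data.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# unstack pivots an index level into columns,
# so city and year must be index levels first.
wide = df.set_index(["city", "year"])["sales"].unstack("year")
print(wide)
```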

Custom indexes, especially time-based ones, can be particularly useful. Resampling and aggregating over any time interval both require a DatetimeIndex (the latter is done with .groupby() and pd.TimeGrouper(); newer pandas versions use pd.Grouper(freq=...) instead). Beyond that, you can call .plot() on a column, for example df['column'].plot(), and get a time series graph immediately.
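A rough sketch of the time-based aggregation, on made-up data. It uses pd.Grouper(freq=...) rather than pd.TimeGrouper(), because the latter was deprecated and later removed from pandas. The plotting call is left as a comment since it needs matplotlib installed.

```python
import pandas as pd

# Hypothetical two weeks of daily values with a DatetimeIndex.
idx = pd.date_range("2021-01-01", periods=14, freq="D")
df = pd.DataFrame({"column": range(14)}, index=idx)

# Aggregate per week with groupby plus a time-based Grouper...
weekly_grouped = df.groupby(pd.Grouper(freq="W"))["column"].sum()

# ...or get the same totals through resample.
weekly_resampled = df["column"].resample("W").sum()

print(weekly_grouped.tolist())
print(weekly_resampled.tolist())

# With matplotlib installed, df["column"].plot() would draw the values against the dates.
```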



The most useful feature, however, is alignment. Suppose you have two datasets that you want to add together; they share the same labels but are sorted in a different order. If you set the labels as the index of each dataframe, you can simply add the data together and not worry about the ordering.
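A minimal sketch of that situation, with two made-up frames a and b that share the labels x, y, z in different row order:

```python
import pandas as pd

# Hypothetical datasets: same labels, different row order.
a = pd.DataFrame({"label": ["x", "y", "z"], "value": [1, 2, 3]})
b = pd.DataFrame({"label": ["z", "x", "y"], "value": [30, 10, 20]})

# With the default integer index, rows are paired by position: x+z, y+x, z+y.
by_position = a["value"] + b["value"]

# With the labels as the index, rows are paired by label regardless of order.
by_label = a.set_index("label")["value"] + b.set_index("label")["value"]

print(by_position.tolist())
print(by_label.to_dict())
```

The positional sum silently mixes up the labels, while the label-based sum pairs x with x, y with y, and z with z.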

+1

