Over the past couple of months I've implemented a range of functions which I frequently use for virtually any data analysis and preprocessing task, irrespective of the dataset or ultimate goal. These functions require nothing but a Pandas DataFrame of any size and any data types, and can be accessed through simple one-line calls to gain insight into your data, clean up your DataFrames and visualize relationships between features. It is up to you whether you stick to the sensible, yet sometimes conservative, default parameters or customize the experience by adjusting them to your needs.

This package is not meant to provide an Auto-ML style API. Rather, it is a collection of functions which you can (and probably should) call every time you start working on a new project or dataset. This is useful not only for your own understanding of what you are dealing with, but also for producing plots you can show to supervisors, customers or anyone else looking for a higher-level representation and explanation of the data.

This single plot already shows us a number of important things. Firstly, we can identify columns where all or most of the values are missing. These are candidates for dropping, while those with fewer missing values might benefit from imputation. Secondly, we can often see patterns of missing rows stretching across many features. We might want to eliminate these rows first, before thinking about dropping potentially relevant features. And lastly, the additional statistics at the top and on the right side give us valuable information about the thresholds we can use for dropping rows or columns with many missing values. In our example we can see that if we drop rows with more than 30 missing values, we only lose a few entries.
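As a rough illustration, the row and column thresholds discussed for the missing-value plot could be applied in plain pandas along the following lines. This is only a sketch under stated assumptions, not the package's own implementation: the function name `drop_by_missing`, its two parameters and the toy data are all hypothetical and chosen just for this example. Rows are filtered first and columns second, following the order suggested in the text.

```python
import numpy as np
import pandas as pd


def drop_by_missing(df, max_missing_per_row, max_col_missing_ratio):
    """Hypothetical helper: drop sparse rows first, then sparse columns.

    max_missing_per_row: rows with more missing values than this are dropped.
    max_col_missing_ratio: columns whose share of missing values (computed
        on the remaining rows) exceeds this are dropped.
    """
    # Number of missing values in each row.
    missing_per_row = df.isna().sum(axis=1)
    kept_rows = df.loc[missing_per_row <= max_missing_per_row]

    # Share of missing values in each column, after the row filter.
    missing_ratio_per_col = kept_rows.isna().mean(axis=0)
    return kept_rows.loc[:, missing_ratio_per_col <= max_col_missing_ratio]


# Toy data (made up for illustration): column "b" is entirely missing,
# and the row with index 2 is missing in every column.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0],
        "b": [np.nan, np.nan, np.nan, np.nan],
        "c": ["x", None, None, "w"],
        "d": [1.0, 2.0, np.nan, 4.0],
    }
)

# Per-column and per-row counts, comparable to the summary statistics
# shown along the edges of a missing-value plot.
print(df.isna().sum(axis=0))
print(df.isna().sum(axis=1))

cleaned = drop_by_missing(df, max_missing_per_row=2, max_col_missing_ratio=0.9)
print(cleaned)
```

On a real dataset the thresholds would be read off the plot, for example a row limit of 30 as in the example above. pandas also offers `DataFrame.dropna(thresh=...)`, which counts required non-missing values instead of allowed missing ones.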