Git (Version Control) Commands You Need to Know for Your Data Science Project

Posted on 2020-09-23 16:33:52

Building machine learning models often takes many iterations and changes. Sometimes we want to preserve the best-performing model while trying out new parameters and preprocessing steps, and we end up overwriting our best model. We could save each attempt to a new file, but that looks neither clean nor professional, and we may end up with a lot of unnecessary files.

Often we work in a team, and each of us needs to contribute and make changes to the code.

Suddenly the code we wrote does not perform well on the same datasets. After many hours of pointing fingers at each other, we find out that John removed some feature engineering we had added in a previous version of the code.

What is Git and Why You Should Use It on Your Data Science Project

Git is a distributed version control system that helps you track changes to your source code. It is easy to learn and set up. It gives you a flexible workflow and helps your team maintain and manage its code repository.

Git Installation

Installing Git is very straightforward. You can install it here.

You also need to create a GitHub or GitLab account.

I will use GitHub in this tutorial.

Basic Git Tutorial

Let's say you want to create a model that predicts which passengers survived the Titanic shipwreck.

You create a folder named Titanic, and inside that folder you create a Jupyter notebook file called logistic_regression.

To make it a Git repository, open your terminal, navigate to your project folder, then run


git init

If you don't know how to navigate through your file system in the terminal, you can read this short tutorial on Mac and Windows.

Run git status to see all untracked files that need to be committed.


git status

As you edit files, Git sees them as modified because you've changed them since your last commit. The git status command lists these modified files along with any untracked files that Git has not seen before.
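If you ever want to read that status from a script or a notebook cell, here is a minimal sketch. It assumes the short machine-readable format printed by git status --porcelain, where each line starts with a two-character code followed by a space and the path. The function name parse_status and the sample text are made up for illustration; they are not real output from this project.

```python
def parse_status(porcelain_text):
    """Split `git status --porcelain` text into untracked and modified paths."""
    untracked, modified = [], []
    for line in porcelain_text.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        code, path = line[:2], line[3:]
        if code == "??":      # "??" marks a file Git is not tracking yet
            untracked.append(path)
        elif "M" in code:     # "M" in either column marks a modified file
            modified.append(path)
    return untracked, modified


# Hypothetical sample text, written by hand for this example
sample = "?? titanic_train.csv\n M logistic_regression.ipynb\n"
untracked, modified = parse_status(sample)
print("Untracked:", untracked)
print("Modified:", modified)
```

Other status codes exist (added, deleted, renamed and so on); this sketch only looks at the two cases discussed above.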

Let's add some code to our Jupyter notebook.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
train = pd.read_csv('titanic_train.csv')
train.head()
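# Fill only the missing ages with the mean age (a plain assignment would overwrite every row)
train['Age'] = train['Age'].fillna(train['Age'].mean())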
train.drop('Cabin',axis=1,inplace=True)
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
train = pd.concat([train,sex,embark],axis=1)
train.head()

# TRAIN/TEST
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1), 
                                                    train['Survived'], test_size=0.30, 
                                                    random_state=101)

from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

#EVALUATION
predictions = logmodel.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

I also added the Titanic train and test datasets from Kaggle. You can download them here, then put them inside the project directory.

When we run git status, we'll see all our changes


git status

As you may have noticed, there is a .ipynb_checkpoints folder that we did not create. Jupyter creates this folder to store checkpoint copies of our notebooks. We don't want it in our code repository, so we will create a .gitignore file to exclude it from the files tracked by Git.

Create a .gitignore file using your favorite text editor or IDE, then add this line.


**/*.ipynb_checkpoints/

When we run git status again, .ipynb_checkpoints is now excluded from our repository. But why do we need to exclude those folders and files? Because .ipynb_checkpoints is not actually part of our project; it is just a temporary folder that Jupyter creates to keep checkpoints of our local changes.

More on gitignore here
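If you prefer to create the file from Python instead of a text editor, the sketch below shows one possible way. The helper name add_ignore_pattern is hypothetical, and the example writes into a temporary folder so it does not touch a real project.

```python
import tempfile
from pathlib import Path


def add_ignore_pattern(project_dir, pattern):
    """Append pattern to project_dir/.gitignore unless it is already listed."""
    gitignore = Path(project_dir) / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if pattern not in lines:
        lines.append(pattern)
        gitignore.write_text("\n".join(lines) + "\n")
    return lines


# Temporary folder standing in for the Titanic project directory
with tempfile.TemporaryDirectory() as project_dir:
    add_ignore_pattern(project_dir, "**/*.ipynb_checkpoints/")
    # Calling it a second time does not duplicate the entry
    lines = add_ignore_pattern(project_dir, "**/*.ipynb_checkpoints/")
    print(lines)
```

Checking for the pattern first keeps the file tidy if the cell is run more than once.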

Now we are ready to add our changes and commit them. Think of it as saving our current progress.

To stage all new and modified files, run


git add .

To commit it, run


git commit -m "You can change the text here that describe what changes you did"
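The same init, add and commit steps can also be driven from Python, which may be handy inside a notebook. This is only a sketch under a few assumptions: the helper names git and snapshot are made up, the user.name and user.email values are placeholders passed for this one commit, and everything runs in a throwaway temporary folder. If Git is not installed, the example does nothing.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside cwd and return its standard output."""
    result = subprocess.run(["git", *args], cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout


def snapshot(folder, message):
    """Initialise a repository in folder, stage everything, and commit it."""
    git("init", cwd=folder)
    git("add", ".", cwd=folder)
    # Placeholder identity so the commit works even without a global Git config
    git("-c", "user.name=Demo", "-c", "user.email=demo@example.com",
        "commit", "-m", message, cwd=folder)
    return git("log", "--oneline", cwd=folder)


log = ""
if shutil.which("git"):  # skip quietly when Git is not installed
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "logistic_regression.ipynb").write_text("{}")
        log = snapshot(tmp, "Add baseline logistic regression notebook")
print(log)
```

The printed line is the short commit hash followed by the commit message, the same thing git log --oneline shows in the terminal.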

Let's say we want to try something new in our code, such as changing how we impute the Age column.

From


train['Age'] = train['Age'].fillna(train['Age'].mean())

To


def impute_age(cols):
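    # Look up values by column label; positional indexing on a labeled row is deprecated in recent pandas
    Age = cols['Age']
    Pclass = cols['Pclass']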
    
    if pd.isnull(Age):

        if Pclass == 1:
            return 37

        elif Pclass == 2:
            return 29

        else:
            return 24

    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
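The values 37, 29 and 24 above are hard-coded per passenger class. As a variation, and not the approach used in this post, the per-class value could be computed from the data itself with groupby. The sketch below uses a tiny made-up DataFrame called toy rather than the real Titanic file.

```python
import numpy as np
import pandas as pd

# Made-up data: one known and one missing age per passenger class
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, np.nan, 30.0, np.nan, 20.0, np.nan],
})

# Mean age of each class, repeated for every row of that class (NaN values are skipped)
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages with the mean of the passenger's own class
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy)
```

This avoids the row-by-row apply call and keeps the imputed values in sync with the data if the dataset changes.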

If we run git status again, we will see that our file was changed


git status

If we're satisfied with our code, we can commit it and push it to GitHub (or GitLab).

I will use GitHub in these steps. If you are using GitLab or another development platform, you may need to adjust a few things.

Create a new GitHub repository

Note the URL of our project repository

Then, in our terminal, we need to run this command to link our local project to GitHub


git remote add origin <your project repository url>

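Then we can push it to our GitHub repository. Enter your username and password if asked. Note that GitHub has since stopped accepting account passwords for Git operations over HTTPS, so you will need a personal access token in place of the password. If your default branch is called main rather than master, use main in the command below.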


git push origin master

That's it, your project files are now in your GitHub repository.
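If you want to rehearse the remote and push steps without touching a real GitHub repository, one option is to push to a local "bare" repository that plays the role of the remote. The sketch below assumes that setup. The git helper, the folder names and the placeholder identity are all made up for illustration, and it pushes HEAD so that it works whether the default branch is called master or main.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside cwd and return its standard output."""
    result = subprocess.run(["git", *args], cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout


pushed = ""
if shutil.which("git"):  # skip quietly when Git is not installed
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp, "remote.git")  # local stand-in for the GitHub repository
        local = Path(tmp, "Titanic")
        local.mkdir()
        git("init", "--bare", str(remote), cwd=tmp)

        # One commit in the local project, made with a placeholder identity
        git("init", cwd=local)
        (local / "logistic_regression.ipynb").write_text("{}")
        git("add", ".", cwd=local)
        git("-c", "user.name=Demo", "-c", "user.email=demo@example.com",
            "commit", "-m", "Initial commit", cwd=local)

        # Link the local project to the stand-in remote, then push the current branch
        git("remote", "add", "origin", str(remote), cwd=local)
        git("push", "origin", "HEAD", cwd=local)
        pushed = git("branch", "--list", cwd=remote)
print(pushed)
```

The printed branch list comes from the stand-in remote, so seeing a branch name there means the push arrived.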

There is so much more to learn in Git, like branching, tagging, stashing, etc. You can read more on the documentation page.

Thank you for reading this blog. I hope you learned something new about managing your code using Git.