Building machine learning models often takes many iterations and changes. Sometimes we want to preserve our best-performing model while trying out new parameters and preprocessing steps, and we end up overwriting that best model. We could save each attempt to a new file, but that gets messy and leaves us with a lot of unnecessary files.
Often we work in a team, and each of us needs to contribute and make changes to the code.
Suddenly the code we wrote does not perform well on the same datasets. After hours of pointing fingers at each other, we find out that John removed some feature engineering we had added in an earlier version of the code.
Git is a distributed version control system that helps you track changes to your source code. It is easy to set up and learn. It gives you a flexible workflow and helps your team maintain and manage its code repository.
Let's say you want to create a model that predicts which passengers survived the Titanic shipwreck.
You create a folder named Titanic, and inside that folder you create a Jupyter notebook called logistic_regression.
 
To make it a Git repository, open your terminal, navigate to your project folder, then run
git init
Run git status to see all untracked files that need to be committed.
git status
 
As you edit files, Git sees them as modified because you've changed them since your last commit. The git status command lists these modified files along with any untracked ones (new files Git is not tracking yet).
Let's add some code to our Jupyter notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
train = pd.read_csv('titanic_train.csv')
train.head()
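# Fill only the missing ages with the mean (assigning .mean() directly would overwrite every row)
train['Age'] = train['Age'].fillna(train['Age'].mean())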
train.drop('Cabin',axis=1,inplace=True)
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
train = pd.concat([train,sex,embark],axis=1)
train.head()
# TRAIN/TEST
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1), 
                                                    train['Survived'], test_size=0.30, 
                                                    random_state=101)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
#EVALUATION
predictions = logmodel.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
I also added the Titanic train and test datasets from Kaggle. You can download them here, then put them inside the project directory.
 
When we run git status, we'll see all our changes
git status
 
As you may have noticed, there is an .ipynb_checkpoints folder that we did not create. Jupyter creates this folder to keep checkpoint copies that track our changes to the notebook. We don't want it in our code repository, so we will create a .gitignore file to exclude it from the files tracked by Git.
Create a .gitignore file using your favorite text editor or IDE, then add this line.
**/*.ipynb_checkpoints/
 
When we run git status again, .ipynb_checkpoints is now excluded from our repository. But why do we need to exclude it? Because .ipynb_checkpoints is not actually part of our project; it is just a temporary folder that Jupyter creates to track our local changes.
More on gitignore here.
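If you would rather add the ignore rule from a notebook cell than from a text editor, something like the following could work. This is only a minimal sketch: the helper name add_ignore_pattern is made up for illustration, and it runs in a throwaway directory so no real project is touched.

```python
import tempfile
from pathlib import Path


def add_ignore_pattern(repo_dir, pattern):
    """Append `pattern` to repo_dir/.gitignore unless it is already listed.

    Hypothetical helper for illustration.
    Returns True if the file was changed, False if the pattern was present.
    """
    gitignore = Path(repo_dir) / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    if pattern in existing:
        return False
    existing.append(pattern)
    gitignore.write_text("\n".join(existing) + "\n")
    return True


# Try it in a throwaway directory so no real project is touched
with tempfile.TemporaryDirectory() as tmp:
    first = add_ignore_pattern(tmp, "**/*.ipynb_checkpoints/")
    second = add_ignore_pattern(tmp, "**/*.ipynb_checkpoints/")  # already there
    contents = (Path(tmp) / ".gitignore").read_text()

print(first, second)  # True False
print(contents)
```

Calling it twice leaves a single copy of the pattern in the file, so it is safe to rerun the cell.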
Now we are ready to add our changes and commit them. Think of it as saving our current progress.
To stage all new and modified files, run
git add .
To commit them, run
git commit -m "Change this text to describe the changes you made"
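The cycle so far (init, status, add, commit) can also be rehearsed from Python in a throwaway folder before trying it on a real project. This is a rough sketch, assuming git is installed and on your PATH. The helper names git and demo_cycle, the demo identity, and the commit message are made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` and return its standard output as text."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def demo_cycle():
    """Walk through init -> status -> add -> commit in a temporary folder."""
    statuses = {}
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "Titanic"
        project.mkdir()
        (project / "logistic_regression.ipynb").write_text("{}")

        git("init", cwd=project)
        # --porcelain prints one short line per file; "??" marks untracked files
        statuses["before_add"] = git("status", "--porcelain", cwd=project)

        git("add", ".", cwd=project)
        # -c sets a throwaway identity for this one command (assumed values)
        git(
            "-c", "user.name=Demo User",
            "-c", "user.email=demo@example.com",
            "commit", "-m", "Add logistic regression notebook",
            cwd=project,
        )
        # Empty output here means there is nothing left to commit
        statuses["after_commit"] = git("status", "--porcelain", cwd=project)
    return statuses


if shutil.which("git"):  # skip quietly when git is not installed
    print(demo_cycle())
```

Before the add, the notebook shows up as untracked; after the commit, the status output is empty because everything is saved.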
 
Let's say there is something we want to try in our code. For example, we want to change how we impute the Age column.
From
train['Age'] = train['Age'].fillna(train['Age'].mean())
To
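def impute_age(cols):
    # Look values up by column label; integer keys like cols[0] are deprecated in recent pandas
    Age = cols['Age']
    Pclass = cols['Pclass']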
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
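The same rule can also be written without a row-by-row apply. Here is a minimal sketch on a tiny made-up DataFrame (the rows are illustrative, not taken from the Kaggle file). The fallback ages 37, 29, and 24 are the ones used in impute_age above.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real notebook would use the Kaggle `train` DataFrame
sample = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, np.nan, 54.0],
    "Pclass": [3, 1, 2, 3, 1],
})

# Fallback age per passenger class, mirroring impute_age:
# classes 1 and 2 get their own value, anything else falls back to 24
fallback = sample["Pclass"].map({1: 37, 2: 29}).fillna(24)

# fillna only touches rows where Age is missing and aligns on the index
sample["Age"] = sample["Age"].fillna(fallback)

print(sample["Age"].tolist())  # [22.0, 37.0, 29.0, 24.0, 54.0]
```

Known ages (22 and 54) are left alone, and only the missing ones pick up the value for their class.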
If we run git status again, we will see that our notebook file has been modified
git status
 
If we're satisfied with our code, we can commit it and push it to GitHub (or GitLab).
I will use GitHub for these steps; if you are using GitLab or another development platform, you may need to adjust a few things.
Create a new GitHub repository
 
Note the URL of our project repository
 
Then, in our terminal, we need to run this command to link our local project to GitHub
git remote add origin <your project repository url>
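Then we can push it to our GitHub repository. If Git asks for a username and password over HTTPS, note that GitHub expects a personal access token in place of your account password. The command below assumes your local branch is named master; if it is named main, push that instead.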
git push origin master
 
That's it, your project files are now in your GitHub repository.
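If you want to practice the remote and push steps without touching GitHub, a local bare repository can stand in for the remote. This is only a sketch under that assumption: it needs git on your PATH, and the helper names git and demo_push, the demo identity, and the folder names are made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` and return its trimmed standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_push():
    """Push one commit to a local bare repository standing in for GitHub."""
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp) / "remote.git"
        project = Path(tmp) / "Titanic"
        project.mkdir()

        # A bare repository has no working files; it only receives pushes
        git("init", "--bare", str(remote), cwd=tmp)

        git("init", cwd=project)
        (project / "logistic_regression.ipynb").write_text("{}")
        git("add", ".", cwd=project)
        git("-c", "user.name=Demo User", "-c", "user.email=demo@example.com",
            "commit", "-m", "Add logistic regression notebook", cwd=project)

        # Same command as in the post, with a local path instead of a GitHub URL
        git("remote", "add", "origin", str(remote), cwd=project)

        # Ask git for the branch name instead of assuming master or main
        branch = git("rev-parse", "--abbrev-ref", "HEAD", cwd=project)
        git("push", "origin", branch, cwd=project)

        local_head = git("rev-parse", "HEAD", cwd=project)
        remote_head = git("rev-parse", branch, cwd=remote)
    return branch, local_head, remote_head


if shutil.which("git"):  # skip quietly when git is not installed
    branch, local_head, remote_head = demo_push()
    print(branch, local_head == remote_head)
```

After the push, the commit hash on the remote matches the local one, which is the same thing you would see on the GitHub page of your repository.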
 
There is so much more to learn in Git, like branching, tagging, and stashing. You can read more on the Git documentation page.
Thank you for reading this blog. I hope you learned something new about managing your code with Git.