Using argparse
As in the diagram above, we use a standard structure to organize a small project:
- The folder named data that contains our dataset
- The train.py file
- The options.py file for specifying hyperparameters
First, we can create a file train.py in which we have the basic procedures for importing data, training the model on the training data and evaluating it on the test set:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from options import train_options

# Forward slashes in the path work on Windows, macOS and Linux
df = pd.read_csv('data/hour.csv')
print(df.head())
opt = train_options()

# Features and target
X = df.drop(['instant', 'dteday', 'atemp', 'casual', 'registered', 'cnt'], axis=1).values
y = df['cnt'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

if opt.normalize:
    # Fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

rf = RandomForestRegressor(n_estimators=opt.n_estimators, max_features=opt.max_features, max_depth=opt.max_depth)
model = rf.fit(X_train, y_train)
y_pred = model.predict(X_test)

# scikit-learn metrics take the true values first, then the predictions
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print("rmse: ", rmse)
print("mae: ", mae)
In the code, we also import the train_options function that is included in the options.py file. The latter file is a Python file from which we can change the hyperparameters considered in train.py:
import argparse


def train_options():
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction (Python 3.9+) adds the flags --normalize and --no-normalize;
    # type=bool would turn any non-empty string, including "False", into True
    parser.add_argument("--normalize", default=True, action=argparse.BooleanOptionalAction, help='standardize the features')
    parser.add_argument("--n_estimators", default=100, type=int, help='number of estimators')
    parser.add_argument("--max_features", default=6, type=int, help='maximum number of features')
    parser.add_argument("--max_depth", default=5, type=int, help='maximum depth')
    opt = parser.parse_args()
    return opt
In this example, we use the argparse library, which is very popular for parsing command line arguments. First, we initialize the parser, and then we can add the arguments we want to access.
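One caveat concerns Boolean options such as --normalize. argparse passes the raw command-line string to the callable given as type, and bool() of any non-empty string is True, so type=bool cannot be used to switch an option off. The sketch below illustrates this and shows one possible alternative, argparse.BooleanOptionalAction, assuming Python 3.9 or later. The parser names naive and fixed are only for illustration, and the arguments are passed as lists instead of being read from the command line.

```python
import argparse

# type=bool calls bool() on the raw string, and any non-empty string is truthy.
naive = argparse.ArgumentParser()
naive.add_argument("--normalize", default=True, type=bool)
naive_opt = naive.parse_args(["--normalize", "False"])
print(naive_opt.normalize)  # True, although "False" was passed

# Assumption: Python 3.9+, where argparse.BooleanOptionalAction is available.
# It creates a pair of flags: --normalize and --no-normalize.
fixed = argparse.ArgumentParser()
fixed.add_argument("--normalize", default=True, action=argparse.BooleanOptionalAction)
fixed_opt = fixed.parse_args(["--no-normalize"])
print(fixed_opt.normalize)  # False
```

With this action, normalization would be disabled with python train.py --no-normalize.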
Here is an example of running the code:
python train.py
To change the default values of hyperparameters, there are two ways. The first option is to set a different default value in the options.py file. The other option is to pass the hyperparameter value from the command line:
python train.py --n_estimators 200
We specify the name of each hyperparameter to be changed, followed by its new value. Several hyperparameters can be changed in the same command:
python train.py --n_estimators 200 --max_depth 7
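To see what these commands do to the parsed options, the same arguments can be handed to parse_args as a list instead of being typed in a shell. This is a minimal sketch: the helper name build_parser is hypothetical, and only the three integer hyperparameters from options.py are reproduced.

```python
import argparse


def build_parser():
    # Hypothetical helper mirroring the integer options defined in options.py
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", default=100, type=int, help="number of estimators")
    parser.add_argument("--max_features", default=6, type=int, help="maximum number of features")
    parser.add_argument("--max_depth", default=5, type=int, help="maximum depth")
    return parser


parser = build_parser()

# Equivalent to: python train.py
defaults = parser.parse_args([])
print(vars(defaults))    # {'n_estimators': 100, 'max_features': 6, 'max_depth': 5}

# Equivalent to: python train.py --n_estimators 200 --max_depth 7
overridden = parser.parse_args(["--n_estimators", "200", "--max_depth", "7"])
print(vars(overridden))  # {'n_estimators': 200, 'max_features': 6, 'max_depth': 7}
```

Options that are not mentioned on the command line keep their default values, and type=int converts the incoming strings to integers.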
Using JSON files
As before, we can keep a similar file structure. In this case, we replace the options.py file with a JSON file: we specify the hyperparameter values in the JSON file and read them in train.py. JSON, which stores data as key-value pairs, can be a quick and intuitive alternative to the argparse library. Below we create an options.json file that contains the values train.py will read.
{
"normalize":true,
"n_estimators":100,
"max_features":6,
"max_depth":5
}
As you can see above, it looks very similar to a Python dictionary. Unlike a dictionary, though, a JSON file is plain text that must be parsed before it can be used. Some common data types also have slightly different syntax. For example, Boolean values are written true/false in JSON, while Python uses True/False. JSON also supports arrays, which are written in square brackets and become Python lists.
The beauty of using JSON data in Python is that it can be converted to a Python dictionary with the json.load function:
import json

with open("options.json") as f:  # the with block closes the file after reading
    parameters = json.load(f)
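The type conversions between JSON and Python can be checked directly. The sketch below parses a JSON string with json.loads instead of reading a file, so that it runs on its own. The key candidate_depths is a hypothetical addition, used only to show how an array is converted.

```python
import json

# Same content as options.json, plus a hypothetical array-valued key
text = """
{
    "normalize": true,
    "n_estimators": 100,
    "max_features": 6,
    "max_depth": 5,
    "candidate_depths": [3, 5, 7]
}
"""

parameters = json.loads(text)
print(type(parameters).__name__)       # dict
print(parameters["normalize"])         # True  (JSON true becomes Python True)
print(parameters["candidate_depths"])  # [3, 5, 7]  (JSON array becomes a list)

# Going the other way, Python False is written as JSON false
print(json.dumps({"normalize": False}))  # {"normalize": false}
```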
To access a specific item, we just need to quote its key name in square brackets:
if parameters["normalize"]:
    # Fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
rf = RandomForestRegressor(
    n_estimators=parameters["n_estimators"],
    max_features=parameters["max_features"],
    max_depth=parameters["max_depth"],
    random_state=42,
)
model = rf.fit(X_train, y_train)
y_pred = model.predict(X_test)
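Square-bracket access assumes that the key is present in the file. As a small sketch of what happens otherwise, the example below parses a JSON string from which max_features has been left out on purpose, and falls back to a default with dict.get. The fallback value 6 is just the default used earlier in this article.

```python
import json

# max_features is deliberately missing from this configuration
parameters = json.loads('{"n_estimators": 100, "max_depth": 5}')

print(parameters["n_estimators"])  # 100

# Square brackets raise KeyError when the key is absent
try:
    parameters["max_features"]
except KeyError as err:
    print("missing key:", err)

# dict.get returns a fallback value instead of raising
max_features = parameters.get("max_features", 6)
print(max_features)  # 6
```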
Using YAML files
A final option is YAML. As with JSON files, we read the YAML file in Python as a dictionary to access the hyperparameter values. YAML is a human-readable data serialization language in which hierarchy is expressed through indentation (commonly two spaces per level) rather than the braces and brackets used in JSON. Here is what the options.yaml file contains:
normalize: True
n_estimators: 100
max_features: 6
max_depth: 5
In train.py, we open the options.yaml file and convert it to a Python dictionary using the load function, this time from the yaml library (the PyYAML package):
import yaml

with open('options.yaml', 'rb') as f:  # the with block closes the file after reading
    parameters = yaml.load(f, Loader=yaml.FullLoader)
As before, we can access the value of the hyperparameter using the syntax required by the dictionary.
Conclusion
A configuration file is quick to write, while argparse requires one line of code for each argument we want to add.
So we should choose the approach that best fits each situation.
For example, if we need to add comments to parameters, JSON is not suitable because it does not allow comments, while YAML (with # comments) and argparse (with help strings) are a good fit.