Using argparse

We use a standard structure to organize our small projects:

  • The folder named data that contains our dataset

  • The train.py file

  • The options.py file for specifying hyperparameters

First, we create a train.py file that imports the data, trains the model on the training set, and evaluates it on the test set:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from options import train_options

# Read the hyperparameters from the command line (see options.py)
opt = train_options()

# A forward slash works on Windows, macOS and Linux
df = pd.read_csv('data/hour.csv')
print(df.head())

# Features are the remaining columns; the target is 'cnt'
X = df.drop(['instant', 'dteday', 'atemp', 'casual', 'registered', 'cnt'], axis=1).values
y = df['cnt'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

if opt.normalize:
    # Fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

rf = RandomForestRegressor(
    n_estimators=opt.n_estimators,
    max_features=opt.max_features,
    max_depth=opt.max_depth,
)
model = rf.fit(X_train, y_train)
y_pred = model.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print("rmse: ", rmse)
print("mae: ", mae)

In the code, we also import the train_options function that is included in the options.py file. The latter file is a Python file from which we can change the hyperparameters considered in train.py:

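import argparse

def str_to_bool(value):
    # type=bool would turn any non-empty string, including "False", into True
    if value.lower() in ('true', '1', 'yes'):
        return True
    if value.lower() in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError('expected a boolean value, got %r' % value)

def train_options():
    parser = argparse.ArgumentParser()
    parser.add_argument("--normalize", default=True, type=str_to_bool, help='whether to standardize the features')
    parser.add_argument("--n_estimators", default=100, type=int, help='number of estimators')
    parser.add_argument("--max_features", default=6, type=int, help='maximum number of features')
    parser.add_argument("--max_depth", default=5, type=int, help='maximum depth')
    opt = parser.parse_args()
    return opt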

In this example, we use the argparse library, which is very popular for parsing command line arguments. First, we initialize the parser, and then we can add the arguments we want to access.
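One caveat when declaring Boolean options such as --normalize with argparse: type=bool simply calls bool() on the raw string, so any non-empty value is parsed as True. The sketch below illustrates the pitfall and one possible alternative, argparse.BooleanOptionalAction, which assumes Python 3.9 or later. The parser names naive and paired are only illustrative.

```python
import argparse

# Pitfall: bool("False") is True because the string is not empty.
naive = argparse.ArgumentParser()
naive.add_argument("--normalize", default=True, type=bool)
naive_opt = naive.parse_args(["--normalize", "False"])
print(naive_opt.normalize)  # True, although "False" was passed

# Alternative (Python 3.9+): a paired --normalize / --no-normalize flag.
paired = argparse.ArgumentParser()
paired.add_argument("--normalize", default=True,
                    action=argparse.BooleanOptionalAction)
print(paired.parse_args([]).normalize)                  # True (the default)
print(paired.parse_args(["--no-normalize"]).normalize)  # False
```

With the paired flag, normalization would be switched off with python train.py --no-normalize.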

Here is an example of running the code:

python train.py

 

There are two ways to change the values of the hyperparameters. The first is to set a different default value in the options.py file. The other is to pass the hyperparameter value on the command line:

python train.py --n_estimators 200

We need to specify the name of each hyperparameter to be changed, followed by its value. Several hyperparameters can be changed in one call:

python train.py --n_estimators 200 --max_depth 7
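The same overrides can be tried without a shell by handing parse_args an explicit list of strings. This is a minimal sketch: build_parser is a hypothetical helper that mirrors the integer options of options.py, and vars() is used only to view the parsed namespace as a dictionary.

```python
import argparse

def build_parser():
    # Hypothetical helper mirroring the integer options defined in options.py
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", default=100, type=int)
    parser.add_argument("--max_features", default=6, type=int)
    parser.add_argument("--max_depth", default=5, type=int)
    return parser

parser = build_parser()

# An empty list stands for "python train.py" with no extra arguments
defaults = vars(parser.parse_args([]))

# Equivalent to "python train.py --n_estimators 200 --max_depth 7"
overridden = vars(parser.parse_args(["--n_estimators", "200", "--max_depth", "7"]))

# Keep only the options whose value differs from the default
changed = {key: value for key, value in overridden.items() if defaults[key] != value}
print(defaults)
print(changed)
```

Options that are not mentioned on the command line, here max_features, keep their default value.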

Using JSON files

As before, we can keep a similar file structure. In this case, we replace the options.py file with a JSON file: we specify the hyperparameter values there and read them in train.py. JSON, which stores data as key-value pairs, can be a quick and intuitive alternative to the argparse library. Below we create an options.json file containing the values that the training code will read.

{
"normalize":true,
"n_estimators":100,
"max_features":6,
"max_depth":5 
}

As you can see, the syntax is very similar to a Python dictionary, but a JSON file is plain text that has to be parsed before use. Some data types are also written slightly differently: Boolean values are true/false in JSON, whereas Python uses True/False. JSON also supports arrays, which are written in square brackets like Python lists.

The beauty of using JSON data in Python is that it can be converted to a Python dictionary with the json.load function:

import json

# The with statement closes the file automatically
with open("options.json", "r") as f:
    parameters = json.load(f)
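The type differences between JSON and Python can be checked with json.loads, which parses a string instead of a file. This is a small sketch: the "layers" and "name" keys are hypothetical and only there to show an array and a string, and the null value illustrates how JSON's null becomes None.

```python
import json

# JSON text: true/null are lowercase, arrays use square brackets
text = '{"normalize": true, "max_depth": null, "layers": [64, 32], "name": "rf"}'

parsed = json.loads(text)
print(parsed)
print(type(parsed["normalize"]))  # <class 'bool'>
print(type(parsed["layers"]))     # <class 'list'>
print(parsed["max_depth"])        # None

# Converting back to text restores the JSON spelling
round_trip = json.dumps(parsed)
print(round_trip)
```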

To access a specific item, we just need to quote its key name in square brackets:

if parameters["normalize"]:
    # Fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
rf = RandomForestRegressor(n_estimators=parameters["n_estimators"],
                           max_features=parameters["max_features"],
                           max_depth=parameters["max_depth"], random_state=42)
model = rf.fit(X_train, y_train)
y_pred = model.predict(X_test)
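Square-bracket access raises a KeyError when a key is missing from the file. One possible way to guard against this, sketched here under assumptions, is to merge the file contents over a dictionary of defaults. DEFAULTS and load_options are hypothetical names, and the temporary file only stands in for options.json.

```python
import json
import os
import tempfile

# Hypothetical defaults, using the same values as the options shown earlier
DEFAULTS = {"normalize": True, "n_estimators": 100, "max_features": 6, "max_depth": 5}

def load_options(path, defaults=DEFAULTS):
    """Return the defaults updated with the values found in the JSON file."""
    with open(path, "r") as f:
        loaded = json.load(f)
    # Reject misspelled keys instead of silently ignoring them
    unknown = set(loaded) - set(defaults)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**defaults, **loaded}

# Demo: a temporary file that overrides a single value
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "options.json")
    with open(path, "w") as f:
        json.dump({"max_depth": 7}, f)
    parameters = load_options(path)

print(parameters)
```

With this approach the JSON file only needs to list the values that differ from the defaults.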

Using YAML files

A final option is to use YAML. As with JSON files, we read YAML files in Python as dictionaries to access the hyperparameter values. YAML is a human-readable data serialization language in which hierarchies are expressed through indentation (typically two spaces per level) rather than the braces and brackets used in JSON. Here is what the options.yaml file contains:

normalize: True 
n_estimators: 100
max_features: 6
max_depth: 5

In train.py, we open the options.yaml file and convert it to a Python dictionary, again with a load function, this time imported from the yaml library:

import yaml

with open('options.yaml', 'r') as f:
    parameters = yaml.load(f, Loader=yaml.FullLoader)

As before, we can access the value of the hyperparameter using the syntax required by the dictionary.
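The options.yaml file shown here is flat, so the indentation-based hierarchy does not appear in it. The sketch below assumes a nested layout with two hypothetical groups, preprocessing and model, and parses it from a string with yaml.safe_load from the third-party PyYAML package. It also shows that YAML accepts comments.

```python
import yaml  # third-party PyYAML package

# Hypothetical nested configuration; each level is indented by two spaces
text = """
preprocessing:
  normalize: true   # comments are allowed in YAML
model:
  n_estimators: 100
  max_features: 6
  max_depth: 5
"""

parameters = yaml.safe_load(text)

# Nested mappings become nested dictionaries
print(parameters["preprocessing"]["normalize"])
print(parameters["model"]["max_depth"])
```

Grouping related values this way keeps larger configuration files readable, at the cost of one extra level of dictionary access.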

Conclusion

A configuration file is quick to write, while argparse requires one line of code for each argument we want to add.

So we should choose the approach that best fits each situation.

For example, if we need to annotate parameters with comments, JSON is not suitable because it does not allow them, while YAML (with # comments) and argparse (with help strings) both work well.
