SMS spam microservice, Python vs Golang
Published on Mar 14, 2025 in Benchmarks
As I explained in my previous language benchmarks, measuring the performance of a language outside a concrete use case is difficult. So I decided to build the same SMS spam detection microservice in both Python and Golang.
For this I will use a multinomial naive Bayes classifier, a machine learning algorithm commonly used for spam detection, but also for emotion recognition and other text classification tasks.
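The idea is simple: the classifier picks the class c that maximizes P(c) · P(w₁|c) · P(w₂|c) · … over the words wᵢ of the message, estimating each word likelihood from training counts with add-one (Laplace) smoothing:

P(w|c) = (count(w, c) + 1) / (total words in c + |V|)

where |V| is the size of the vocabulary. Both implementations below rely on this formula (Scikit-learn's MultinomialNB uses add-one smoothing by default).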
To train the model, I'm going to use this dataset. Each message is labeled spam, or ham for legitimate content (a long-running joke in the anti-spam field).
I'm going to split the data into two unequal parts: 80% to train the model and 20% for evaluation.
Here is the code I used:
#!/usr/bin/python3
import csv
import json

def convert_dataset(src_file, train_file, eval_file, tx_eval=0.2):
    train_array = []
    eval_array = []
    train_spam = 0
    eval_spam = 0
    with open(src_file, encoding='utf-8') as file:
        # First pass: count the data rows to size the eval split
        csv_reader = csv.DictReader(file)
        total_rows = sum(1 for _ in csv_reader)
        eval_rows = tx_eval * total_rows
        # Second pass: dispatch the first 20% into eval, the rest into train
        file.seek(0)
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            if csv_reader.line_num <= eval_rows:
                eval_array.append(row)
                if row['label'] == 'spam':
                    eval_spam += 1
            else:
                train_array.append(row)
                if row['label'] == 'spam':
                    train_spam += 1
    with open(train_file, 'w', encoding='utf-8') as file:
        file.write(json.dumps(train_array, indent=4))
    with open(eval_file, 'w', encoding='utf-8') as file:
        file.write(json.dumps(eval_array, indent=4))
    # Sanity check: the spam ratio should be similar in both splits
    print('Train spam:', train_spam / len(train_array))
    print('Eval spam:', eval_spam / len(eval_array))

src_file = 'sms-spam-collection.csv'
train_file = 'train.json'
eval_file = 'eval.json'
convert_dataset(src_file, train_file, eval_file)
For large datasets, the JSONL format is preferred over JSON: each record sits on its own line, so there is no need to load and decode a huge JSON array all at once. Since this dataset is small, I used plain JSON.
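For the record, JSONL is only a few lines to produce and consume (a minimal sketch, reusing train_array from the script above):

import json

# Write: one JSON object per line, no surrounding array
with open('train.jsonl', 'w', encoding='utf-8') as f:
    for row in train_array:
        f.write(json.dumps(row) + '\n')

# Read: decode one record at a time instead of the whole file
with open('train.jsonl', encoding='utf-8') as f:
    for line in f:
        row = json.loads(line)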
I check the percentage of spam in each sub-dataset; had they been imbalanced, I would have resampled. Here I get about 13% spam in train and 15% in eval. I haven't run a χ² test to check whether the difference is significant, but at first glance it looks fine.
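If you wanted to check, such a test takes a couple of lines with SciPy (a sketch; train_spam, eval_spam and the array lengths come from the split script above):

from scipy.stats import chi2_contingency

# 2x2 contingency table: spam/ham counts in each split
table = [
    [train_spam, len(train_array) - train_spam],
    [eval_spam, len(eval_array) - eval_spam],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value: {p_value:.3f}")  # above 0.05: no significant imbalance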
In Python, I'll use Scikit-learn for the classifier and Flask for the API part. Since the libraries already do all the work, the code is very simple.
#!/usr/bin/python3
import json

from flask import Flask, request, jsonify
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
import pandas as pd

app = Flask(__name__)

def load_data():
    with open('../dataset/train.json', 'r') as file:
        data = json.load(file)
    df = pd.DataFrame(data)
    return df['message'], df['label']

def train_model():
    # Pipeline: tokenize and count words, then fit a multinomial naive Bayes
    messages, labels = load_data()
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)
    return model

# Train once at startup
model = train_model()

@app.route('/predict', methods=['PUT'])
def predict():
    data = request.get_json()
    if not data:
        return jsonify({'error': 'No data provided'}), 400
    message = data.get('message')
    if not message:
        return jsonify({'error': 'No message provided'}), 400
    prediction = model.predict([message])
    return jsonify({'data': prediction[0]})

if __name__ == '__main__':
    app.run(port=8000, debug=True)
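Once the service is running, a quick manual check confirms the endpoint works (the sample message is made up):

import requests

response = requests.put(
    "http://localhost:8000/predict",
    json={"message": "WINNER! You have been selected for a free prize."},
)
print(response.json())  # should print something like {'data': 'spam'}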
In Golang, I'm going to use Gofr for the API part. I'm more used to Gin, but for a pure API, Gofr seems a better fit to me. However, I have to implement the classifier myself, which takes more work.
Classifier package:
package classifier

import (
	"strings"
)

type DatasetFormat struct {
	Label   string `json:"label"`
	Message string `json:"message"`
}

type MultinomialNB struct {
	classCounts    map[string]int            // number of documents per class
	wordCounts     map[string]map[string]int // word frequencies per class
	totalDocuments int
	vocab          map[string]struct{} // global vocabulary
}

func NewMultinomialNB() *MultinomialNB {
	return &MultinomialNB{
		classCounts: make(map[string]int),
		wordCounts:  make(map[string]map[string]int),
		vocab:       make(map[string]struct{}),
	}
}

func (nb *MultinomialNB) Train(documents []DatasetFormat) {
	for _, doc := range documents {
		nb.classCounts[doc.Label]++
		nb.totalDocuments++
		words := strings.Fields(doc.Message)
		if _, exists := nb.wordCounts[doc.Label]; !exists {
			nb.wordCounts[doc.Label] = make(map[string]int)
		}
		for _, word := range words {
			nb.wordCounts[doc.Label][word]++
			nb.vocab[word] = struct{}{}
		}
	}
}

func (nb *MultinomialNB) Predict(doc string) string {
	words := strings.Fields(doc)
	classProbabilities := make(map[string]float64)
	for class := range nb.classCounts {
		// Prior: P(class)
		classProbabilities[class] = float64(nb.classCounts[class]) / float64(nb.totalDocuments)
		for _, word := range words {
			wordCount := float64(nb.wordCounts[class][word])
			// Note: recomputed for every word, which is wasteful (see discussion below)
			totalWordsInClass := float64(0)
			for _, count := range nb.wordCounts[class] {
				totalWordsInClass += float64(count)
			}
			// Likelihood with Laplace smoothing: P(word|class)
			classProbabilities[class] *= (wordCount + 1) / (totalWordsInClass + float64(len(nb.vocab)))
		}
	}
	// Return the class with the highest probability
	var bestClass string
	var maxProb float64
	for class, prob := range classProbabilities {
		if prob > maxProb {
			maxProb = prob
			bestClass = class
		}
	}
	return bestClass
}
Main:
package main

import (
	"encoding/json"
	"log"
	"os"
	"spam/classifier"

	"gofr.dev/pkg/gofr"
)

var nb *classifier.MultinomialNB

type PostData struct {
	Message string `json:"message"`
}

func HandlePredict(ctx *gofr.Context) (any, error) {
	var data PostData
	if err := ctx.Bind(&data); err != nil {
		return nil, err
	}
	return nb.Predict(data.Message), nil
}

func main() {
	// Load and decode the training set
	data, err := os.ReadFile("../dataset/train.json")
	if err != nil {
		log.Fatal(err)
	}
	var messages []classifier.DatasetFormat
	if err := json.Unmarshal(data, &messages); err != nil {
		log.Fatal(err)
	}

	// Train once at startup, like the Python version
	nb = classifier.NewMultinomialNB()
	nb.Train(messages)

	app := gofr.New()
	app.PUT("/predict", HandlePredict)
	app.Run()
}
I also wrote this script to replay the eval dataset sequentially, measuring both the model's accuracy and the average time per request.
#!/usr/bin/python3
import json
import requests
from time import time

api_url = "http://localhost:8000/predict"

true_positive = 0
false_positive = 0
true_negative = 0
false_negative = 0

with open('eval.json', 'r') as f:
    dataset = json.load(f)

t = time()
for entry in dataset:
    message = entry['message']
    expected_label = entry['label']
    response = requests.put(api_url, json={"message": message})
    if response.status_code == 200:
        result = response.json().get("data")
        if result == "spam":
            if expected_label == "spam":
                true_positive += 1
            else:
                false_positive += 1
        else:
            if expected_label == "ham":
                true_negative += 1
            else:
                false_negative += 1
    else:
        print(f"API Error for message: {message}. Status: {response.status_code}")

total = len(dataset)
if total > 0:
    accuracy = 100 * (true_positive + true_negative) / total
    average_time = 1000 * (time() - t) / total
    print(f"Total messages: {total}")
    print(f"True Positive: {true_positive}")
    print(f"False Positive: {false_positive}")
    print(f"True Negative: {true_negative}")
    print(f"False Negative: {false_negative}")
    print(f"Accuracy: {accuracy:.2f}%")
    print(f"Average time: {average_time:.2f} ms")
else:
    print("The dataset is empty.")
Here’s the result with Python:
Total messages: 1113
True Positive: 157
False Positive: 5
True Negative: 940
False Negative: 11
Accuracy: 98.56%
Average time: 3.07 ms
And here is the one with Golang:
Total messages: 1113
True Positive: 159
False Positive: 28
True Negative: 917
False Negative: 9
Accuracy: 96.68%
Average time: 5.90 ms
The accuracy is very good in both cases.
Strangely, it is better with Python than with Golang. That's surprising, because multinomial naive Bayes is deterministic: given the same data, both implementations should produce the same results. The explanation is simply a difference in word tokenization.
In Python, I used the CountVectorizer provided by Scikit-learn. In Golang, I simply used strings.Fields. By default, CountVectorizer lowercases the text and uses r"(?u)\b\w\w+\b" as its token pattern, whereas strings.Fields splits on runs of whitespace (unicode.IsSpace), keeps punctuation attached to words, and accepts single-letter words.
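The difference is easy to see on an example (a quick sketch; the message is made up, and Python's str.split stands in for strings.Fields since both split on whitespace):

import re

message = "Hi! Win a FREE prize, reply YES now."

# Scikit-learn's default: lowercase, then tokens of 2+ word characters
print(re.findall(r"(?u)\b\w\w+\b", message.lower()))
# ['hi', 'win', 'free', 'prize', 'reply', 'yes', 'now']

# strings.Fields equivalent: whitespace split, punctuation and case kept
print(message.split())
# ['Hi!', 'Win', 'a', 'FREE', 'prize,', 'reply', 'YES', 'now.']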
Clearly, Scikit-learn's tokenization is more effective than mine.
We can also see that the Python version is almost twice as fast as my Golang implementation.
This seems to contradict everything I've said about Python in my previous benchmark articles. But it's easily explained: Scikit-learn relies heavily on NumPy, which is written in C and is obviously far more optimized than my Golang classifier written in a hurry. My code leans too much on maps; it's simple and readable, but not very efficient. In particular, Predict recomputes the total word count of each class for every word of the message. And on top of that, Golang's garbage-collected memory management is slower than C's manual management.
Looking at my previous benchmarks, it would be easy to jump to an extreme conclusion and ban Python altogether. But Python has excellent, well-optimized libraries, and it is sometimes a very good choice.
But nothing is ever simple. Let's take a look at how our two microservices handle load. I wrote this new script that floods the API with parallel requests.
#!/usr/bin/python3
import json
import asyncio
from time import time

from aiohttp import ClientSession

api_url = "http://localhost:8000/predict"

true_positive = 0
false_positive = 0
true_negative = 0
false_negative = 0

async def fetch(session: ClientSession, message: str):
    async with session.put(api_url, json={"message": message}) as response:
        return await response.json(), response.status

async def process_entry(session: ClientSession, entry):
    global true_positive, false_positive, true_negative, false_negative
    message = entry['message']
    expected_label = entry['label']
    result, status_code = await fetch(session, message)
    if status_code == 200:
        result_label = result.get("data")
        if result_label == "spam":
            if expected_label == "spam":
                true_positive += 1
            else:
                false_positive += 1
        else:
            if expected_label == "ham":
                true_negative += 1
            else:
                false_negative += 1
    else:
        print(f"API Error for message: {message}. Status: {status_code}")

async def main():
    with open('eval.json', 'r') as f:
        dataset = json.load(f)
    total = len(dataset)
    if total == 0:
        print("The dataset is empty.")
        return 0
    async with ClientSession() as session:
        # Fire every request at once and wait for all of them
        tasks = [process_entry(session, entry) for entry in dataset]
        await asyncio.gather(*tasks)
    accuracy = 100 * (true_positive + true_negative) / total
    print(f"Total messages: {total}")
    print(f"True Positive: {true_positive}")
    print(f"False Positive: {false_positive}")
    print(f"True Negative: {true_negative}")
    print(f"False Negative: {false_negative}")
    print(f"Accuracy: {accuracy:.2f}%")
    return total

if __name__ == "__main__":
    start_time = time()
    # main() returns the number of messages, so the average is per message,
    # not per line of the JSON file
    total = asyncio.run(main())
    end_time = time()
    if total:
        average_time = 1000 * (end_time - start_time) / total
        print(f"Average time: {average_time:.2f} ms")
Here’s the result with Python:
Total messages: 1113
True Positive: 157
False Positive: 5
True Negative: 940
False Negative: 11
Accuracy: 98.56%
Average time: 0.45 ms
And here is the one with Golang:
Total messages: 1113
True Positive: 159
False Positive: 28
True Negative: 917
False Negative: 9
Accuracy: 96.68%
Average time: 0.27 ms
This time, the result is reversed. Unsurprisingly, Python/Flask withstands the load less well than Golang/Gofr. In a real-life situation, it would be worth running a more in-depth comparison with a tool like wrk, to find the point where Golang becomes more interesting than Python.
To be honest, I sent more than a thousand requests in parallel, and that's not the right way to do things. If you have this kind of need in real conditions, it would be better to batch the requests, to avoid saturating the service with HTTP overhead.
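For example, the client could carry many messages per request, against a hypothetical /predict/batch route that neither service implements here (the route and the payload shape are assumptions):

import requests

# Hypothetical batch endpoint: one HTTP request carries many messages
batch = [entry['message'] for entry in dataset[:100]]
response = requests.put(
    "http://localhost:8000/predict/batch",  # assumed route, not implemented above
    json={"messages": batch},
)
print(response.json())  # e.g. {'data': ['ham', 'spam', ...]}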
But if you wanted to use Golang anyway, you would have to improve the word tokenization to reach maximum accuracy. Python simply requires less work than Golang, and that's why it is used so often. Especially since, for a microservice, you can easily spin up replicas once you reach saturation.