Best way to detect GPT-generated (AI) text using Python.

One approach to detecting GPT-generated text using Python is to train a text classifier to distinguish between GPT-generated text and human-written text. Here are the general steps to do this:

  1. Obtain a dataset of text that contains both GPT-generated text and human-written text. You can either create this dataset yourself by generating text with GPT and collecting human-written text from various sources, or you can use an existing dataset such as OpenAI's GPT-2 output dataset or the Grover dataset.
  2. Preprocess the dataset by cleaning the text, removing any irrelevant information, and splitting the dataset into training and test sets.
  3. Train a text classifier on the preprocessed dataset. You can use various machine learning algorithms such as logistic regression, support vector machines, or neural networks to train the classifier.
  4. Evaluate the performance of the classifier on the test set to measure its accuracy, precision, recall, and F1 score.
  5. Use the trained classifier to detect GPT-generated text by feeding it with new text and predicting whether it was generated by GPT or written by a human.

Here’s an example code snippet using Python and the scikit-learn library to implement this approach:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset containing GPT-generated and human-written text
df = pd.read_csv('dataset.csv')

# Preprocess the dataset
df = df.dropna()
df = df[df['text'].apply(lambda x: len(x.split())) > 10] # Remove short texts
df['class'] = df['source'].apply(lambda x: 1 if x == 'gpt' else 0) # Assign class labels

# Shuffle the rows so both classes appear in each split, then split 80/20
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Convert text into numerical features using TF-IDF vectorization
vectorizer = TfidfVectorizer()
train_X = vectorizer.fit_transform(train_df['text'])
test_X = vectorizer.transform(test_df['text'])

# Train a logistic regression classifier
clf = LogisticRegression()
clf.fit(train_X, train_df['class'])

# Evaluate the performance of the classifier on the test set
test_y_pred = clf.predict(test_X)
accuracy = accuracy_score(test_df['class'], test_y_pred)
print('Accuracy:', accuracy)

# Use the trained classifier to detect GPT-generated text
new_text = 'This is a test sentence generated by GPT'
new_X = vectorizer.transform([new_text])
pred_class = clf.predict(new_X)[0]
if pred_class == 1:
    print('The text was generated by GPT')
else:
    print('The text was written by a human')

To train the model, your dataset must be structured like the following:

id  source  text
1   human   This is a sample text written by a human.
2   gpt     This is a sample text generated by GPT.
3   human   Another example of human-written text.
4   gpt     Another example of GPT-generated text.

In this format, each row represents a text sample and includes three columns:

  • id: A unique identifier for the text sample.
  • source: A categorical variable indicating whether the text was written by a human (human) or generated by GPT (gpt).
  • text: The actual text content of the sample.
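As a rough sketch, a file in this format could be produced with pandas as shown below. The file name dataset.csv matches the one the training script reads; the rows are the placeholder samples from above, and the quoting choice is an assumption made so that commas inside a text do not split it into extra columns.

```python
import csv
import pandas as pd

# Placeholder rows for illustration; replace them with your own collected texts.
rows = [
    (1, "human", "This is a sample text written by a human."),
    (2, "gpt", "This is a sample text generated by GPT."),
    (3, "human", "Another example of human-written text."),
    (4, "gpt", "Another example of GPT-generated text."),
]

df = pd.DataFrame(rows, columns=["id", "source", "text"])

# Quote every non-numeric field so commas inside the text stay within one column.
df.to_csv("dataset.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)

# Read the file back the same way the training script does.
loaded = pd.read_csv("dataset.csv")
print(loaded.columns.tolist())
print(loaded["source"].value_counts().to_dict())
```

Note that the training script drops texts of 10 words or fewer, so real samples need to be longer than these placeholders.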

Here are some example texts that could be included in this dataset:

id  source  text
1   human   I am going to the store to buy some groceries.
2   gpt     The store was packed with people, all pushing and shoving to get to the shelves first.
3   human   The sun was setting over the mountains, casting a warm orange glow over the landscape.
4   gpt     The mountains were tall and imposing, casting a dark shadow over the valley below.
5   human   She looked out the window and saw a beautiful rainbow in the sky.
6   gpt     The sky was filled with colors, each one more vibrant than the last.

The code above is an example of how to train a machine learning model to classify text as either GPT-generated or human-written.

The code loads a dataset that contains a mix of GPT-generated and human-written text, preprocesses the dataset, and then trains a logistic regression model using the scikit-learn library. The logistic regression model is a binary classifier that learns to distinguish between the two classes of text (GPT-generated or human-written) based on the features extracted from the text.

After training the model, the code evaluates its performance on a test set using the accuracy metric. Finally, the trained model can be used to classify new text as either GPT-generated or human-written by feeding it into the model and predicting its class.
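Accuracy alone can hide how the classifier behaves on each class. The sketch below shows one way to also compute the precision, recall, and F1 score mentioned in step 4, and to read a probability rather than a hard label for new text. The train and test sentences are made-up stand-ins for the real splits, so the printed numbers carry no meaning beyond demonstrating the calls.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

# Made-up labelled samples standing in for the real splits (1 = GPT, 0 = human).
train_texts = [
    "I am going to the store to buy some groceries.",
    "The store was packed with people, all pushing and shoving to get to the shelves first.",
    "The sun was setting over the mountains, casting a warm orange glow over the landscape.",
    "The mountains were tall and imposing, casting a dark shadow over the valley below.",
    "She looked out the window and saw a beautiful rainbow in the sky.",
    "The sky was filled with colors, each one more vibrant than the last.",
]
train_y = [0, 1, 0, 1, 0, 1]
test_texts = [
    "This is a sample text written by a human.",
    "This is a sample text generated by GPT.",
    "Another example of human-written text.",
    "Another example of GPT-generated text.",
]
test_y = [0, 1, 0, 1]

vectorizer = TfidfVectorizer()
train_X = vectorizer.fit_transform(train_texts)
test_X = vectorizer.transform(test_texts)

clf = LogisticRegression()
clf.fit(train_X, train_y)
test_y_pred = clf.predict(test_X)

# zero_division=0 avoids a warning if the classifier never predicts class 1.
precision, recall, f1, _ = precision_recall_fscore_support(
    test_y, test_y_pred, average="binary", pos_label=1, zero_division=0
)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")

# predict_proba returns one row per text; the column order follows clf.classes_.
gpt_column = list(clf.classes_).index(1)
proba = clf.predict_proba(test_X)[:, gpt_column]
for text, p in zip(test_texts, proba):
    print(f"{p:.2f}  {text}")
```

Reading the probability lets you choose your own threshold instead of the default 0.5 cut-off used by predict.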

Akalanka Ekanayake
Dilakshan Akalanka Ekanayake, better known as Akalanka Ekanayake, is a popular and skilled music editor and programmer based in Sri Lanka. He is a musical artist, cybersecurity researcher, software engineer, and the founder and CEO of Cyberscap. His passion for both music and tech has helped him achieve a lot in both industries. Some of the most notable projects he has worked on include Crimes of the Future, Hacker (2019), and France (2021).
