Inside GPTZero AI Detection
7 'components' behind our new AI detection model + GPTZero v2 API launch.
GPTZero just had our first-ever machine learning retreat at Niagara Falls, Canada!
The week was packed with code sprints and waterfalls, and we wanted to share some of what we cooked up, including delicious home-made meals and major ML updates for GPTZero.
Mission Open API
In February 2023, we launched GPTZero's API.
Many AI detectors, including much larger companies, never ended up releasing a publicly available API. Turn-it-in never turned-it on. Neither did OpenAI when they launched their basic AI classifier in February.
We wanted to operate differently because we want to grow GPTZero with transparency as a core value. Our vision for AI detection is of an entire community, across academia and industry, working together rather than alone to make critically needed progress in detection technology.
As a result, we're proud to say that a diverse range of researchers, educators, and users have integrated and tested GPTZero!
Non-profits like Coding it Forward applied it in their application cycle [1], shedding first light on AI use in student applications: AI-generated responses correlated with ones that human readers tagged as lacking personal anecdotes.
How Coding it Forward applied GPTZero's API in their application cycle
Researchers from Stanford, Harvard, Berkeley, and elsewhere have also integrated and tested our APIs, often validating GPTZero as one of the most robust and accurate AI detection models.
Other papers identified shortcomings in AI detection algorithms. One paper, for example, noted that AI detection algorithms, which rely on statistical and writing analysis, could be biased against ESL (English as a Second Language) writers [2].
This study motivated GPTZero to take action to eliminate the bias. We built a unique dataset of ESL writing and used it to build detection specific to ESL, adjusting our outputs to correct for this bias.
Our API version 2 updates include:
New user interface + Starter API code provided for some programming languages.
Ability for users to directly call the API on our web page, without a single line of code needed.
The GPTZero model version is now returned with each use of GPTZero; soon, users will be able to interact with model histories, including rolling back to and testing previous GPTZero model versions!
Building off of Academia
GPTZero's growing number of citations in academia motivates us to learn, incorporate feedback, and constantly improve the research behind detection.
As a general principle, our model evolves rapidly and is updated nearly every week! In February, we released one of the first-ever sentence highlighting models for mixed AI and human content. Today, our model combines standard and novel detection techniques to produce a more robust and accurate detector. Below, our ML team shares the 7 layers that make up the latest GPTZero model.
The most important takeaway from our retreat is that investing in a "deep learning" approach is worth it. Build a solid ML infrastructure and pipeline for developing detection, and the rest will follow.
7 Ingredients behind the GPTZero Model
1. Burstiness
First, we developed and now apply a "burstiness" check to analyze how closely the text resembles AI patterns of writing. A human-written document tends to change in style and tone throughout the text, whereas AI content remains similar throughout the document.
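As a rough illustration of the idea, the sketch below measures how much sentence length varies across a document. This is an assumption made for the example, not GPTZero's actual burstiness check: the function names `split_sentences` and `burstiness` are hypothetical, and sentence length is only a crude stand-in for "style and tone".

```python
import re
import statistics


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths, counted in words.

    Illustrative proxy only: higher values mean the writing varies more
    from sentence to sentence.
    """
    lengths = [len(s.split()) for s in split_sentences(text)]
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


uniform = "The cat sat down. The dog ran off. The bird flew by."
varied = (
    "Stop. The storm rolled in over the hills long before anyone "
    "in the village had woken up. We ran."
)

print(burstiness(uniform), burstiness(varied))
```

A document with very even sentences scores near zero here, while one that mixes short and long sentences scores higher. A production check would likely look at richer signals than length alone.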
2. GPTZeroX
GPTZeroX is a sentence-by-sentence highlighting and classification model we developed in March, as one of the first models that allows mixed-text highlighting. This model analyzes each sentence in the text in the context of the whole document, and determines the probability that each sentence was created by AI.
3. Perplexity
Our perplexity test reverse-engineers the generative AI model. We've developed an AI model similar to ChatGPT. After each word in the text, our model suggests which word is likely to come next, and then checks whether those suggestions match what is actually in the text. The more closely the text matches our suggestions, the more likely it is to have been generated by AI.
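To make the next-word idea concrete, here is a minimal sketch that swaps the ChatGPT-like model for a tiny bigram model with add-one smoothing. The `BigramModel` class and the toy corpus are assumptions for illustration only. The principle is the same: text the model predicts well gets a low perplexity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class BigramModel:
    """Toy next-word model: counts which word follows which."""

    def __init__(self, corpus: str):
        tokens = tokenize(corpus)
        self.vocab = set(tokens)
        self.following = defaultdict(Counter)
        for prev, nxt in zip(tokens, tokens[1:]):
            self.following[prev][nxt] += 1

    def prob(self, prev: str, nxt: str) -> float:
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        counts = self.following.get(prev, Counter())
        total = sum(counts.values())
        return (counts[nxt] + 1) / (total + len(self.vocab) + 1)

    def perplexity(self, text: str) -> float:
        tokens = tokenize(text)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            raise ValueError("need at least two tokens")
        log_prob = sum(math.log(self.prob(p, n)) for p, n in pairs)
        # Perplexity is exp of the average negative log-probability.
        return math.exp(-log_prob / len(pairs))


model = BigramModel("the cat sat on the mat and the cat sat on the rug")
predictable = model.perplexity("the cat sat on the mat")
surprising = model.perplexity("mat the on sat cat the")
print(predictable, surprising)
```

The word order the model has seen before yields the lower perplexity. A real detector would use a large neural language model in place of bigram counts.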
4. Education Module
We offer an education model option that is fine-tuned for student content. Our education model has been trained with data that includes more student work than our regular model, increasing the accuracy of detecting AI for educational purposes. This model is also well suited to differentiating ESL writing from AI-written text.
5. Internet Text Search
This part of our model checks whether the text already exists in text and internet archives. In contrast to other AI-detection services, we ensure that commonly used texts are not misclassified.
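One common way to check whether a passage already exists elsewhere is to compare overlapping word n-grams ("shingles"). The sketch below assumes a small in-memory dictionary in place of a real text or internet archive. The names `shingles`, `containment`, and `best_archive_match` are hypothetical and do not describe GPTZero's implementation.

```python
import re


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    # Lowercased word n-grams; punctuation is ignored.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(query: str, source: str, n: int = 5) -> float:
    """Fraction of the query's shingles that also appear in the source."""
    q = shingles(query, n)
    if not q:
        return 0.0
    return len(q & shingles(source, n)) / len(q)


def best_archive_match(query: str, archive: dict[str, str], n: int = 5):
    """Return (document name, score) for the closest archived document."""
    best_name, best_score = None, 0.0
    for name, body in archive.items():
        score = containment(query, body, n)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Stand-in archive; a real system would query a large index instead.
archive = {
    "gettysburg": (
        "Four score and seven years ago our fathers brought forth "
        "on this continent a new nation"
    ),
    "pangram": "The quick brown fox jumps over the lazy dog near the river bank",
}

print(best_archive_match("Four score and seven years ago our fathers brought forth", archive))
print(best_archive_match("An entirely original sentence written just for this test", archive))
```

A high containment score suggests the text is a known, pre-existing passage, which is a reason to avoid labeling it as AI-generated.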
6. GPTZero Shield
We like to call this new component of our model GPTZero Shield: essentially a layer that defends against other tools looking to exploit AI detectors. We maintain a database of the most common methods to "bypass" AI detection, such as homoglyph and spacing attacks.
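As a hedged sketch of what defending against homoglyph and spacing attacks can look like, the snippet below normalizes text before it reaches a detector. The lookalike table is a tiny hand-picked sample of Cyrillic letters, and `normalize` is a hypothetical helper; neither reflects the contents of GPTZero's database.

```python
import unicodedata

# Illustrative sample only: Cyrillic letters that look like Latin ones.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
    "\u0456": "i",  # Cyrillic Byelorussian-Ukrainian i
}

# Invisible characters sometimes inserted to break up words.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize(text: str) -> str:
    """Undo simple homoglyph and spacing tricks before detection."""
    # NFKC folds compatibility forms (e.g. full-width letters, NBSP).
    text = unicodedata.normalize("NFKC", text)
    cleaned = []
    for ch in text:
        if ch in ZERO_WIDTH:
            continue  # drop invisible characters
        cleaned.append(HOMOGLYPHS.get(ch, ch))
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(cleaned).split())


print(normalize("p\u0430ss\u200bword"))
print(normalize("hel\u200blo   w\u043erld"))
```

Running the detector on the normalized text, and noting whether normalization changed anything, is one simple way to keep such tricks from hiding AI-generated content.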
7. Deep Learning
Lastly, we're using an end-to-end deep learning approach, trained on massive text corpora from the web, education datasets from our partners, and our own synthetic AI datasets generated from a range of language models, including most recently Llama (developed by Facebook) and GPT-4. Our deep learning approach is a long-term investment that differentiates us from the majority of AI detectors, building a more robust model that can improve as AI improves.
Thanks for reading, from the GPTZero ML team :)
[1] https://blog.codingitforward.com/2023-application-cycle-round-up-3d19e653902b
[2] https://arxiv.org/pdf/2304.02819.pdf