Using GPT-3 in Contract Review: Maigon Example
The original version of this article was published on LinkedIn.
With all the buzz surrounding ChatGPT, many are asking if we are using a large language model in Maigon.
The answer is yes: we make use of GPT-3* (we wouldn’t be a state-of-the-art legal AI company otherwise). But this usage is very measured.
Local models
See, all our models (called, interchangeably, neural networks, deep learning models, AI models, or simply AI) that make automated legal review possible are trained internally, on local hardware and on data that we collect and label ourselves. All such models focus on very specific tasks and have various architectures. (Most of the models are based on the best-in-class Transformer architecture, the same architecture that GPT-3 is built upon.) For instance, one model handles clause extraction, another handles named entity recognition (NER), a third handles entity classification, a fourth measures the semantic similarity of clauses to the Gold Standard, and so on. Therefore, each model excels in its own “domain” and, after executing a task, “passes” the result over to the next model. All such models are local, i.e. owned, controlled, and hosted by us. In simpler terms, our contract review platform is a carefully orchestrated array of AI models, each doing its own thing with “surgical precision” and each contributing to the ultimate value, i.e. the contract review result.
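To make the orchestration idea concrete, here is a minimal, purely illustrative sketch of such a pipeline. The function names and stub logic are hypothetical placeholders, not Maigon's actual implementation; each stub stands in for a separately trained, locally hosted model.

```python
from typing import Dict, List

# Purely illustrative: each function stands in for a separate, locally hosted
# Transformer model trained on internally labeled data. None of these names
# or implementations reflect Maigon's actual code.

def extract_clauses(contract_text: str) -> List[str]:
    # In production: a text-classification model that splits the contract
    # and labels each fragment by clause type.
    return [p.strip() for p in contract_text.split("\n\n") if p.strip()]

def extract_entities(clause: str) -> List[str]:
    # In production: a named entity recognition (NER) model.
    return [token for token in clause.split() if token.istitle()]

def gold_standard_similarity(clause: str) -> float:
    # In production: a semantic-similarity model scoring the clause against
    # the Gold Standard reference clauses.
    return 0.0

def review_contract(contract_text: str) -> List[Dict]:
    # Orchestration: each model does one narrow task and passes its result
    # on to the next, building up the contract review result.
    report = []
    for clause in extract_clauses(contract_text):
        report.append({
            "clause": clause,
            "entities": extract_entities(clause),
            "gold_similarity": gold_standard_similarity(clause),
        })
    return report
```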
Large language model
A large language model (or simply LLM) is not specific in nature. It’s broad: it can handle a variety of different tasks out of the box, without any additional training. The more parameters it has, the “deeper” it is. State-of-the-art LLMs like GPT-3 are both very broad and very deep, which means they can perform quite well at many diverse tasks at once. GPT-3 is only available via a commercial API, not just because its code is proprietary and not open-source, but also because the amount of compute it takes to keep the model running and the extent of resources required for its maintenance are simply unavailable to SMEs like Maigon. While it is possible to “fine-tune” GPT-3 on your own data to make it better at a very specific task, doing so results in just a “fork” of the LLM, which is still hosted by a 3rd party and is ultimately not owned or controlled by you. Without such “fine-tuning”, the LLM relies on its general knowledge and language understanding to perform the task. Compared to the Maigon platform, which is a set of precise review tools operating on limited contexts, GPT-3 is more like a general-purpose hammer.
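For readers unfamiliar with what such “fine-tuning” involved in practice: at the time of writing, OpenAI's fine-tuning endpoint expected training data as JSONL prompt/completion pairs uploaded to its servers. The snippet below only sketches that data-preparation step with invented example pairs; the resulting fine-tuned model would still live on OpenAI's infrastructure.

```python
import json

# Invented examples, for illustration only: GPT-3 fine-tuning expected a JSONL
# file of prompt/completion pairs. The fine-tuned model remains hosted by
# OpenAI, i.e. a "fork" you do not own or control.
examples = [
    {"prompt": "Clause: The Processor may engage sub-processors listed in Annex 3.\nLabel:",
     "completion": " sub-processing"},
    {"prompt": "Clause: This Agreement shall be governed by the laws of Sweden.\nLabel:",
     "completion": " governing-law"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```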
Pros and cons of GPT-3
When deciding whether to use GPT-3 in the Maigon contract review suite, we first had to ask ourselves: what is the added value of using GPT-3 in the document review process? The answer to this question, in turn, required understanding the advantages of GPT-3 over our own models and vice versa. Not just from a neutral architectural standpoint, but, more importantly, from the perspective of practical usefulness, i.e. being capable of enhancing our solutions and “making a difference”.
The obvious advantage is the sheer size and scale of GPT-3, which makes it better at drawing nuanced insights from complex legal data and “connecting the dots” within large troves of unstructured information (which contracts often are). Another benefit is the model’s ability to perform quite well without any additional training or “fine-tuning”. Finally, there is no hardware or maintenance overhead: GPT-3 is hosted and maintained by a 3rd party (OpenAI).
As regards disadvantages, the most apparent one is its 3rd-party, proprietary nature. We do not have control over the availability and quality of the model. This results in inconsistency in both accuracy and response latency: for instance, the same question can be answered with Answer X (correct) in 1 second or Answer Y (incorrect) in 15 seconds, depending on the current state of the model, which we do not control. Moreover, there is no “fine-tuning” flexibility to even remotely the same extent as with our own models, which makes it hard or impossible to use the LLM with very specific datasets. Being API-only, GPT-3 is a closed “black box” with no access to its architecture and parameters.
Nevertheless, all these comparisons aside, what ultimately matters is how capable the model is at solving problems in practice. This leads us to the next question.
Does GPT-3 deliver “next-level” performance in legal review?
Does GPT-3 provide exceptional performance at contract review tasks, compared to a mixture of smaller Transformer models? Can it make a real difference?
The short answer is no, at least not within the contract review setting. Locally hosted smaller models, trained on carefully labeled data and validated by customer feedback, are hard to beat at the highly specific tasks they were trained to perform. Even when GPT-3 is “fine-tuned” on the same data, the accuracy is nearly identical, which makes using the LLM impractical given the above-described disadvantages, such as the lack of ownership and control over the resulting model. This proved to be true for such critically important stages of contract review as text classification (clause extraction), span categorization, and NER (both used for legal concept extraction).
However, GPT-3 is still very useful in some cases.
GPT-3 use cases
Let’s say we have a complex logical problem to solve, such as:
- In the provided sub-processing section of the DPA, detect any sub-processors whose activities are the same or very similar. If there are such sub-processors, determine whether all of them are located within the EU/EEA or only some of them are.
For the purposes of this example, the goal is to determine whether there are “sub-processor duplicates” and whether any of them have a “clone” in a third country.
A normal approach would be to split the problem into several sub-tasks, build a separate “single-facet” model for each sub-task, then write some logic that “orchestrates” the models, bundles the results, and outputs a single answer to the complex, multi-faceted problem. Doing so requires time spent on data collection, model architecture selection, model training, integration, and maintenance, in addition to the extra compute resources required to run such models. (Sometimes, however, this approach is the only option: for instance, when we need a level of granularity and customization that is impossible to achieve with GPT-3, or when we are dealing with a more complex problem.)
GPT-3, on the other hand, allows us to attempt to answer the above question right away, without any data collection or model training. To do that, we need to supply the LLM with a very specific “prompt”, which includes an instruction on what to do (e.g. the above problem description) and some context for the action (e.g. the relevant sub-processing fragment from the DPA). Based on the provided information, GPT-3 will attempt to solve the problem (i.e. provide an answer), and it will rely on its general knowledge and language understanding when doing so.
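As a rough sketch of what such a prompt looks like in code, the snippet below uses the openai Python package's (pre-1.0) Completion endpoint with text-davinci-003. The fragment variable, the exact wording, and the parameter values are illustrative assumptions, not Maigon's production code.

```python
import openai  # legacy (pre-1.0) openai package; reads OPENAI_API_KEY from the environment

# Placeholder: in practice this fragment is extracted from the DPA by a local model.
sub_processing_fragment = "..."

prompt = (
    "In the provided sub-processing section of a DPA, detect any sub-processors "
    "whose activities are the same or very similar. If there are such "
    "sub-processors, state whether all of them are located within the EU/EEA "
    "or only some of them are.\n\n"
    f"Sub-processing section:\n{sub_processing_fragment}\n\nAnswer:"
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=256,
    temperature=0,  # keep the output as deterministic as possible for review tasks
)
answer = response["choices"][0]["text"].strip()
```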
In this particular example, the LLM’s chances of arriving at a correct conclusion are reasonably good, considering the depth of the model and its ability to “connect the dots”, as described above. The provided answer is not guaranteed to be correct, however: like any AI, even the largest models make mistakes sometimes. It is all about “acceptable accuracy”: if we conclude during testing that answers to the above problem are generally correct, we can use GPT-3 for solving this task. This way, we save the time and resources that we would otherwise have spent on training and maintaining our own models.
But whether the accuracy is good enough largely depends on the complexity of the problem at hand. If we make the above problem a couple more “layers” deep, the LLM will, in most instances, get “confused” and the accuracy will no longer be acceptable. The type of problem also matters: language models (including LLMs) are known to be quite bad at mathematical tasks, so performing, say, tax calculations with a model like GPT-3 would produce incorrect results most of the time.
It all comes down to testing the LLM at solving various problems and determining where the accuracy level is acceptable for the model’s insights to be integrated into the client-facing solution.
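In practice, such testing can be as simple as a small evaluation harness: run the prompt over an internally labeled set of fragments and measure how often the LLM's answer matches the expected one. The sketch below is illustrative; ask_llm stands for a wrapper around the Completion call above, and the 90% threshold is an arbitrary example, not Maigon's actual acceptance criterion.

```python
from typing import Callable, List, Tuple

def evaluate_llm(test_set: List[Tuple[str, str]],
                 ask_llm: Callable[[str], str],
                 acceptable_accuracy: float = 0.9) -> Tuple[float, bool]:
    # test_set: (document fragment, expected answer) pairs labeled internally.
    # ask_llm: builds the prompt around the fragment and returns the model's answer.
    correct = 0
    for fragment, expected in test_set:
        if ask_llm(fragment).strip().lower() == expected.strip().lower():
            correct += 1
    accuracy = correct / len(test_set)
    # Only integrate the LLM's insights into the client-facing solution
    # if the measured accuracy is acceptable for the problem at hand.
    return accuracy, accuracy >= acceptable_accuracy
```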
GPT-3 is “asking for directions”
Nonetheless, when GPT-3 is used, in most cases it still relies on the local AI models, since the prompt needs to be supplied with a specific fragment of the document extracted by a local AI model (in the above example, the sub-processing section of the DPA). We cannot simply send the full contract to the LLM due to the limit on the length of text (number of tokens) that can be submitted at once. And GPT-3, unlike ChatGPT, does not “remember” previous submissions, so sending the contract “in parts” is not an option either. Therefore, asking the local models “for directions” is often necessary when making use of the LLM.
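Put as a sketch, the “asking for directions” pattern looks roughly like this. The extractor stub, the token heuristic, and the roughly 4,000-token budget are assumptions for illustration; in production a proper tokenizer and the model's actual context limit would be used.

```python
MAX_PROMPT_TOKENS = 3000  # illustrative budget, assuming a context window of roughly 4,000 tokens

def rough_token_count(text: str) -> int:
    # Crude heuristic (~1.3 tokens per word); a real tokenizer would be used in practice.
    return int(len(text.split()) * 1.3)

def extract_subprocessing_section(contract_text: str) -> str:
    # Stand-in for the local extraction model that "gives directions":
    # only the relevant fragment is passed on, never the whole contract.
    return contract_text  # placeholder

def answer_subprocessing_question(contract_text: str, ask_llm) -> str:
    fragment = extract_subprocessing_section(contract_text)
    if rough_token_count(fragment) > MAX_PROMPT_TOKENS:
        raise ValueError("Fragment exceeds the single-submission token budget")
    # Only the extracted fragment goes into the GPT-3 prompt.
    return ask_llm(fragment)  # e.g. the Completion call sketched earlier
```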
The best of both worlds
The ultimate question is not whether to use GPT-3 (or any LLM, for that matter), but rather to what extent it should be used. A large language model is capable of providing correct insights into complex problems quickly, without any “fine-tuning” involved, which saves development time and frees up resources for more important tasks. At the same time, the lack of ownership and control over the LLM and its availability, its performance instability, and its limited flexibility make using the model’s API for critically important tasks inadvisable.
This is the reason why the contract review process in Maigon is handled by our custom models locally, while GPT-3 is used for additional insights that are often very helpful but not essential for the end result (i.e. contract compliance report). For a state-of-the-art contract review platform like Maigon, combining “the best of both worlds” is the best way to go forward.
* There are different versions of GPT-3. All mentions of GPT-3 in this article refer to the most powerful version of the model, “text-davinci-003”, released in November 2022.
Sergii Shcherbak, CTO @ Maigon