Broadly generalizing, the prominent foundational models were trained on enormous amounts of (supposedly) public and open-source data. However, two characteristics of the training inputs are important to keep in mind as you use these models: the currency of the training data, and the provenance of the content.
Let's take GPT-3 and GPT-4 as examples:
Both GPT-3 and GPT-4 have an updatability problem: their training data reportedly ends around September 2021, which can lead to inaccurate or incomplete responses, especially when you ask about more recent events or information. We're still a long way from any guarantee that a change in the world is reflected in model behavior.
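As a quick illustration, here is a minimal sketch of probing that cutoff by asking about a post-cutoff event. It assumes the `openai` Python package (v1+) and an `OPENAI_API_KEY` in the environment; the model name and prompt are illustrative choices, not a prescription:

```python
# Minimal sketch: probing a model's knowledge cutoff.
# Assumes the `openai` Python package (v1+) with OPENAI_API_KEY set in the
# environment; the model name and prompt below are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[
        {
            "role": "user",
            "content": "What were the most significant world events of 2023?",
        }
    ],
)

# A model whose training data ends in September 2021 will typically refuse,
# hedge, or hallucinate here rather than answer accurately.
print(response.choices[0].message.content)
```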
They also have an issue of provenance. In traditional web search, we can follow a result to the specific page that produced it and verify whether the information is correct. In contrast, when we use LLMs, we get only the output response. The model might provide a provenance string telling us where it supposedly found the information, but we still can't assess whether we can trust it.
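To make that contrast concrete, here is a hedged sketch of one weak mitigation: asking the model to cite a source URL, then checking only that the URL resolves. Note that a reachable URL tells us nothing about whether the page actually supports the claim, and the `ask_with_source` helper below is hypothetical:

```python
# Sketch: a weak provenance check. Even when the model emits a source URL,
# we can at best confirm the URL exists -- not that it supports the claim.
# The ask_with_source helper is hypothetical, for illustration only.
import re

import requests
from openai import OpenAI

client = OpenAI()

def ask_with_source(question: str) -> str:
    """Ask the model to answer and cite one supporting URL."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[
            {
                "role": "user",
                "content": f"{question}\nCite one supporting URL in your answer.",
            }
        ],
    )
    return response.choices[0].message.content

answer = ask_with_source("When was the transformer architecture introduced?")
urls = re.findall(r"https?://\S+", answer)

for url in urls:
    try:
        status = requests.head(url, timeout=5, allow_redirects=True).status_code
        print(f"{url} -> HTTP {status}")  # reachable != trustworthy
    except requests.RequestException as err:
        print(f"{url} -> unreachable ({err})")  # possibly hallucinated
```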