What is Natural Language Processing, and how does it work?
Context, nuance, negation and colloquialism: How Xapien can read like a human
At Xapien, we prioritise simplicity and transparency. Our platform is easy-to-use and easy-to-read. We always display the sources for the information found in our reports. Although the reports are created using complex, cutting-edge technology, we strive to present our findings in a simple and easy-to-digest way.
In that spirit, this series of blogs will break down how our Explainable AI technology works, as simply as possible. The goal is for you to understand how and why you receive the information presented in our reports. This will help you set your due diligence, compliance or know your customer processes in context and, if necessary, justify them to colleagues, customers or regulators.
What is Explainable AI?
Explainable AI (XAI) is a subdivision of AI that makes the decisions and predictions produced by AI models interpretable and transparent to humans. The goal of XAI is to create AI systems that can clearly justify how they arrived at a certain decision or prediction, rather than just providing a mysteriously-formed output. It is the AI-equivalent of ‘showing your working.’
We use it because it helps build trust in AI systems and makes them suitable for use in subtle applications such as background research and due diligence.
This article will focus on Natural Language Processing (NLP).
NLP refers to the way computers process human language. It involves understanding and generating language, text classification, sentiment analysis, translation, and speech recognition. It is what allows Xapien to mimic and implement the micro-mannerisms that you as a human do automatically when you read text.
Here at Xapien, we have built a unique system that harnesses the linguistic knowledge embedded deep within Large Language Models, a seminal development within NLP. This technology enables our AI to read text like a human by using attention mechanisms to weigh the importance of different words in a sentence. It can understand the context and meaning of passages of text. It can focus on certain parts of the input, such as specific words or phrases, while disregarding others that are less important, just like you would when you read an article.
The difference is the rate at which Xapien can process this text – it can read thousands of times faster than any human, and can read in 187 languages.
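To make the idea of attention a little more concrete, here is a minimal Python sketch of how a set of relevance scores can be turned into weights that say how much each word matters. The words and scores are invented for illustration. In a real Large Language Model the scores are learned from data rather than written by hand, and this is not a description of how our own system is implemented.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for each word, relative to the word "charged".
words = ["He", "was", "charged", "with", "fraud"]
scores = [0.5, 0.1, 2.0, 0.2, 3.0]

weights = softmax(scores)

# The word with the largest weight is the one the model "attends" to most.
most_important = words[weights.index(max(weights))]
```

With these made-up scores, ‘fraud’ receives the largest weight, which is the kind of signal that tells a model that ‘charged’ is being used in its legal sense.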
Our NLP engine is called Fluenci. It extracts data from unstructured text, such as blogs, news articles, corporate records and public social media posts, processes it, understands it, and presents it for you in a report.
Our NLP engine loads the content from an article, extracting and formatting the key textual content plus some key images, and translating the text where necessary, so that it can be analysed.
2. Named entity recognition
Named Entity Recognition is a subtask of NLP concerned with locating and classifying mentions of ‘named entities’ in text. For Xapien, this means identifying references to key real-world objects relevant to a report such as people, organisations, locations, date/time mentions, and categorising them accordingly. Access to this information is crucial for determining whether an article is relevant to the subject, but it can be tricky – e.g. having to distinguish ‘Paris’ the place from ‘Paris’ the name of a person.
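As a rough illustration of the ‘Paris’ problem, the Python sketch below guesses whether a name is a person or a place by looking at the words either side of it. The cue words and the function name `classify_mention` are our own inventions for this example. Production systems learn these patterns from large amounts of text rather than relying on hand-written lists.

```python
import re

# Hypothetical cue words; a real system would learn such patterns from data.
LOCATION_CUES = {"in", "to", "from", "near"}
PERSON_CUES = {"said", "told", "met", "mr", "ms", "mrs"}

def classify_mention(sentence, name):
    """Guess whether `name` is a PERSON or a LOCATION from its neighbouring words."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", sentence)]
    if name.lower() not in tokens:
        return None  # the name does not appear in this sentence
    i = tokens.index(name.lower())
    before = tokens[i - 1] if i > 0 else ""
    after = tokens[i + 1] if i + 1 < len(tokens) else ""
    if before in LOCATION_CUES:
        return "LOCATION"
    if before in PERSON_CUES or after in PERSON_CUES:
        return "PERSON"
    return "UNKNOWN"

place = classify_mention("She flew to Paris on Monday.", "Paris")
person = classify_mention("Paris said the deal was fair.", "Paris")
```

Here the first sentence is labelled as a location because of the word ‘to’, and the second as a person because of the word ‘said’.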
3. Risk identification
This is best understood in practice, so we’ll now take you through some of the ways Xapien’s NLP technology can read like a human does.
– To shoot hoops.
– To shoot a man.
– He was unable to charge his phone.
– He was charged with fraud.
– She poached an egg.
– She poached an elephant.
In these examples, it is clear to you that only the second sentence in each pair poses a risk. This is something that some AI struggles with, but we have trained Xapien to distinguish between words and phrases that have different meanings depending on context. We call this ability to understand context ‘word-sense disambiguation’.
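A very simplified way to picture word-sense disambiguation is to give each meaning of a word a handful of cue words, then pick the meaning whose cues overlap most with the sentence. The Python sketch below does exactly that. The sense lists and the `disambiguate` function are assumptions made for this example, and a trained model handles far more subtle cases than a hand-written list can.

```python
# Hypothetical sense inventory: each sense lists cue words and whether it signals risk.
SENSES = {
    "shoot": [
        {"meaning": "throw a ball at a basket", "cues": {"hoops", "basket", "ball"}, "risk": False},
        {"meaning": "fire a weapon at someone", "cues": {"man", "woman", "victim", "gun"}, "risk": True},
    ],
    "poached": [
        {"meaning": "cooked gently in water", "cues": {"egg", "salmon", "pear"}, "risk": False},
        {"meaning": "hunted illegally", "cues": {"elephant", "rhino", "ivory"}, "risk": True},
    ],
}

def disambiguate(word, sentence):
    """Pick the sense of `word` whose cue words overlap most with the sentence."""
    context = set(sentence.lower().replace(".", "").split())
    return max(SENSES[word], key=lambda sense: len(sense["cues"] & context))

egg = disambiguate("poached", "She poached an egg.")
elephant = disambiguate("poached", "She poached an elephant.")
```

In this toy version, ‘She poached an egg’ resolves to the cooking sense and ‘She poached an elephant’ to the illegal hunting sense, so only the second is marked as a risk.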
– John provided vital evidence when his boss was accused of fraud.
– John was accused of fraud.
Xapien can also understand whether a risk implicates your subject or not. If it does not directly implicate your subject, we mark it as an ‘indirect risk’.
Xapien compares every mention of a company, location, person and phrase against wider knowledge on the internet. It knows that Marlboro is a brand of cigarettes, which are a type of tobacco, and can flag a mention as a risk.
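The Marlboro example can be pictured as following a chain of ‘is a kind of’ links until a risk category is reached. The Python sketch below shows the idea with a tiny hand-written lookup. The `BROADER` table, the `RISK_CATEGORIES` set and the `risk_category` function are hypothetical, and a real knowledge base would be vastly larger and would not be maintained by hand.

```python
# Hypothetical "is a kind of" links, for illustration only.
BROADER = {
    "Marlboro": "cigarettes",
    "cigarettes": "tobacco",
}

# Hypothetical set of categories that count as a risk.
RISK_CATEGORIES = {"tobacco"}

def risk_category(term):
    """Follow 'broader concept' links upwards until a risk category is found."""
    seen = set()  # guards against accidental loops in the links
    while term is not None and term not in seen:
        if term in RISK_CATEGORIES:
            return term
        seen.add(term)
        term = BROADER.get(term)
    return None

marlboro_risk = risk_category("Marlboro")
bicycle_risk = risk_category("bicycle")
```

A mention of ‘Marlboro’ is traced to ‘tobacco’ in two steps, while a term with no links to a risk category, such as ‘bicycle’, returns nothing.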
Our new feature, ‘Hideable risk’, allows you to select the risks that are relevant to you. For example, the tobacco industry might be a risk to a cancer charity, but not to another organisation.
However, names are not unique identifiers. When you read an article and decide whether a mention of a person or company refers to the one you are looking for, you look at the other people, sectors, topics or organisations mentioned to help you decide whether the ‘Lucy Smith’ being described is the one you are investigating.
Our NLP technology models every piece of information as ‘possibly’ true, capturing where it came from and our confidence in it. It then weaves a unique knowledge tapestry of your subject and the people, companies, events and concepts that relate to them, tying together oblique references to your subject and the topics relating to them.
Articles about other people with the same name are rejected, enabling you to focus on the real risks about your subject.
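One simple way to picture this matching process is to compare the context around a name in an article with what is already known about your subject, and to keep a confidence score and the source for each decision. The Python sketch below uses a basic overlap measure (Jaccard similarity) to do so. The subject profile, article names, threshold and the `Mention` class are all invented for this example, and the overlap measure is a stand-in chosen for simplicity rather than a description of our own method.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    source: str         # where the mention was found (placeholder names here)
    context: frozenset  # people, places and topics mentioned alongside the name

def match_confidence(subject_context, mention):
    """Jaccard overlap between the subject's known context and the mention's context."""
    union = subject_context | mention.context
    if not union:
        return 0.0
    return len(subject_context & mention.context) / len(union)

# Hypothetical profile of the 'Lucy Smith' being researched.
subject_context = frozenset({"acme ltd", "london", "accountancy"})

mentions = [
    Mention("article_a", frozenset({"acme ltd", "london", "audit"})),
    Mention("article_b", frozenset({"texas", "rodeo", "cattle"})),
]

THRESHOLD = 0.3  # assumed cut-off, chosen only for this example

# Keep each accepted source together with the confidence behind the decision.
accepted = {
    m.source: match_confidence(subject_context, m)
    for m in mentions
    if match_confidence(subject_context, m) >= THRESHOLD
}
```

In this example, the first article shares two of four context items with the subject and is kept, while the second shares none and is rejected as being about a different Lucy Smith.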
Nuance refers to the context in which different words and phrases are used. For example, a sarcastic ‘Great!’ is very different from a sincere ‘Great!’.
As humans, we have absorbed a huge amount about the world around us and notice subtle differences in tone, emphasis and word choice. We use this knowledge to decipher ambiguity in what we read, often without thinking.
Machines have limited real-world knowledge. However, our NLP technology finds and reads supporting information on the internet, allowing it to enrich every nuanced phrase and set it in real-world context.
Xapien can do this in over 130 languages. All content is translated into English and scanned for regulatory, reputational and values-led risks.
Negations are words or phrases that indicate that something is not true or does not exist, such as ‘not’ or ‘never’.
He was convicted. He was not convicted.
The two sentences have totally opposing meanings, but both contain the word ‘convicted’. Some AI models would flag both sentences as a risk, requiring you to carry out further research.
Xapien can identify each mention of your subject within an article and examine the words linguistically associated with that mention. It can understand what each word really means, enabling it to flag when there is genuine risk that relates to your subject. If the risk does not directly implicate your subject, we mark it as an ‘indirect risk’.
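The difference between ‘convicted’ and ‘not convicted’ can be shown with a toy Python example that only flags a risk word when no negation appears in the few words before it. The word lists, the window size and the `has_genuine_risk` function are assumptions made for this sketch. Real language is much messier than this, which is why a simple rule like this one is only a starting point.

```python
# Hypothetical word lists, for illustration only.
NEGATION_CUES = {"not", "never", "no"}
RISK_TERMS = {"convicted", "charged", "accused"}

def has_genuine_risk(sentence, window=3):
    """Flag a risk term unless a negation cue appears within `window` words before it."""
    tokens = sentence.lower().replace(".", "").split()
    for i, token in enumerate(tokens):
        if token in RISK_TERMS:
            preceding = tokens[max(0, i - window):i]
            if not NEGATION_CUES & set(preceding):
                return True  # risk term with no nearby negation
    return False

plain = has_genuine_risk("He was convicted.")
negated = has_genuine_risk("He was not convicted.")
```

The first sentence is flagged and the second is not, even though both contain the word ‘convicted’.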
Colloquialisms are informal expressions that are often spoken aloud but rarely found in formal writing. They might, however, appear on social media.
‘Mug’
In British slang, this can refer to a naïve or gullible person, or a cup for hot drinks
‘Her Majesty’s Pleasure’
In British slang, this refers to being in jail
This informal, often rapidly evolving language can be difficult for AI models to understand because it deviates from standard lexical rules and can be excluded from traditional dictionaries. However, it can provide vital information about a subject. By drawing on diverse text datasets, we have trained Xapien to recognise colloquial expressions in multiple languages, understand their contextual meaning, and highlight any risks.
Xapien’s Natural Language Processing extracts, reads and links information just like a human would, but at unparalleled speed and scale. Save hours or days of searching and gain a unique understanding of your subject.