Amber: Fine tuning human and AI relationships
Introduction
Just before ChatGPT blew up, Nurobodi worked on an AI design and development sprint using OpenAI’s base model GPT-3 - essentially the foundation that ChatGPT is built on. This was a timely project that highlighted the importance of the ‘human in the loop’ working in partnership with the ‘ethical fine-tuning’ of an AI chatbot. In short, the Amber project reveals insights into how foundational GPT-3 base models synthesize responses that are far from perfect.
Working with seasoned RMIT researcher Dr. Lisa Dethridge, CSIRO’s Responsible Innovation Future Science Platform, and Data61 consultants, the Nurobodi team engaged in the design and development of a very different kind of AI chatbot. Of primary importance was investigating the “ethics” of AI programming in both process and product outcomes. We sought to identify key problems, barriers, and opportunities and to develop ethical solutions while demonstrating human-centered User Experience and User Interface design methodologies and low-cost, rapid prototyping of a minimum viable product.
Our hypothesis with Amber is that ethically informed AI assistance is possible if and only if developers learn to build models in accordance with responsible practices and values, bringing more mindfulness to the mechanics of design and development.
Amber - A Journey of Information with Substance
Why Amber? Nurobodi’s design process pathways focused on using particular enigmatic environmental characteristics as a central concept. We refined the general idea of an “ethical AI chatbot” into a more design-industry-facing brief: “a GPT-3 chatbot prototype for product designers/developers to interact with regarding the ethical use of AI within their industry”. To do this we first turned to nature for guidance and honed in on the substance of Amber.
1. Amber - Hard, translucent fossilised resin originating from extinct coniferous trees of the Tertiary period, often containing remnants of preserved flora and fauna. In this process, the design and development team are ultimately both the bugs preserved within the Amber and the process that ensures the Amber is transparent enough, pure enough, that we can see clearly what our bug-like nature looks like.
2. Amber - A yellowish/orangey light used as a cautionary signal between green for ‘go’ and red for ‘stop’. This connects to the importance of proceeding with caution as ethics are always contextual. The following image is taken from OpenAI’s traffic light safety system.
Amber as Research Process Documentation
Working in Mural and Notion for UX/CX development, and Pycharm and Google Colab for IDE prototyping, Nurobodi UX design and development focused on understanding and investigating the importance of the process and creation of highly specific datasets used in training and fine-tuning an AI chatbot towards ethically informed outcomes. This article offers a window into our practice-led research and development of a human-machine teaming process whereby humans train a chatbot assistant with a specific intention. Our specific design intention is to assist product designers or developers in the conduct of ethical or responsible innovation, particularly where AI is used in the design and development of sustainable and future-ready products or services.
Our research found that a key part of developing ethical ML models is the creation and the distinct qualities of the dataset used in training and fine-tuning: these qualities critically shape the bot’s biased outputs, which can be negatively, neutrally or positively aligned with the industry-based CX/UX context.
It is understood that the OpenAI GPT-3 base models are somewhat of an ethical ‘black box’ with regard to their foundational ML and NLP training datasets. This too is a form of underlying bias.
Both the research documentation (design/dev) and the proof of concept prototype of Amber are targeted towards illustrating the ontological design importance of humans within the loop of ML systems working collaboratively to clarify and optimise ethical product design and development.
We position the human user of the chatbot prototype not as a passive consumer of machine-generated information but as an active, collaborative participant in a dialogue with the machine and, by extension, with colleagues or clients. By giving the user an opportunity to critically compare and contrast a varying set of outputs, we coach them as an active subject-matter expert in the driver’s seat.
Amber as a Prototype
The Amber prototype is a demonstration of Nurobodi’s CX/UX process and product design and development of an AI chatbot. As a ‘fine-tuned’ GPT-3-based chatbot, Amber was able to demonstrate its ethically ‘fine-tuned’ abilities when compared to a base-model Davinci GPT-3 chatbot. The use case application of Amber is a proof of concept and MVP (minimum viable product) demonstrating the importance of mindfulness as a mechanic when approaching the design and development of industry-deployable dataset construction for AI chatbots, specifically as a virtual assistant to industrial product designers and developers.
We created the Amber prototype as a specialized tool to assist product designers/developers working to clarify ethical considerations and guidelines that currently relate to the use of Artificial Intelligence in their industry. A.I is becoming pervasive in industrial design practices at all points of a product’s development including emerging methodologies, techniques and design materials. It will likely soon be commonplace to see A.I not just as part of a program or optimisation process but also as our A.I assistants and even colleagues.
As Christoffersen and Woods point out, we may well envisage AI agents as “team players” who augment the human experience. This research project explores how the introduction of varying degrees of automation into a complex work environment may create a new kind of human-machine teaming; a "cognitive system of distributed human and machine agents and new artifacts." (Christoffersen and Woods, 2001, p.4)
Research and Development Overview
Early UX/UI Considerations
The comparison of two models is based on the concept that the product designer/developer can be informed by both a human-curated chatbot and a synthetically curated chatbot, i.e. 👩🔬 Amber vs 🧑🎨 Davinci. The prototype chatbot User Interface represents the outputs of not one but two GPT-3 chatbot models or engines, named Amber and DaVinci.
Rather than putting forward Amber as a singular source of knowledge, we have deliberately drawn user attention to the contrasting responses of two chatbot models on one interface. This critical User Experience factor focuses the user on the importance of human discussion, user discretion and discernment in evaluating design options or information put forward by the chatbot AI. Both models are presented on the same User Interface and provide two distinct-but-related answers to the same user question. The Amber model output is based on expert information about Ethics and Artificial Intelligence sourced from international guidelines on ethical AI. The Amber engine interface also reveals text from the “Match” or original source of its output material.
The simple but powerful dual User Interface of our prototype allows designers or developers to engage in a process of meta-communication; it fosters and refines communication about ethical AI communication design, encouraging a unique approach to the use of AI and data analytics. Rather than consuming information uncritically, users bring their lines of questioning and prompting into a highly discerning, collaborative process of building ethical self-awareness.
UX/UI design thinking in this case study offers a reflective opportunity to reconcile the difference between an ‘ethically’ trained AI (Amber) and a base model (Davinci). During development we anticipated that Amber’s responses would be more informed in areas relating to context derived from the core framework texts, but also that this may not always be the case, as a well-designed initial ‘prompt’ for Davinci can still produce high-quality results. Again, the intentional UX/UI design thinking in the research process and prototype is to highlight the importance of:
The framing of a question (mindfulness of the relationship between the ‘quality’ of the question and the quality of the answer)
Two outcomes (answers) in response to a user question that in turn leads to self reflection (or discussion) on the nature and manner of the construction of processes addressing underlying information frameworks. This in turn is linked to the use of data-sets as a fundamental approach to ethically fine-tuning GPT3 AI.
We wanted the prototype to demonstrate various aspects of User Interface and User Experience design. It is useful for AI developers to observe the responses from two different chatbot models in order to offer the user multiple perspectives. This dual UI (User Interface) invites the user to reconcile the difference between the two sets of informative outcomes. In addition, the interface reveals the “match” or the original source material that Amber was fine-tuned with. In this sense, the answer to the user’s question is a starting point, not an end point. The user must compare and evaluate both sets of answers (which can be lengthy) in order to draw conclusions or to enquire further. In this process the chatbot can virtually engage and coach the user; refining their ability to ask more informed questions. The dual responses of Amber and DaVinci provide a dynamic mirror for the user, encouraging a more rigorous and self-reflexive approach to the human-machine interface.
Early Stages of R&D
Early stages of R&D were focused on investigating the various approaches and applications of OpenAI fine-tuning.
Davinci has superior computational resources resulting in generally higher quality contextual and conversational logic at a higher cost per API call in comparison to other base models.
Curie, as a base model, is not as strong as Davinci when it comes to analysing complicated text. However, we planned to pursue Curie as the base model for Amber based on the hypothesis that a ‘fine-tuned’, Q&A-based Curie chatbot may outperform Davinci both economically and in the quality of ethically informed answers given in response to user input/questions.
The aim for R&D was to hone in on Amber’s emphasis: not conversational prowess in addressing contextual ethics within a chat framework, but rather performing highly at outputting ‘best practice’ ethical answers, based on its programmatic ability to extract context from particular bodies of text combined with its ability to deliver relevant answers (completions) to user questions.
Being able to compare responses from the conversationally more eloquent Davinci GPT-3 and the Amber model may offer a unique feature of the design thinking underpinning this research process: guidance from an ‘ethical directive’ is at best a starting point to be reconciled, not an endpoint to be fixated upon.
A Brief Note on Conditional Generation
A Conditional Generation dataset requires a list of prompts and a list of completions suited to the specific CX/UX needs. The image below captures a key area of our research process, seeking to answer the question, ‘Do we need to build a semantic matching program?’ Conditional Generation and Answers, as a means to establish a highly specific fine-tuned model based on core framework texts, may be the best approach to a more controllable and directable model capable of giving highly specific answers within a given field, and to future development beyond the scope of demonstrating Amber as an MVP.
💡 Conditional Generation initially seemed the most plausible way forward to explore ‘fine-tuning’. For the purposes of conversations regarding Amber, fine-tuning means specifically Conditional Generation when referencing OpenAI documentation; in general conversational terms, however, fine-tuning is used colloquially to describe shaping Amber’s model towards a specific purpose.
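For reference, OpenAI’s fine-tuning workflow at the time consumed a JSONL file of prompt/completion pairs. The sketch below shows what a minimal Conditional Generation dataset might look like; the pairs, separators and file name are illustrative placeholders, not entries from the Amber dataset.

```python
# Hypothetical sketch of a Conditional Generation (fine-tuning) dataset.
# Each JSONL line is one prompt/completion pair; the pairs below are
# illustrative placeholders, not entries from the Amber dataset.
import json

pairs = [
    {
        "prompt": "What should a designer disclose when deploying a chatbot?\n\n###\n\n",
        "completion": " Be transparent that users are interacting with a bot, and state its purpose and limitations. END",
    },
    {
        "prompt": "How should a training dataset be audited for bias?\n\n###\n\n",
        "completion": " Record the provenance of each source text and who curated it, and review the contexts it covers. END",
    },
]

with open("amber_finetune.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

# The resulting file would then be passed to OpenAI's fine-tuning tooling,
# e.g. `openai api fine_tunes.create -t amber_finetune.jsonl -m curie`.
```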
As we were focused on an industry ready MVP in this particular research project phase we moved on to testing and subsequent development using OpenAI Embeddings.
Embeddings. What are They?
💡 An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information-dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating point numbers, such that the distance between two embeddings in the vector space is correlated with the semantic similarity between the two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar. Our initial research into OpenAI’s recent Embeddings update revealed that Embeddings as an overall strategy, utilising the Similarity and Text Search models, showed great potential for Amber’s requirements.
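As a concrete illustration of the idea that vector distance tracks semantic similarity, here is a minimal sketch, assuming the OpenAI Python library as it existed during this sprint and the Babbage similarity engine referenced later in this article.

```python
# Minimal sketch: embed two texts and compare them with cosine similarity.
# Assumes the openai v0.x Python library and the Babbage similarity engine
# referenced elsewhere in this article; names are illustrative.
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(input=text, engine="text-similarity-babbage-001")
    return np.array(resp["data"][0]["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = embed("Designers should disclose when users are talking to a bot.")
b = embed("Be transparent about the use of chatbot technology.")
print(cosine_similarity(a, b))  # similar texts should score close to 1.0
```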
Embeddings Testing with Chunks
It was still important to perform a preliminary test of Embeddings functionality before proceeding to break the entirety of the core texts down into embedding chunks. In this example we only used a few manually generated chunks of core text for testing.
Manual chunking of core texts for Embeddings vs automation
This was also the point where we asked whether we should chunk texts manually or write a program to chunk automatically. In many ways, this juncture in the process was core to the line of inquiry questioning the importance of the role of the human in the loop in collaboration with automated processes.
Another caution to mention at this point, and in all discussions of emerging AI platforms, is the ever-present conflation and confusion around key terms used to describe programming protocols – prompts, questions, completions, answers, model/engine, tuning, data-processing and automation are all key terms generally used in loose and ambiguous ways by researchers and practitioners. This slippage in technical language may inevitably produce errors down the track. All of this is a hallmark of emergent or frontier technology and is flagged here as an interesting and important area for interdisciplinary discussion and research between subject matter experts and designers/developers.
A Brief Note On Over-Automation
Over-automation may lead to the erosion of ethical relationships between questions and answers. However, there is an important ongoing discussion to be had around the ability to programmatically optimise the contextualisation of core texts into the chunks that form the building blocks of ethical datasets.
While developing the Amber chatbot, we paid attention to how the semantics or meaning of various core source texts may be affected by our engagement with the GPT-3 chatbot engine. At one stage, we paused the process of automated data preparation and began “chunking” - the creation of “Embeddings” or key contextual information. The manual selection of chunks ensures a kind of “human-in-the-loop” who oversees the data preparation of each of the embeddings prior to the creation of the final dataset. We saw this as a way of auditing the context of the dataset in a human manner rather than leaving it to pure automation. The use of human-selected chunks as opposed to purely automated chunks meant the quality of data was monitored to a higher degree. It is possible that without intervention in the accumulation of contextual data, industrial designers may be tempted to skip a “retrospective audit” on the quality of the data - definitely tempting when datasets can be so large. The downside to human selection is the amount of labor required to build larger datasets. This leaves the question (perhaps even a wicked problem): how could it be possible to build a Natural Language Extraction Program with a high degree of qualitative intelligence, which we can trust?
Therefore, at this point we paused further development in this area, as further R&D would be necessary with a central focus on maintaining an ethical approach to process and outcome using a qualitative and responsible form of automated data extraction, which we surmised would take significantly more time to develop. A similar pause was placed on the program needed to automatically sort core texts into appropriate completions for fine-tuning and answers (see far left task in image above). In other words, as we have a very small set of documents, a manual approach to the use of embeddings is viable for this design/dev sprint.
Moving Forward with Manual Chunking
The decision to move forward with manually selecting chunks, instead of developing a program, can be summarised as follows:
Preparing the rest of the core texts into key text chunks.
Including “humans-in-the-loop” in this task means further auditing of ethical contexts for the ethical dataset.
It doesn’t require the creation of an NLP chunking program.
Preparing Core Text Chunks for Embeddings
Each Core Text needs to be separated into key text chunks, and each key text chunk should be seen as holding the key information needed to answer questions. The token limit is approximately 1000 tokens per embedding.
💡 We found through testing that chunks with fewer tokens are generally matched more accurately to a user question. More testing will be helpful in this area to highlight relationships of specificity between questions and answers via embeddings. In this sprint our estimate is approximately 500 tokens per chunk, with a maximum of 1000 tokens.
This web application can be used for checking token counts.
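For those who prefer to check token counts in code rather than in a web tool, a GPT-2 tokenizer gives a close approximation of GPT-3 token counts. The sketch below assumes the Hugging Face transformers library; any GPT-2-compatible tokenizer would serve the same purpose.

```python
# Minimal sketch: approximate GPT-3 token counts for a candidate chunk
# using the GPT-2 tokenizer from Hugging Face transformers (an assumption;
# the exact tooling used to check chunks may differ).
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def token_count(chunk: str) -> int:
    return len(tokenizer.encode(chunk))

chunk = "Be transparent about the fact that you use bots as part of your product or service..."
count = token_count(chunk)
print(count)
assert count <= 1000, "chunk exceeds the ~1000-token embedding limit used in this sprint"
```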
The Chunks Folder from our project repository contains the finished chunks. The Core Texts were edited to show where the chunks have been selected from core texts.
Processing Core Text Chunks into Embeddings
Step 1. We use the GPT-2 tokeniser to turn the chunks into a CSV, “pre_embeddings_df”, which the OpenAI Embeddings API can read with Babbage connected to our Google Colab IDE. GPT-2 here refers to the tokeniser function used in preparing these word vectors.
Step 2. “pre_embeddings_df.csv” is then processed into embeddings using babbage models i.e: engine='text-similarity-babbage-001’
and engine='text-search-babbage-doc-001’
.
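The sketch below illustrates these two steps. Column and file names (e.g. pre_embeddings_df.csv, and an assumed chunks.csv input with a text column) mirror the description above, but the structure is an illustrative reconstruction rather than the exact CoreText_EmbedPrep code.

```python
# Sketch of Steps 1 and 2: count tokens per chunk, then attach similarity
# and search embeddings from the Babbage engines. File/column names follow
# the article; the exact CoreText_EmbedPrep implementation may differ.
import pandas as pd
import openai
from transformers import GPT2TokenizerFast

openai.api_key = "YOUR_API_KEY"  # placeholder
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Step 1: load the manually selected chunks and record their token counts.
df = pd.read_csv("chunks.csv")  # assumed to contain a 'text' column
df["n_tokens"] = df["text"].apply(lambda t: len(tokenizer.encode(t)))
df.to_csv("pre_embeddings_df.csv", index=False)

# Step 2: process each chunk into similarity and search embeddings.
def get_embedding(text: str, engine: str) -> list:
    return openai.Embedding.create(input=text, engine=engine)["data"][0]["embedding"]

df["similarity"] = df["text"].apply(lambda t: get_embedding(t, "text-similarity-babbage-001"))
df["search"] = df["text"].apply(lambda t: get_embedding(t, "text-search-babbage-doc-001"))
df.to_csv("embeddings_df.csv", index=False)
```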
Testing with Babbage Search
This test concluded a key research period by successfully demonstrating the effectiveness of Embeddings. Since the Embeddings test was successful, we now had the task of completing the manual chunking of the entire collection of Core Texts to produce our dataset. To finish the creation of the ethical dataset we ran the final collection of chunks through the program we developed for this purpose: CoreText_EmbedPrep.
Using Prompt Design with Embeddings
Next we needed to optimise a program to search the embeddings dataset and respond well, using a combination of prompt design and Embeddings similarity matching of the user’s input.
First we linked the correct embeddings match into the design of the prompt, for either DaVinci or Curie to be informed by, and included the directive to answer the user’s query. The sketch below illustrates this step.
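This matching-plus-prompting step might look something like the following. It assumes the query-side counterpart of the Babbage search engine (text-search-babbage-query-001) and the embeddings file sketched above, and uses the initial prompt design described later in this article; treat it as illustrative rather than the exact prototype code.

```python
# Sketch: match a user question against the chunk embeddings, then fold the
# best match into the prompt for Davinci or Curie. Assumes the query-side
# Babbage search engine and the embeddings_df.csv file sketched earlier.
import ast
import numpy as np
import openai
import pandas as pd

openai.api_key = "YOUR_API_KEY"  # placeholder
df = pd.read_csv("embeddings_df.csv")
df["search"] = df["search"].apply(ast.literal_eval)  # vectors stored as strings in CSV

def cosine_similarity(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(question: str) -> str:
    q_emb = openai.Embedding.create(
        input=question, engine="text-search-babbage-query-001"
    )["data"][0]["embedding"]
    scores = df["search"].apply(lambda e: cosine_similarity(q_emb, e))
    return df.loc[scores.idxmax(), "text"]

search_text = input("Ask a question: ")
match = best_match(search_text)
prompt = "\nAnswer the Question: " + search_text + "\nUsing the information from this paragraph: " + match
response = openai.Completion.create(engine="text-davinci-002", prompt=prompt, max_tokens=500)
print(response["choices"][0]["text"])
```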
Here is one of the outputs showing the matched embedding and the curated response:
Ask a question: How do i design a bot?
This is the (embeddings) match: Be transparent about the fact that you use bots as part of your product or service. Users are more likely to trust a company that is transparent and forthcoming about its use of bot technology, and a bot is more likely to be trusted if users understand that the bot is working to serve their needs and is clear about its limitations. It should be apparent to the user that they are not having an interaction with another person. Since designers might endow their bots with “personality” and natural language capabilities, it is important to convey to users that they are not interacting with another person and some aspects of their interaction are being performed by a bot. There are variety of design choices that can be made to accomplish this that do not degrade the user experience. Establish how the bot can help and the limitations associated with its use. Users are more likely to find a bot to be trustworthy if the bot sets reasonable expectations for what it can do and what it does not do well. Users should be able to easily find information about the limitations of the bot, including the possibility of errors and the consequences that can flow from such errors. For users who wish to “learn more,” you should offer a more detailed explanation of the purpose and operation of the bot. this is davinci summarising:
Amber Response: There is no one-size-fits-all answer to this question, as the design of a bot will vary depending on its purpose and the audience it is meant to serve. However, some general principles to keep in mind when designing a bot include being transparent about its use of bots technology, establishing how the bot can help and the limitations associated with its use, and making sure that users are able to easily find information about the limitations of the bot. Process finished with exit code 0
As you can see, it has drawn specific information from the embedding match it was paired with, rather than just repeating a sentence, which is what all of the previously planned models would have done, including Conditional Generation and the Questions and Answers endpoint.
We can easily change which model responds here with a quick engine change:
response = openai.Completion.create(engine="text-davinci-002", prompt=prompt, max_tokens=500)
Becomes:
response = openai.Completion.create(engine="text-curie-001", prompt=prompt, max_tokens=500)
Further Prompt Design Research
The main task to research and test here is to optimise prompt design with its inclusion of a matched embedding.
The working prompt design initially created for the first test was:
prompt = "\nAnswer the Question: " + search_text + "\nUsing the information from this paragraph: " + match
Where:
search_text = the user’s input
match = the matched embedding
After testing the above, the prompt design was updated to the following:
prompt = "Using relevant information from below, give an informed response and answer to the users prompt" + match + "\n User prompt: " + users_prompt + '\nAnswer:'
In this update:
users_prompt = the user’s input
'\n' = new line spacing
The order of the prompt design now places the embedding first, followed by the user question and then Amber’s response.
Including “Answer” at the bottom of the prompt frames the model’s response more clearly around the user prompt.
💡 These points essentially are the underlying structure of a ‘chat’ environment
Discoveries
Using “Respond” instead of “Answer” proved to be a limitation given the single input and single answer the model had at that point. The model also had a low temperature so as to respond with accuracy to the context.
Using “Answer” and adjusting the temperature parameter higher proved to allow for ‘creativity’ in answering the question and providing detail and granularity to the conversational context.
In code, it looks like:
engine="text-curie-001", prompt=prompt, temperature=1, max_tokens=100
Amber Prototype - Final Design Considerations
Ongoing design sought to address the description of the bot’s purpose.
Here we also included the target user within its purpose description:
prompt = "I am a smart **ethical ai product design chatbot**.\n Integrate the information below and give an informed answer to the users question \n " + match + "\n Users question: " + users_prompt + '\nAnswer: '
Additionally, we edited Amber’s prompted directive
from:
“Using relevant information from below”
to:
“Integrate the information below”
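Pulling the preceding steps together, a condensed sketch of how the final Amber prompt might be assembled and compared against the base-model Davinci response follows. The retrieval helper and function names are assumptions carried over from the earlier sketches, not the verbatim prototype code.

```python
# Condensed sketch of the final design: take the matched embedding, wrap it
# in Amber's purpose-framed directive, and request both the Amber (Curie)
# and Davinci responses for side-by-side comparison in the dual interface.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def ask_amber(users_prompt: str, match: str) -> dict:
    """`match` is the chunk returned by the embedding search sketched earlier."""
    prompt = (
        "I am a smart ethical ai product design chatbot.\n"
        " Integrate the information below and give an informed answer to the users question \n "
        + match
        + "\n Users question: " + users_prompt
        + "\nAnswer: "
    )
    amber = openai.Completion.create(
        engine="text-curie-001", prompt=prompt, temperature=1, max_tokens=500
    )
    davinci = openai.Completion.create(
        engine="text-davinci-002", prompt=users_prompt, max_tokens=500
    )
    return {
        "match": match,
        "amber": amber["choices"][0]["text"].strip(),
        "davinci": davinci["choices"][0]["text"].strip(),
    }
```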
Presentation of Amber Prototype
As a prototype application, Amber can currently only be accessed via a private Google Colab demo requiring an OpenAI API key; an encrypted string which connects the user to a kind of “principal” or account. Due to OpenAI usage terms and conditions, there are current limits to the use and scalability of testing and/or implementation of Amber; these are not discussed in this document but can be viewed in OpenAI’s Usage Guidelines (Responsible Use). However, here is an example of a user question and answer we tested:
THE USER asked: "I am a designer of automobiles and ask what are the ethics around designing child seat-belts, safety restraints or harnesses for automobile manufacture?"
(The chatbot runs for a moment before producing a “match” - the link to the specific “chunked embedding” or fragment of expert core text. This chunk of original text provides the basis of the chatbot’s informed response to the user question.)
This is the match: (matched embedding; the link to original source)
Ethical principles and automated vehicles. AVs, as machines which have to make decisions (in accordance with programming determined by humans) are also subject to complex and difficult ethical considerations. Some key ethical questions surrounding AVs include: Should the car be programmed to take an egalitarian approach (maximising benefit for the highest number of people) or a negative approach (maximising benefit for the occupant only, and increasing risk for everyone else)? Should car owners have a say in setting the car’s ‘moral code’? In situations where harm to humans is unavoidable, would it be acceptable for AVs to perform some kind of prioritisation – e.g. based on age? How should AVs distribute the risk of driving – for instance, would it be acceptable to program a car that valued occupants over pedestrians, or vice versa? In instances such as the fatal AV crash of March 2018, who is responsible for the harm caused – the operator or the car manufacturer?
It is relatively straightforward to program AVs in accordance with certain rules (e.g. do not cross a lane boundary; do not collide with pedestrians, etc. – although, as the March 2018 crash shows, the technology is still far from perfect in following these rules). ‘Dilemma situations’ represent cases where not all rules can be followed, and some kind of decision has to be made in accordance with ethical principles. Usually, a ‘hierarchy of constraints’ is needed to determine action. This has prompted debate over how an autonomous vehicle should weigh outcomes that will cost human lives in various situations where an accident is inevitable. Utilitarianism – maximising benefits and reducing harm for the greatest number of people, without distinction between them – is a strong principle underlying considerations of the ethics of AVs. Research by MIT has found that most people favour a utilitarian approach to AVs. However, while participants approved of utilitarian AVs in theory and would like others to buy them, they themselves would prefer to ride in AVs that protect occupants at all costs [174]. Given that car manufacturers will therefore be incentivised to produce cars programmed to prioritise occupant safety, any realisation of utilitarian ethics in AVs will likely be brought about only through regulation.
(It is worth noting that from a design perspective, the interface provides the user with an extra useful link to the original source material that informs the chatbot responses; a “matched embedding.” The “match” is a link to the embedding; a chunk of the original source text. The matched embedding most closely addresses the context that contains the most accurate potential answer to the user’s inquiry. The inclusion of the matched source text highlights an 'explainable' aspect of Amber which is another key feature of this UX/UI prototype. This aspect of the chatbot model is an example of observable or explainable AI as the user has a window into the inner layers of system operation.)
Prototype Response Example (based on embedding matches)
AMBER RESPONSE:
When designing child seat-belts, safety restraints or harnesses for automobile manufacture, there are a few key ethical considerations to keep in mind. First, it is important to prioritize safety for all users of the product, regardless of age or other demographics. Second, designers should strive to distribute the risk of injury as evenly as possible among all potential users of the product. And finally, designers should be aware of the potential for liability in the event of an accident.
(The Amber response is provided by the Amber chatbot, which is fine-tuned on the kind of “chunked” and “embedded” expert texts related to the key ethics topics noted in the “matched embedding” above. In this response, Amber recontextualizes the embedding in conversational terms, sometimes verbatim and other times paraphrased.)
DAVINCI RESPONSE:
There is no definitive answer to this question as it depends on the individual designer's personal ethical beliefs. However, some ethical considerations that could be taken into account when designing child seat-belts, safety restraints or harnesses for automobile manufacture include ensuring that the products are thoroughly tested for safety and effectiveness, and that they are manufactured using high-quality materials that are safe for children to use.
(Davinci responds using its own capability without the benefit of the extra contextual “embedding.” The Davinci response is provided by the text-davinci-002 GPT-3 engine, which gives a pithy and intelligent, if brief, response that is still philosophical in nature.)
Clearly the two models or engines are both able to offer responses to user questions. The current prototype suggests the two models offer similar answers; both types valid, but varying in scope and detail. A follow-up round of user-testing is necessary to determine any results regarding evaluation and deployment of the chatbot.
Human users of the Amber chatbot are invited to reconcile the two answers; combining them in a dialectical fashion to make their own comparative analysis of the possible options. The user of this prototype can bring their own knowledge; their values and problems to the conversation. They can also focus on the chatbot’s outputs in a way that is most helpful for their particular design context.
It is especially important to note that the number of words, that is the length of a chatbot response, is governed by a system of monetized tokens. This is but one of several major constraints which designers must observe when working with the OpenAI and other such platforms. More testing will be helpful in this area to highlight relationships between cost of tokens and the development of chatbot conversation via embeddings.
Why Is This Research Important?
We assume that future research into the design of responsible innovation tools needs to keep building on the ethical frameworks and guidelines emerging across the international policy environment. Ethics operate within a specific context, and that context needs to be illuminated for them to be effective or useful. A cautious, ethical approach to AI design can prevent accidents, inefficiencies, inequities and corruption while promoting the flourishing well-being of a user, an organization or a larger culture.
The chatbot design team sought to mitigate the problems of over-automation which can create errors and inefficiencies by failing to account for human factors and conditions. One key problem is that many companies and technology platforms may currently use chatbots, robots, virtual or AI entities and automated processes without recourse to any ethical protocols, data-auditing, guidelines, training or regulation.
In recognition of this fast-moving, “frontier” AI context, the Amber prototype points to the effectiveness and even requirement of an ethical baseline protocol in the development of Artificial Intelligence applications. For example, an ethical chatbot like Amber may assist in the training of human developers who can access an ethical framework as a key foundation for their work. In this way, Amber may provide a new industry standard for design and development of industrial products using Artificial Intelligence methods or techniques.
On the 22nd of September 2022, Google’s DeepMind research team published a paper describing the ‘Sparrow’ project which is very similar to Amber in that it is “an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines”. Like Sparrow, Amber demonstrates how the agent has been informed by showing linked references.
“Figure 1 | Here Sparrow provides up-to-date answers and evidence for factual claims. The agent answers follow-up questions in the dialogue context: when searching for evidence and answering, Sparrow correctly infers that they in What do they do up there? refers to the astronauts on the ISS. Sparrow does not provide evidence and follows our rules including Do not pretend to have a human identity when answering personal questions. A sample conversation from 9 September, 2022.” - Sparrow Project Archive
The Amber chatbot could be customized further and partnered with industry users who fine-tune the model with their own specific contextual data. The industry user may in future be able to integrate the ethically-trained Amber dataset with their own systems. For instance, the owner of a full-stack application may be able to integrate Amber as a base-model chatbot in tandem with their product or data-management service platforms. Such an arrangement may assure regulators that the industry partner has observed, and is in compliance with, established ethical principles and guidelines.