Embedbase
Open-source API & SDK to connect any data to ChatGPT
Before you start, you need to get an API key at app.embedbase.xyz.
Note: we're working on a fully client-side SDK. In the meantime, you can use the hosted instance of Embedbase.
Design philosophy
- Dead simple
- Open-source
- Composable (integrates well with any AI provider, database, or LLM helper)
What is it
These are the official clients for Embedbase, an open-source API & SDK to easily create, store, and retrieve embeddings.
Who is it for
People who want to
- plug their own data into ChatGPT or any other LLM
- build recommendation systems
- build search engines
- build classification engines
- etc.
Embedbase Vector
Vector databases are specialized databases designed to work efficiently with high-dimensional vectors, often used for representing data in machine learning applications. Examples of such data include embeddings, which are numerical representations of unstructured data such as text, images, and more. Vector databases enable efficient storage, search, and retrieval of similar items based on their embeddings.
Embedbase Vector makes it easy to use vector databases. It simplifies working with embeddings and lets you store, index, and search unstructured data. With Embedbase, even those without machine learning expertise can build recommendation engines, search engines, and Q&A engines, with the same integration from development to deployment.
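To make the idea of similarity search concrete, here is a minimal sketch that ranks a few toy vectors against a query vector. It is illustrative only: the three-dimensional vectors are made up, the helpers cosineSimilarity and rankBySimilarity are hypothetical and not part of the SDK, and cosine similarity is an assumption, since this page does not state which metric Embedbase uses.

```javascript
// Illustrative sketch only: toy vectors and hypothetical helper names.
// Cosine similarity is an assumption; the metric Embedbase uses is not stated here.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the `limit` items most similar to the query, best match first.
function rankBySimilarity(queryEmbedding, items, limit = 3) {
  return items
    .map((item) => ({
      data: item.data,
      similarity: cosineSimilarity(queryEmbedding, item.embedding),
    }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit)
}

const items = [
  { data: 'hot dog toaster review', embedding: [0.9, 0.1, 0.0] },
  { data: 'shower thought about phones', embedding: [0.1, 0.9, 0.1] },
  { data: 'shower thought about cars', embedding: [0.0, 0.2, 0.9] },
]

const ranked = rankBySimilarity([1, 0, 0], items, 2)
console.log(ranked.map((r) => r.data))
```

Real embeddings typically have hundreds or thousands of dimensions, and a vector database uses an index rather than scanning every item as this sketch does.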
How to use the SDK
Installation
npm install embedbase-js
Initializing
import { createClient } from 'embedbase-js'
// you can find the api key at https://embedbase.xyz
const apiKey = 'your api key'
// this is using the hosted instance
const url = 'https://api.embedbase.xyz'
const embedbase = createClient(url, apiKey)
Searching datasets
You can search your dataset(s) using natural language queries. Embedbase will return the most similar items to your query, along with their similarity scores.
// fetching data
const data = await embedbase
.dataset('test-amazon-product-reviews')
.search('best hot dogs accessories', { limit: 3 })
console.log(data)
// [
// {
// "similarity": 0.810843349,
// "data": "This nice little hot dog toaster is a great addition to our kitchen. It is easy to use and makes a great hot dog. It is also easy to clean. I would recommend this to anyone who likes hot dogs.",
// "metadata": {
// "path": "https://amazon.com/hotdogtoaster",
// "source": "amazon"
// },
// "embedding": [0.35332, 0.23423, ...]
// },
// {
// "similarity": 0.294602573,
// "data": "200 years ago, people would never have guessed that humans in the future would communicate by silently tapping on glass",
// "embedding": [0.76532, 0.23423, ...]
// },
// {
// "similarity": 0.192932034,
// "data": "The average car in space is nicer than the average car on Earth",
// "embedding": [0.52342, 0.23423, ...]
// },
// ]
You can also filter by metadata:
const data = await embedbase
.dataset('test-amazon-product-reviews')
.search('best hot dogs accessories')
.where('source', '==', "amazon")
Using Bing Search
The point of internet search in Embedbase is to combine your private information with the latest public information.
Remember that models like ChatGPT only know about events up to a certain date. For example, if you ask ChatGPT about GPT-4 or about Sam Altman's talk with the Senate (which happened a few days ago at the time of writing), it will not know about it.
The recommended workflow is like this:
- search your question using the internet search endpoint
- (optional) add results to embedbase
- (optional) search embedbase with the question
- use .generate() to get your question answered
const data = await embedbase
// for example, this is a very recent AI paper that LLMs have no knowledge of
.internetSearch('qlora machine learning paper')
console.log(data)
// [
// {
// "title": "Qlora: ...",
// "description": "We present Qlora, ...",
// "url": "https://arxiv.org/abs/2104.07540"
// },
// ...
// ]
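The recommended workflow above can be sketched end to end as follows. This is a rough sketch, not an official recipe: answerWithInternet and createStubClient are hypothetical helpers, the way search results are turned into documents (title plus description, with the URL as metadata) is an assumption, and so is joining the generated pieces without a separator. The stub client only exists so the example runs without an API key. In real use you would pass the client returned by createClient.

```javascript
// Illustrative sketch of the workflow above. `answerWithInternet` is a hypothetical helper.
// `client` is expected to expose the methods used elsewhere on this page:
// internetSearch(), dataset().batchAdd(), dataset().search() and generate().
async function answerWithInternet(client, datasetId, question) {
  // 1. Search the internet for the question.
  const webResults = await client.internetSearch(question)

  // 2. (optional) Store the results, keeping the URL as metadata (assumed document shape).
  await client.dataset(datasetId).batchAdd(
    webResults.map((r) => ({
      data: `${r.title}\n${r.description}`,
      metadata: { path: r.url },
    }))
  )

  // 3. (optional) Search the dataset with the same question.
  const matches = await client.dataset(datasetId).search(question, { limit: 3 })

  // 4. Generate an answer from the retrieved context.
  const context = matches.map((m) => m.data).join('\n')
  const prompt = `Based on the following context:\n${context}\nAnswer the user's question: ${question}`
  let answer = ''
  for await (const piece of client.generate(prompt)) {
    answer += piece
  }
  return answer
}

// Stub client so the sketch runs without an API key or network access.
// Replace it with the client returned by createClient(url, apiKey).
function createStubClient() {
  const stored = []
  return {
    internetSearch: async () => [
      { title: 'Example title', description: 'Example description', url: 'https://example.com' },
    ],
    dataset: () => ({
      batchAdd: async (documents) => {
        stored.push(...documents)
        return documents
      },
      search: async (query, { limit }) => stored.slice(0, limit),
    }),
    generate: async function* (prompt) {
      yield 'stub answer for: '
      yield prompt.split('\n').pop()
    },
  }
}

answerWithInternet(createStubClient(), 'my-dataset', 'What is QLoRA?').then(console.log)
```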
Adding Data
You can add data such as text to your dataset(s). If you need to add images or other formats such as audio, please reach out on Discord and we will make it happen.
You can also attach metadata to the data. For example, if you are feeding an LLM like ChatGPT, a typical best practice is to store the source of the text, such as a URL, as metadata. You can then ask the AI to add links or footnotes in its output.
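As a rough illustration of the footnote idea, the sketch below turns documents and their metadata into a numbered context and asks the model to cite it. The helper buildPromptWithSources is hypothetical and not part of the SDK, the prompt wording is only one possible phrasing, and the result shape ({ data, metadata: { path } }) is assumed from the search output shown earlier.

```javascript
// Illustrative sketch: `buildPromptWithSources` is a hypothetical helper, not part of the SDK.
// It assumes results shaped like the search output shown earlier:
// { data: string, metadata?: { path?: string } }.
function buildPromptWithSources(question, results) {
  const context = results
    .map((result, index) => {
      const source =
        result.metadata && result.metadata.path
          ? result.metadata.path
          : 'unknown source'
      return `[${index + 1}] ${result.data} (source: ${source})`
    })
    .join('\n')
  return [
    'Answer the question using only the numbered context below.',
    'Cite the sources you used as footnotes, for example [1].',
    '',
    context,
    '',
    `Question: ${question}`,
  ].join('\n')
}

const sampleResults = [
  {
    data: 'This hot dog toaster is easy to clean.',
    metadata: { path: 'https://amazon.com/hotdogtoaster' },
  },
  { data: 'A document without metadata.' },
]
const promptText = buildPromptWithSources(
  'Which hot dog accessory is easy to clean?',
  sampleResults
)
console.log(promptText)
```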
The simplest way to add data to Embedbase is to use chunkAndBatchAdd, which uses recommended parameters by default:
const documents = [
{ data: 'This is a document' },
{ data: 'This is another document' },
{ data: 'This is a third document' },
{ data: 'This is a fourth document' },
{ data: 'This is a fifth document' },
{ data: 'This is a sixth document' },
{ data: 'This is a seventh document' },
{ data: 'This is an eighth document' },
{ data: 'This is a ninth document' },
{ data: 'This is a tenth document', metadata: { path: 'https://google.com/abcd' }}
]
const data = await embedbase.dataset('test-amazon-product-reviews').chunkAndBatchAdd(documents)
console.log(data)
// [
// {
// "id": "eiew823",
// "data": "This is a document",
// "embedding": [0.1, 0.2, 0.3, ...]
// },
// {
// "id": "zfuzfv",
// "data": "This is another document",
// "embedding": [0.1, 0.2, 0.3, ...]
// },
// {
// "id": "egreegregr",
// "data": "This is a third document",
// "embedding": [0.1, 0.2, 0.3, ...]
// },
// ...
// {
// "id": "vsdfvdvd",
// "data": "This is a tenth document",
// "metadata": {
// "path": "https://google.com/abcd"
// },
// "embedding": [0.1, 0.2, 0.3, ...]
// }
// ]
You can also use add or batchAdd.
Updating data
Embedbase offers several ways to update data:
- If you add files to Embedbase that sometimes change on your side, add a metadata key that identifies each file (for example, name), then use replace:
const documents = [
{
data: 'Nietzsche - Thus Spoke Zarathustra - Man is a rope, tied between beast and overman - a rope over an abyss.',
metadata: {
tag: 'philosophy',
},
},
{
data: 'Marcus Aurelius - Meditations - He who lives in harmony with himself lives in harmony with the universe',
metadata: {
tag: 'philosophy',
},
}
]
await embedbase.dataset('library').batchAdd(documents)
const res = await embedbase.dataset('library').replace([{
data: 'Nietzsche - Thus Spoke Zarathustra - One must have chaos within oneself, to give birth to a dancing star.'
}, {
data: 'Marcus Aurelius - Meditations - The happiness of your life depends upon the quality of your thoughts.'
}, {
data: 'Lao Tzu - Tao Te Ching - When I let go of what I am, I become what I might be.'
}], 'tag', '==', 'philosophy')
console.log(res)
// [
// {
// data: 'Nietzsche - Thus Spoke Zarathustra - One must have chaos within oneself, to give birth to a dancing star.',
// metadata: {
// tag: 'philosophy',
// },
// },
// {
// data: 'Marcus Aurelius - Meditations - The happiness of your life depends upon the quality of your thoughts.',
// metadata: {
// tag: 'philosophy',
// },
// },
// {
// data: 'Lao Tzu - Tao Te Ching - When I let go of what I am, I become what I might be.',
// metadata: {
// tag: 'philosophy',
// },
// }
// ]
Note that this replaces all documents tagged philosophy with the given documents (keeping the previous metadata).
- You can also store the ids returned by most Embedbase functions and use them to update your data:
const data =
await embedbase.dataset('test-amazon-product-reviews').update([{
// you get this id from add/batchAdd/search/list response
id: 'eiew823',
data: 'some new text',
metadata: {
path: 'https://google.com/new'
}
}])
console.log(data)
// {
// "id": "eiew823",
// "data": "some new text",
// "metadata": {
// "path": "https://google.com/new"
// },
// "embedding": [0.1, 0.2, 0.3, ...]
// }
- Lastly, you can simply create a new dataset every time something changes. This assumes you always have access to all the data, for example a GitHub repository.
Splitting and chunking large texts
AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks. We highly recommend using this function.
If you're just getting started, we recommend the chunkAndBatchAdd abstraction, which uses recommended parameters by default:
embedbase.dataset('some-data-set').chunkAndBatchAdd(...)
Otherwise, to split and chunk large texts yourself, use the splitText function:
import { splitText } from 'embedbase-js';
const text = 'some very long text...';
// ⚠️ Note that the right value for chunkSize depends
// on the embedder used in Embedbase.
// With models such as OpenAI's embeddings model, you can
// use a chunkSize of 500. With other models, you may need to
// use a lower chunkSize value.
// (Embedbase Cloud uses an OpenAI model at the moment) ⚠️
const chunkSize = 500
// chunkOverlap is the number of tokens that will overlap between chunks.
// It is useful to have some overlap to ensure that the context is not
// cut off in the middle of a sentence.
const chunkOverlap = 200
splitText(text, { chunkSize, chunkOverlap }).map(({chunk}) =>
embedbase.dataset('some-data-set').batchAdd([{data: chunk}])
)
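To show what chunkSize and chunkOverlap mean in practice, here is a simplified sketch of a sliding window with overlap. It is not the SDK's splitText: chunkWithOverlap is a hypothetical helper, and it counts characters rather than tokens so that it needs no tokenizer.

```javascript
// Illustrative sketch only: `chunkWithOverlap` is a hypothetical helper, not the SDK's splitText.
// It counts characters instead of tokens to stay dependency-free.
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('expected chunkSize > chunkOverlap >= 0')
  }
  // Each new chunk starts `step` characters after the previous one,
  // so consecutive chunks share `chunkOverlap` characters.
  const step = chunkSize - chunkOverlap
  const chunks = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once a chunk reaches the end of the text.
    if (start + chunkSize >= text.length) break
  }
  return chunks
}

console.log(chunkWithOverlap('abcdefghij', 4, 2))
// [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

A larger overlap repeats more text across chunks, which costs more storage and embedding calls but makes it less likely that a sentence is split away from its context.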
Creating a "context"
createContext is very similar to .search, but it returns an array of strings instead of objects. This is useful if you want to feed the results directly to GPT.
// retrieve the most relevant documents as plain strings
const data = await embedbase
.dataset('my-documentation')
.createContext('my-context')
console.log(data)
// [
// "Embedbase API allows to store unstructured data...",
// "Embedbase API has 3 main functions a) provides a plug and play solution to store embeddings b) makes it easy to connect to get the right data into llms c)..",
// "Embedbase API is self-hostable..."
// ]
Generating text
Under the hood, text generation uses OpenAI models. If you are interested in using other models, such as open-source ones, please contact us.
Remember that this counts toward your playground usage. For more information, head to the billing page.
const data = await embedbase
.dataset('my-documentation')
.createContext('my-context')
const question = 'How do I use Embedbase?'
const prompt =
`Based on the following context:\n${data.join('\n')}\nAnswer the user's question: ${question}`
for await (const res of embedbase.generate(prompt)) {
console.log(res)
// You, can, use, ...
}
You can also send the conversation history, as in the OpenAI API:
const history = [
{"role": "system", "content": "You are a helpful assistant that teaches how to use Embedbase"},
{"role": "user", "content": "How can I connect my data to LLMs using Embedbase?"},
{"role": "assistant", "content": "You can npm i embedbase-js, write two lines of code, and..."},
{"role": "user", "content": "how can i do this with my notion pages now using their api?"},
]
// ...
for await (const res of embedbase.generate(prompt, {history})) {
console.log(res)
// You, need, to, query, Notion, API, like, so:, ...
}
Listing datasets
const data = await embedbase.datasets()
console.log(data)
// [{"datasetId": "test-amazon-product-reviews", "documentsCount": 2}]
Listing documents
const data = await embedbase.dataset('test-amazon-product-reviews').list()
console.log(data)
// [
// {
// "id": "eiew823",
// "data": "Lightweight. Telescopic. Easy zipper case for storage.
// Didn't put in dishwasher. Still perfect after many uses.",
// "metadata": {"path": "https://www.amazon.com/dp/B00004OCNS"},
// "embedding": [0.1, 0.2, 0.3, ...]
// },
// {
// "id": "uzvuzv",
// "data": "Lightweight. Telescopic. Easy zipper case for storage.
// Didn't put in dishwasher. Still perfect after many uses.",
// "metadata": {"path": "https://www.amazon.com/dp/B00004OCNS"},
// "embedding": [0.1, 0.2, 0.3, ...]
// }
// ]
Clearing a dataset
Be careful: this deletes all the documents in the dataset, and they cannot be recovered.
await embedbase.dataset('test-amazon-product-reviews').clear()
Create a recommendation engine
Check out this tutorial.
Contributing
We welcome contributions to Embedbase.
If you have any feedback or suggestions, please open an issue or join our Discord to discuss your ideas.