What is an Embedding?
An embedding is a numerical representation of an object or entity, such as a word, sentence, document, image, or audio clip, in a continuous vector space.
Essentially, embeddings enable machine learning models to find similar objects. Given a photo or a document, a machine learning model that uses embeddings could find a similar photo or document. Since embeddings make it possible for computers to understand the relationships between words and other objects, they are foundational for AI.
| Word | Embedding |
| --- | --- |
| cat | [0.2, 0.4, -0.1, 0.8, 0.5] |
| dog | [0.3, 0.6, -0.2, 0.7, 0.4] |
The embeddings are designed such that words with similar meanings or contexts have embeddings that are close together in the embedding space. In this example, “cat” and “dog” might have embeddings that are relatively close to each other because they often appear in similar contexts, such as “pets” or “animals”.
What is a vector in machine learning?
In mathematics, a vector is an array of numbers that defines a point in a multidimensional space. In more practical terms, a vector is a list of numbers, such as {1989, 22, 9, 180}. Each number indicates where the object lies along a specified dimension.
In machine learning, the use of vectors makes it possible to search for similar objects. A vector-searching algorithm simply has to find two vectors that are close together in a vector database.
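As a sketch of this idea, the snippet below performs a brute-force nearest-neighbor search over a tiny in-memory set of vectors. The vectors are illustrative hand-made values, not real model output; a production vector database would use approximate search rather than this linear scan.

```python
import numpy as np

# A tiny in-memory "vector store": word -> embedding (illustrative values)
vectors = {
    "cat": np.array([0.2, 0.4, -0.1, 0.8, 0.5]),
    "dog": np.array([0.3, 0.6, -0.2, 0.7, 0.4]),
    "car": np.array([-0.7, 0.1, 0.9, -0.3, 0.2]),
}

def nearest(query, vectors):
    # Return the key whose vector has the smallest Euclidean distance to the query
    return min(vectors, key=lambda k: np.linalg.norm(vectors[k] - query))

# A query vector very close to "cat"
print(nearest(np.array([0.2, 0.45, -0.1, 0.8, 0.5]), vectors))  # -> cat
```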
To store these vector embeddings we need a database. Below are a few databases that can store embeddings.
A Few Vector Databases
- Pinecone
- Chroma DB – open source and free
- Deep Lake
- SingleStore
Dot Product:
The dot product is a mathematical operation that measures the similarity between two vectors in a vector space. In the context of embeddings, such as word embeddings or sentence embeddings, the dot product can help in searching for similar items by quantifying the similarity between their embedded representations.
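For example, using the illustrative "cat" and "dog" vectors from the table above, the dot product (and its length-normalized variant, cosine similarity) can be computed with NumPy:

```python
import numpy as np

cat = np.array([0.2, 0.4, -0.1, 0.8, 0.5])
dog = np.array([0.3, 0.6, -0.2, 0.7, 0.4])

# Dot product: sum of element-wise products
dot = np.dot(cat, dog)

# Cosine similarity divides out the vector lengths, leaving only direction
cos = dot / (np.linalg.norm(cat) * np.linalg.norm(dog))

print(round(float(dot), 2))  # 1.08
print(round(float(cos), 2))  # 0.96
```

Because the raw dot product grows with vector length, cosine similarity is often preferred when comparing embeddings of different magnitudes.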
Tools used for Demo
Hugging Face:
all-MiniLM-L6-v2
This is a sentence-transformers model: It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
Sign Up/Login: Go to the Hugging Face website (https://huggingface.co/) and sign up for an account if you don’t already have one. If you have an account, log in using your credentials.
Access Account Settings: Once you are logged in, click on your profile icon at the top right corner of the page. Then, select “Settings” from the dropdown menu.
Generate API Key: In the “Settings” page, navigate to the “Access Tokens” section. Here, you can generate a new token, which serves as your API key.
Chroma DB:
Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine.
Install:
pip install chromadb
Create a Chroma DB client:
import chromadb
from chromadb.config import Settings
chroma_client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory="db/"))
DuckDB: DuckDB refers to the DuckDB database engine, an in-process analytical database system designed for fast query processing and efficient data storage.
Parquet: Parquet is a columnar storage file format commonly used in big data processing frameworks like Apache Hadoop and Apache Spark. Parquet is designed for efficient storage and processing of large datasets, especially in distributed environments. It provides features such as efficient compression, encoding, and column pruning, making it suitable for analytics workloads.
persist_directory: the directory where the data is persisted to disk.
Create a collection: a collection is analogous to a table in SQL Server.
collection = chroma_client.create_collection(name="my_collection")
Query a Collection:
Chroma collections can be queried in a variety of ways, using the .query method.
You can query with a set of query_embeddings; the query returns the n_results closest matches to each query embedding, in order.
Distance:
Distance is a metric used to measure the similarity or dissimilarity between data points or embeddings when performing a query on a collection.
# Import necessary modules and classes from the ChromaDB library
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

# Initialize the Hugging Face embedding function with the specified model and API key
hf_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    api_key="<<<api Key>>>"
)

# Create a ChromaDB client with the specified settings
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="db/"
))

# Get or create a collection named "goodlog" using the ChromaDB client
coll = client.get_or_create_collection("goodlog")

# Read a log file and either add its lines to the collection or query for anomalies
def read_log_file(filename, addToColl):
    indexx = 0
    try:
        with open(filename, "r") as file:
            # Process each line in the file
            for line in file:
                indexx += 1
                process_log_line(line.strip(), indexx, addToColl)
    except FileNotFoundError:
        print(f"Error: Log file '{filename}' not found.")

# Process a single log line: add its embedding to the collection (addToColl == 1),
# or query the collection and flag anomalies (addToColl == 2)
def process_log_line(log_line, linenumber, addToColl):
    if addToColl == 1:
        # Add the log line's embedding to the collection
        logline = hf_ef([log_line])
        coll.add(embeddings=logline, documents=[log_line], ids=[str(linenumber)])
    elif addToColl == 2:
        # Query for the closest match and flag the line if the distance exceeds 0.5
        query_vector = hf_ef([log_line])
        res = coll.query(
            query_embeddings=query_vector,
            n_results=1,
            include=["distances"],
        )
        if round(res["distances"][0][0], 2) > 0.5:
            print(log_line + " ** this is Anomaly")
            print(res["distances"][0][0])
    else:
        return log_line

# Main program entry point
if __name__ == "__main__":
    # Uncomment the line below to add good log embeddings to the collection
    # read_log_file("./random_log_file.txt", 1)
    # Read the log file and search for anomalies
    read_log_file("./random_log_file_error.txt", 2)