• Featured post

Embeddings, Vector Search & BM25

A computer cannot understand text, or the semantic relationships and meanings between words. It can only understand numbers. We solve this with embeddings.

An embedding is the representation of text (as numbers) in a vector space. This lets AI models compare and operate on the meaning of words.

flowchart TD
    A["dog"] --> B
    B --> C["[-0.003, 0.043, ..., -0.01]"]
    
    N1["(text we want to convert)"]:::note --> A
    N2["(vectors carrying semantic content)"]:::note --> C
    
    classDef note fill:none,stroke:none,color:#777;    

The vectors of each word or document capture the semantic meaning of the text.

  • dog will be close to pet
  • contract will be far from beach
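As a rough illustration of what "close" and "far" mean here, the sketch below compares vectors with cosine similarity, one common way to measure proximity. The 3-dimensional vectors are invented toy values, not the output of any real embedding model (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors made up for this example (assumption, not real model output).
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "pet": [0.8, 0.9, 0.2],
    "contract": [0.1, 0.0, 0.9],
    "beach": [0.0, 0.9, 0.1],
}

dog_pet = cosine_similarity(toy_vectors["dog"], toy_vectors["pet"])
contract_beach = cosine_similarity(toy_vectors["contract"], toy_vectors["beach"])

print(f"dog vs pet: {dog_pet:.2f}")
print(f"contract vs beach: {contract_beach:.2f}")
```

With these toy values, the first pair scores much higher than the second, which is the behaviour we expect from real embeddings.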

Vector vs SQL databases

The problem with typical databases is that they only look for exact matches. If I search for car, I only get the entries that contain car.

Vector databases, on the other hand, can work with the semantics of words through their vectors, so a search for car can also return values such as sedan, SUV, Land Rover, etc.

Vector databases are very good when we need to find similar items by how close they are to each other. One use case is finding similar movies (Netflix). Another is recommending similar items in online stores (Amazon).

How to run a search (query) using vectors

(You can see the code here)

We need:

  • A vector database (CosmosDB)
  • A model to generate the embeddings (text-embedding-3-large)

The full flow is as follows:

  1. Use an embedding model to get the vectors of the content we want to index
  2. Insert the original text and the vectors of the content into a vector database
  3. When we want to run a query, use the same embedding model as before on the query text. With the resulting embedding we search for similar vectors in the database and read the original text from original_text
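The three steps can be sketched end to end in memory. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in that counts words from a tiny vocabulary (a real model such as text-embedding-3-large captures meaning, this one only matches words), and a Python list stands in for the vector database.

```python
import math

# Tiny vocabulary for the toy embedding (assumption for illustration only).
VOCABULARY = ["shiba", "dog", "park", "walks", "contract", "lawyer", "signs"]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

store = []  # in-memory list standing in for the vector database

def index(item_id, text):
    # Steps 1 and 2: embed the content and store it next to the original text.
    store.append({"id": item_id, "original_text": text, "vectors": toy_embed(text)})

def search(query, top_k=1):
    # Step 3: embed the query with the same model and rank stored items by similarity.
    query_vector = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vectors"]),
        reverse=True,
    )
    return [item["original_text"] for item in ranked[:top_k]]

index("1", "A shiba walks alone in the park")
index("2", "The lawyer signs the contract")
results = search("shiba in the park")
print(results)
```

A real vector database does the ranking step for you, usually with an approximate nearest-neighbour index instead of sorting every item.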

    Inserting vectors into CosmosDB

    Before we can search, we need to fill the database with content. We keep it simple. We insert

    • a hand-written ID
    • the original text
    • the vectors that result from embedding the original text

The pseudocode looks like this and runs once per item:

text = "A shiba walks alone in the park"
# this sends the text to the model text-embedding-3-large 
vectors = createEmbeddingsForText(text)
item = {
	"id": "1",
	"original_text": text,
	"vectors": vectors
}
uploadToCosmosDB(item)

Example of the data I store:

{
	"id": "1",
	"original_text": "A shiba walks alone in the park",
	"vectors": [-0.003, 0.043, ..., -0.001]
}

Read More

Advanced SQL

UNION

The UNION operator combines the results of two SELECT statements into a single result set and removes duplicate rows.

SELECT column1, column2 FROM table1
UNION
SELECT column1, column2 FROM table2

We have the following tables

company1

per  name     surname
1    ANTONIO  PEREZ
2    PEDRO    RUIZ

company2

per  name     surname
1    LUIS     LOPEZ
2    ANTONIO  PEREZ
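To try this out without a database server, here is a small sketch using Python's built-in sqlite3 module with the two tables above. It assumes we select only name and surname (if per were included, the two ANTONIO PEREZ rows would differ and both would be kept), and it adds UNION ALL for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company1 (per INTEGER, name TEXT, surname TEXT);
CREATE TABLE company2 (per INTEGER, name TEXT, surname TEXT);
INSERT INTO company1 VALUES (1, 'ANTONIO', 'PEREZ'), (2, 'PEDRO', 'RUIZ');
INSERT INTO company2 VALUES (1, 'LUIS', 'LOPEZ'), (2, 'ANTONIO', 'PEREZ');
""")

# UNION removes the duplicate ANTONIO PEREZ row.
union_rows = conn.execute("""
    SELECT name, surname FROM company1
    UNION
    SELECT name, surname FROM company2
    ORDER BY name
""").fetchall()

# UNION ALL keeps every row, duplicates included.
union_all_rows = conn.execute("""
    SELECT name, surname FROM company1
    UNION ALL
    SELECT name, surname FROM company2
""").fetchall()

print(union_rows)
print(len(union_all_rows))
```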

Read More

Count number of entries in filtered table

(For this post, some formulas and menu names are in Spanish, because my Excel and computer are set to Spanish and Excel formula names depend on the language.)

The formula is:

=AGREGAR(3;3;J:J)-1

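AGREGAR is the Spanish name of AGGREGATE. The first argument (3) selects the counting function (COUNTA, which counts non-empty cells) and the second (3) tells it to ignore hidden rows, error values, and nested SUBTOTAL and AGGREGATE results. The important one is J:J, which marks the column to count. Because hidden rows are ignored, rows filtered out of the table are not counted.

Watch out for headers! If the column has a header row, keep the -1 at the end of the formula so the header is not counted.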
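If it helps to see the logic outside Excel, this is a rough Python sketch of what the formula computes. It is only an approximation: the `(value, hidden)` pairs and the function name are made up for the example, and it does not model the part about skipping error values.

```python
def count_visible_entries(rows, has_header=True):
    """Count non-empty, non-hidden cells in one column, minus the header if present.

    rows is a list of (value, hidden) pairs, top to bottom (hypothetical layout).
    """
    visible_non_empty = sum(
        1 for value, hidden in rows if not hidden and value not in (None, "")
    )
    return visible_non_empty - 1 if has_header else visible_non_empty

# Column J: a header, one row hidden by a filter, and one empty cell.
column_j = [
    ("Name", False),
    ("Ana", False),
    ("Luis", True),   # filtered out, so not counted
    ("", False),      # empty, so not counted
    ("Marta", False),
]

count = count_visible_entries(column_j)
print(count)
```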

Find differences for big dynamic lists in Excel

(For this post, some formulas and menu names are in Spanish, because my Excel and computer are set to Spanish and Excel formula names depend on the language.)

Here is how to find and mark differences between two really long lists or tables of unequal length in Excel. In my example, one list is a partial copy of the other: some items are missing and you have to find out which ones.

This is the full list.

Read More

OAuth 2.0

Authentication: the process of verifying an identity. We confirm users are who they say they are (for example, with a username and password).

Authorization: the process of verifying what someone is allowed to do (permissions and access control).
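The difference is easier to see as two separate checks. The sketch below is a toy example, not part of OAuth itself: the user store, the function names and the permission names are all hypothetical, and the unsalted SHA-256 hash is there only to keep the example short (a real system would use a dedicated password hashing algorithm).

```python
import hashlib
import hmac

# Hypothetical in-memory user store, for illustration only.
USERS = {
    "alice": {
        "password_hash": hashlib.sha256(b"s3cret").hexdigest(),
        "permissions": {"read"},
    }
}

def authenticate(username, password):
    """Authentication: is this really the user they claim to be?"""
    user = USERS.get(username)
    if user is None:
        return False
    candidate = hashlib.sha256(password.encode()).hexdigest()
    return hmac.compare_digest(candidate, user["password_hash"])

def authorize(username, permission):
    """Authorization: is this user allowed to do this?"""
    user = USERS.get(username)
    return user is not None and permission in user["permissions"]

print(authenticate("alice", "s3cret"))  # identity check
print(authorize("alice", "delete"))     # permission check
```

A user can pass the first check and still fail the second: alice is who she says she is, but she is not allowed to delete.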

Past solutions

From worst to best, along with the problems each one causes:

Credential Sharing

The worst one. An app is not able to differentiate between real user access and programmatic access.
Permissions are typically too broad, so the app also has the ability to access more content than it should.

We could redirect the user to the API, where they enter their credentials and get a cookie. This allows the app to access the API.

This is dangerous because of CSRF attacks: we’ve authorised the whole browser and not just the app.

Read More

How to solve VirtualBox disk has run out of space

How to solve the problem “Low disk space on ‘Filesystem root’. The volume has only xMB disk space remaining” when you completely fill a virtual disk in VirtualBox.

(You have to delete all your snapshots first)

Open a cmd terminal and run the following command (type it on a single line):

"c:\Program Files\Oracle\VirtualBox\VBoxManage.exe" modifymedium  
"c:\Users\mario\VirtualBox VMs\Ubuntu OTAN\Ubuntu OTAN.vdi" --resize 30000

The first path is an executable included with VirtualBox.
The second one is where your VDI actually is. --resize takes the size in MBs.

Finally, open a partition editor inside the VM (for example GParted) and grow the partition into the new space.
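If you resize disks often, the VBoxManage call can be scripted. This is a minimal sketch in Python that only wraps the same command shown above; the function names are my own, and the paths are the ones from this post, so adjust them to your machine. Passing the arguments as a list avoids quoting problems with the spaces in the paths.

```python
import subprocess

def build_resize_command(vboxmanage_path, vdi_path, size_mb):
    """Build the argument list for 'VBoxManage modifymedium <vdi> --resize <MB>'."""
    return [vboxmanage_path, "modifymedium", vdi_path, "--resize", str(size_mb)]

def resize_disk(vboxmanage_path, vdi_path, size_mb):
    """Run the resize. Requires VirtualBox to be installed; not called here."""
    subprocess.run(build_resize_command(vboxmanage_path, vdi_path, size_mb), check=True)

command = build_resize_command(
    r"c:\Program Files\Oracle\VirtualBox\VBoxManage.exe",
    r"c:\Users\mario\VirtualBox VMs\Ubuntu OTAN\Ubuntu OTAN.vdi",
    30000,
)
print(command)
```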

SCRUM PSM1 Certification - Index

Status: Certified!

These notes are my watered-down, personal version of The Scrum Guide 2020 and the following Udemy course: “Preparation For Professional Scrum Master Level 1 (PSM1)” by Vladimir Raykov.

If you want to get ready for the certification exam, I fully recommend buying his course on Udemy and watching it several times.

Scrum Guide 2020
1. Scrum Guide 2020 Notes
2. Scrum Glossary

“Preparation For Professional Scrum Master Level 1 (PSM1)” by Vladimir Raykov
1. Scrum Introduction
2. The Scrum Team
3. Scrum Events
4. Scrum Artifacts
5. Scrum Practices and Charts
6. A few words before the Exam
7. Recap of key concepts
8. Possible exam questions