
Embeddings, Vector Search & BM25

A computer cannot understand text, semantic relationships, or the meanings that connect words. It can only work with numbers. Embeddings solve this problem.

An embedding is a representation of text, as numbers, in a vector space. This lets AI models compare and operate on the meaning of words.

flowchart TD
    A["dog"] --> B
    B --> C["[-0.003, 0.043, ..., -0.01]"]
    
    N1["(text we want to convert)"]:::note --> A
    N2["(vectors with semantic content)"]:::note --> C
    
    classDef note fill:none,stroke:none,color:#777;    

The vector for each word or document captures the semantic meaning of the text.

  • dog will be close to pet
  • contract will be far from beach
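
One common way to measure "close" and "far" between vectors is cosine similarity. The sketch below assumes tiny 3-dimensional vectors that were invented for illustration (real embedding models return far larger vectors), and the `toy_embeddings` values are not the output of any actual model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-picked for illustration only.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "pet": [0.8, 0.9, 0.2],
    "contract": [0.1, 0.2, 0.9],
    "beach": [0.9, 0.1, -0.4],
}

dog_pet = cosine_similarity(toy_embeddings["dog"], toy_embeddings["pet"])
contract_beach = cosine_similarity(toy_embeddings["contract"], toy_embeddings["beach"])

print(f"dog vs pet: {dog_pet:.3f}")
print(f"contract vs beach: {contract_beach:.3f}")
```

With these made-up vectors, the dog/pet pair scores much higher than the contract/beach pair, which is the behavior we expect from real embeddings.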

Vector vs SQL databases

The problem with typical databases is that they only look for exact matches. If I search for car, I only get the entries that contain car.

Vector databases, in contrast, can interpret the semantics of words through their vectors, so a search for car can also return values such as sedan, SUV, Land Rover, etc.

Vector databases work very well when we need to find similar items by how close they are to one another. One use case is finding similar movies (Netflix). Another is recommending similar items in online stores (Amazon).

How to run a search (query) using vectors

(You can see the code here)

We need:

  • A vector database (CosmosDB)
  • A model to generate the embeddings (text-embedding-3-large)

The complete flow is as follows:

  1. Use an embedding model to get the vectors for the content we want to index
  2. Insert the original text and the content vectors into a vector database
  3. When we want to run a query, use the same embedding model as before on the query text. With the resulting embedding, we look for similar vectors in the database and read the original text from original_text
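
The three steps above can be sketched entirely in memory. This is a minimal illustration under loud assumptions: `toy_embed`, `VOCABULARY`, `store`, and `search` are hypothetical stand-ins, a plain list replaces CosmosDB, and the word-count "embedding" only matches shared words, whereas a real model such as text-embedding-3-large captures meaning.

```python
import math

# Hypothetical fixed vocabulary; a real embedding model needs nothing like this.
VOCABULARY = ["shiba", "walks", "alone", "park", "contract", "signed", "monday"]


def toy_embed(text):
    """Stand-in for an embedding model: normalized word counts over VOCABULARY."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in VOCABULARY]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def search(store, query, top_k=1):
    """Embed the query with the same function and rank items by dot product."""
    query_vector = toy_embed(query)
    scored = [
        (sum(q * v for q, v in zip(query_vector, item["vectors"])), item)
        for item in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item["original_text"] for _, item in scored[:top_k]]


# Steps 1 and 2: embed each text and keep it next to the original text.
store = []
texts = ["A shiba walks alone in the park", "The contract was signed on Monday"]
for item_id, text in enumerate(texts, start=1):
    store.append({"id": str(item_id), "original_text": text, "vectors": toy_embed(text)})

# Step 3: query with the same embedding function and read original_text back.
results = search(store, "shiba in the park")
print(results)
```

The key design point is that the query and the indexed content go through the same embedding function; vectors produced by different models are not comparable.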

    Inserting vectors into CosmosDB

    Before we can search, we need to fill the database with content. We keep it simple and store

    • a hand-written ID
    • the original text
    • the vectors that result from embedding the original text

The pseudocode looks like this; items are processed one at a time

text = "A shiba walks alone in the park"
# this sends the text to the model text-embedding-3-large 
vectors = createEmbeddingsForText(text)
item = {
	"id": "1",
	"original_text": text,
	"vectors": vectors
}
uploadToCosmosDB(item)

Example of the data I store

{
	"id": "1",
	"original_text": "A shiba walks alone in the park",
	"vectors": [-0.003, 0.043, ..., -0.001]
}


Java Index

These are my Java-related notes. They collect everything I refer back to when I have doubts about how to use or implement a framework or feature I’ve already implemented once.

Version changes

Interesting changes, new functionality and APIs that come to Java with each new version. They don’t include the full changes but the ones I deemed most useful or most interesting.

From Java 8 to Java 11
Java12
Java13

Experience

Small, functional snippets on how to implement a specific feature.

Java experience sheet
How to create a database intermediate table
Java date time API
New script files in Java

Frameworks

How to use and implement specific frameworks in a Java project (using Maven).

Spring in Action (Book)
Spring Cache
Spring Beans
Thymeleaf
Spring Cors

Maven (builder)
Testing (JUnit, TestNG, Mockito)
Vert.x (microservices)
Lombok (builder)
MapStruct (mapper)
Liquibase (database version control)

Splunk

Splunk takes any type of data, up to millions of entries, and lets you process it into reports, dashboards and alerts.

It’s great at parsing machine data. We can train Splunk to look for certain patterns in data and label those patterns as fields.

Planning Splunk Deployments

A note on config files

Everything Splunk does is governed by configuration files. They’re stored under Splunk’s etc directory ($SPLUNK_HOME/etc) and have the .conf extension.

They’re layered: files with the same name can exist in several directories. For example, you might have a global-level conf file and an app-specific conf file, and Splunk checks which one to use based on the current app.
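
The layering idea can be illustrated with a small sketch. It assumes the simplest case, where an app-level value overrides a global one; the setting names and values are invented for illustration and are not real Splunk stanzas, and Splunk's actual precedence rules are more detailed than this.

```python
from collections import ChainMap

# Hypothetical settings, invented for illustration only.
global_conf = {"max_results": "100", "timezone": "UTC"}
app_conf = {"max_results": "500"}

# ChainMap searches its maps from left to right, so the app-level value
# is found first whenever both layers define the same key.
effective = ChainMap(app_conf, global_conf)

print(effective["max_results"])  # defined in both layers
print(effective["timezone"])     # defined only in the global layer
```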


Oracle 1Z0-819 (Java11) Certification - Index

The new 1Z0-819 certification combines the two older exams (1Z0-815 & 1Z0-816) into one.

OCP Java SE 11 Programmer I - Study guide for 1Z0-815

Welcome to Java
Java Building Blocks
Java Operators
Making Decisions
Core Java APIs
Lambdas and Functional Interfaces
Methods and Encapsulation
Class Design
Advanced Class Design
Exceptions
Java Modules

OCP Java SE 11 Programmer II - Study guide for 1Z0-816

Java Fundamentals
Java Annotations
Generics and Collections

Google Cloud Developer Certification - Index

These are my personal notes for the GCP Developer certification. If you want to get ready, I fully recommend the Qwiklabs and Coursera courses.

Google Cloud Platform (GCP) Fundamentals: Core Infrastructure

Introducing Google Cloud Platform
Getting started with GCP
Virtual machines in the cloud
Storage in the cloud
Containers in the cloud
Applications in the cloud
Developing in the cloud
Big Data in the cloud
Machine Learning in the cloud

Getting started with Application Development

Best practices for app development
Google Cloud SDK, Client Libraries and Firebase SDK
Data Storage Options
Best practices for Cloud Datastore
Best practices for Cloud Storage

Securing and Integrating Components of your Application

Cloud IAM (Identity and Access Management)
OAuth2.0, IAP and Firebase Authentication
Cloud Pub/Sub (needs cleaning)
Cloud Functions (needs cleaning)
Cloud Endpoints (needs cleaning)

App deployment, Debugging and Performance

Deploying Applications (needs cleaning)
Execution Environments for your App (needs cleaning)
Debugging, Monitoring and Tuning Performance (needs cleaning)

Course Qwiklabs

Setting up a development environment

Extra Qwiklabs

Using the Cloud SDK Command Line

Getting started with Cloud Shell and gcloud
Configuring networks with gcloud
Configuring IAM permissions with gcloud
gsutil commands for Buckets
gsutil commands for BigQuery

From Java to Android with Kotlin

(Disclaimer: These are my personal notes from following Kotlin and Android courses on Udemy. They are a watered-down version of those courses. Check and buy the original courses if you want the full resources I used, in more detail)

Android

These are my notes on what I had to learn to go from Java developer to building my first Android app in Kotlin.

ViewBinding
DataBinding
MVVM Architecture
Live Data
ViewModel, LiveData, DataBinding
(wip: I still have to order and clean this series of posts from here on)
Recycler View
Navigation Architecture Component
Android Notifications
Coroutines
WorkManager
Android Testing

Extras:
Dagger2 Framework (dependency injection)
Hilt Framework (Dagger2 wrapper)
Room Framework (SQLite)
Android SQLite experience sheet
Android Development experience

Kotlin

This series of posts explains the main differences in language structures and usage between Kotlin and Java. I don’t cover the full Kotlin language, only the novelties that may be of interest to a Java developer.

From Java to Kotlin - Data Types & Casting
From Java to Kotlin - Operators & Operators Overloading
From Java to Kotlin - Nullable Types & Null Checks
From Java to Kotlin - Control Flow
From Java to Kotlin - Functions, Varargs & Default Parameters
From Java to Kotlin - Standard Library Functions
From Java to Kotlin - Lambdas
From Java to Kotlin - OOP, Companion Objects & Destructuring in Kotlin
From Java to Kotlin - Exceptions & Collections

Extras:
Kotlin cheat sheet with code examples

Scrapy (Python web crawler)

Scrapy is a web scraper and crawler.

Concepts

spider: class that you define and Scrapy uses to scrape information from a website (or a group of websites). It must define the initial requests to make, optionally how to follow links in the pages, and how to parse the content to extract data

item pipeline: after an item has been crawled by a spider, it’s sent to the item pipeline which processes it through several components that are executed sequentially. You can use them, for example, to save items to a database
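
To make the sequential pipeline idea concrete, here is a minimal sketch. Scrapy pipeline components are plain classes with a `process_item(self, item, spider)` method; everything else here is an assumption for illustration: the two component classes, the `text` field, and `run_pipeline`, which is only a simplified stand-in for what Scrapy does internally.

```python
class NormalizeTextPipeline:
    """Hypothetical component: trims whitespace from the item's 'text' field."""

    def process_item(self, item, spider):
        item["text"] = item["text"].strip()
        return item


class CollectPipeline:
    """Hypothetical component: keeps items in a list instead of a real database."""

    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        self.saved.append(dict(item))
        return item


def run_pipeline(components, item, spider=None):
    """Simplified stand-in for Scrapy: pass the item through each component in order."""
    for component in components:
        item = component.process_item(item, spider)
    return item


collector = CollectPipeline()
result = run_pipeline([NormalizeTextPipeline(), collector], {"text": "  hello  "})
print(result)
print(collector.saved)
```

In a real project you would not call the components yourself; they are enabled, with their order, through the `ITEM_PIPELINES` setting.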

How to use

# create a new project
scrapy startproject your_project_name  

# after writing a spider, it starts the crawl
scrapy crawl quotes


React JS

JavaScript library for building user interfaces. Created by Facebook.

Yarn

JavaScript package manager compatible with npm that helps automate the process of installing, updating, configuring, and removing npm packages.

Install

# add Yarn repository
curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -  

echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list  

sudo apt-get update  
sudo apt-get install nodejs yarn  
yarn --version # verify


Programming Templates

Which problem does it solve?

  1. Whenever I started a new project, I would spend the first few days setting everything up: Java 11 (.NET with C# nowadays), Maven configuration, Docker, microservices communication, database config, etc. This process took way too much time.

  2. When learning a new programming language or framework, by the time I needed to use it, I had often forgotten how to set everything up. This approach saves me time in the long run and forces me to really learn how to use the new technology.

  3. For technical tests during job hunting, it allows me to save time and focus entirely on the code challenge.

GitHub Repository

Jekyll

Jekyll is a blog-aware static site generator written in Ruby. It’s used for GitHub Pages, and it transforms files written in Markdown and Liquid into a complete HTML website.

Installation

Pre-requirements

sudo apt-get install ruby-full build-essential zlib1g-dev
sudo gem install jekyll bundler

Configuration

The basic config is under _config.yml

# shows any config mishap
bundle exec jekyll doctor


Docker best practices

List of things to do, to improve your Docker experience

Never map the public port in a Dockerfile

If you map it, only one instance of the container can run on a host, because the host port is already taken. Users who want to map the port can do it in a compose file or with the -p option.

# public and private mapping (don't do this)
EXPOSE 80:8080

# private mapping
EXPOSE 80
