In the ever-evolving landscape of cloud-native applications, the need for scalable and resilient architectures has never been more critical. Traditional monolithic systems often fall short when it comes to handling large volumes of data and complex workflows. Enter Dapr: Distributed Application Runtime, a powerful framework that helps you build a highly scalable and resilient microservices architecture. In this blog post, we'll explore how you can leverage Dapr to distribute your search ingest pipeline, enhancing scalability, resiliency, and maintainability.
Traditional monolithic architectures often struggle with the demands of processing large volumes of documents for search indexing. Challenges include scalability, resiliency, and efficient orchestration of the various services involved in extracting, processing, and enriching document content. All these services have their own rate limits, which need to be managed carefully with proper back-off and retry strategies optimized for that specific service. Enter Dapr, with its microservices-oriented approach, offering a compelling solution to these challenges.
Aspect | Monolithic Approach | Dapr-based Solution |
---|---|---|
Scalability | ❌ Limited; scaling the entire application can be inefficient | ✅ High; individual components can be scaled independently |
Resiliency | ❌ Retry policies and error handling can be complex | ✅ Improved; easier to manage with Dapr's built-in mechanisms |
Kubernetes Integration | ❌ May require additional configuration | ✅ Kubernetes-native; designed to work seamlessly with k8s |
Monitoring | ❌ Custom setup required for metrics and tracing | ✅ Built-in support for Application Insights and other monitoring tools |
Componentization | ❌ All logic is within a single application | ✅ Logic is distributed across multiple Dapr applications |
Complexity | ✅ Single application to manage | ❌ Multiple applications increase management complexity |
Asynchronous Processing | ❌ Can be challenging to implement and track | ✅ Native support for async operations, but tracking can be complex |
Overhead | ✅ Potentially lower as there's a single runtime | ❌ Additional overhead due to Dapr sidecars and messaging/statestore components |
Dapr shows significant improvements, adding just a bit of complexity and overhead
Dapr facilitates the development of distributed applications by providing building blocks that abstract away the complexities of direct communication between microservices, state management, and resource binding. By leveraging Dapr in a document processing pipeline, each stage of the process becomes a separate Dapr application, capable of independently scaling and recovering from failures.
Consider a workflow designed to ingest, process, and index documents for search. The process involves several stages, from extracting text from documents to enriching the content with metadata, generating embeddings, and ultimately indexing the documents for search. With Dapr, each of these stages can be implemented as a separate microservice, communicating through Dapr's pub/sub messaging and utilizing shared state management and resource bindings.
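As an illustration of the pub/sub building block, here is a minimal sketch (not taken from the repository) of how one stage could publish a section to the next stage using the Dapr Python SDK; the pub/sub component name and topic name are placeholders:

import json
from dapr.clients import DaprClient

def publish_section(section: dict):
    # Publish one document section; the Dapr sidecar routes it to the
    # configured pub/sub component (e.g. Azure Service Bus)
    with DaprClient() as client:
        client.publish_event(
            pubsub_name="pubsub",              # name of the Dapr pub/sub component
            topic_name="generate-embeddings",  # placeholder topic for the next stage
            data=json.dumps(section),
            data_content_type="application/json",
        )

publish_section({"id": "doc-1-section-0", "content": "..."})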
To summarize the flow, the workflow typically includes the following stages:
- Batcher is triggered by an HTTP request, retrieving every single document in the given blob path and adding these to a queue.
- ProcessDocument is subscribed to this queue, pulling the raw document from blob and extracting its content using Form Recognizer/Document Intelligence, splitting it up into multiple manageable sections. Each section is added to 3 enrichment queues (GenerateEmbeddings, GenerateKeyphrases, GenerateSummary).
- Each enrichment service processes its queue and signals EnrichmentComplete when it is done with a section.
- Once all enrichments for a section are in, DocumentCompleted is triggered to notify the section is finished.
- Once all sections of a document are finished, BatchCompleted is triggered to notify the Document is fully processed.
- Once BatchCompleted has been triggered for all Documents that needed processing in the pipeline, the Azure Search Indexer is started, pulling all sections from Blob to populate the search index.

This GitHub repository can serve as inspiration for implementing this flow, including scripts to deploy the infrastructure, local development and deployment scripts for deploying these services to Kubernetes.
Adopting Dapr for your search ingest pipeline can be a game-changer. It offers significant advantages in scalability, resiliency, and maintainability, making it a strategic investment for future-proofing your applications. While it introduces some complexity and overhead, the benefits of a microservices-oriented architecture, particularly in a Kubernetes environment, far outweigh these trade-offs.
Splitting each service into its own Dapr container provides several key advantages: independent scaling per stage, isolated retry and rate-limit handling for each external service, and per-component monitoring of the pipeline.
For more details and to access the deployment scripts, visit the search-ingest-dapr GitHub repository.
I often get the question: how do you manage your blog? Do you pay anything for hosting your blog? Do you have some sort of markdown interpreter? How about Search Engine Optimizations, can your blog posts be found online using Google? The short respective answers: completely managed through a self-hosted Ghost instance, completely free, with a rich editor and out-of-the-box search engine optimization.
So, how does it work? Let's start with a list of requirements:

- Docker, to run Ghost and MySQL locally
- ghost-static-page-generator for creating static page content
- A GitHub repository with GitHub Pages enabled, to host the static content

After you've set up Docker, run Ghost locally with the following docker-compose.yml, which also sets up a local MySQL container for storing your blog posts:
version: '3'
services:
ghost:
container_name: ghost
image: ghost:latest
restart: always
ports:
- 2368:2368
volumes:
- ./content:/var/lib/ghost/content
environment:
url: http://localhost:2368
database__client: mysql
database__connection__host: db
database__connection__user: root
database__connection__password: superS3cretp@ssw0rd
database__connection__database: ghost
db:
container_name: ghost_mysql
image: mysql:8.4
command: mysqld --mysql-native-password=ON --authentication_policy=mysql_native_password
restart: always
environment:
MYSQL_ROOT_PASSWORD: superS3cretp@ssw0rd
volumes:
- ./data:/var/lib/mysql
Note: remember to change your database password to a strong unique password
With this file in-place, run the container with:
docker-compose up -d
Once the initial run is complete, your Ghost instance will be available at http://localhost:2368.
With Ghost running locally, you can convert your blog to static HTML with gssg. This creates static HTML and resources for all the endpoints of your blog, which you can then host on a website. Static sites benefit from faster load times and enhanced security since they don't rely on server-side scripting, making your blog not only quick to access but also less vulnerable to web attacks.
To install gssg, make sure Node.js is installed on your local machine, then install gssg globally using:
npm install -g gssg
With gssg installed, you can generate the static site with:
gssg --domain http://localhost:2368 --url https://blog.example.com --dest ./static
With the following parameters:

- --domain: the domain and port hosting your local Ghost instance
- --url: the external URL where your blog will be hosted (e.g., https://blog.example.com)
- --dest: the local path of your static website content

To make your blog accessible to the world, push your static HTML content to a GitHub repository configured with GitHub Pages. GitHub Pages offers a robust, free hosting solution that integrates seamlessly with your GitHub workflow. For most personal and project blogs, it provides a perfect balance between ease of use and functionality.
Follow the instructions below to set up your own GitHub repository with GitHub Pages:

- Create a new repository named your-github-username.github.io. For example, mine is https://github.com/bart-jansen/bart-jansen.github.io
- Push the generated static content from ./static to this repository.

To personalize your blog's URL, you can configure a custom domain by adding a CNAME record. A CNAME record (Canonical Name) redirects visitors from your default GitHub domain to a domain name of your choosing. For instance, instead of your-github-username.github.io, visitors could reach your blog via www.yourblogname.com.
Follow the instructions in the link above. For my blog I configured the custom domain bart.je, which will push a CNAME file to the root of your repository containing bart.je.
The Azure Search Index can be populated in two different ways. You can either directly push data into the index via the REST API/SDK (left image), or leverage the built-in Azure Search Indexer, which pulls data from a chosen DataSource and adds it to a specified index (right image).
Choosing between the pull and push models depends on the requirements you have. Do you care most about how fast your data gets indexed, or is security your top priority? Maybe you need the flexibility to work with many different data sources. Each model has its own set of pros and cons. To help you decide, we've put together a comparison table below that breaks down the differences between the pull and push models across several important areas, including their ability to enrich data with AI, security features, indexing performance, and more.
Aspect | Pull Model | Push Model | Notes |
---|---|---|---|
Enrichment | ✅ Supports AI enrichment through built-in skillsets. | ❌ Does not support AI enrichment capabilities via skillsets. | Manual enrichment before indexing also possible |
Security | ✅ Makes use of Azure AI Search's built-in security. | ❌ All security measures must be implemented by the user. | |
Performance | ❌ Inferior indexing performance compared to push model (currently ~3 times slower)* | ✅ More performant indexing, allows async upload of up to 1,000 documents per batch. | Indexing performance is key for the overall ingestion duration. |
Monitoring | ✅ Azure AI Search provides built-in monitoring capabilities. | ❌ User needs to monitor the index status manually. | |
Flexibility | ❌ Limited to supported data sources; source needs to be compatible with Azure AI Search Indexer. | ✅ More flexible; can push data from any source. | Azure Blob, Data Lake, SQL, Table Storage and CosmosDB are supported at time of writing. |
Reindexing | ✅ Easily able to reindex with a simple trigger, if data stays in place in DataSource. | ❌ Need to cache documents and manually push each document again | Re-indexing a lot easier with pull model. |
Tooling | ✅ Indexer functionality is exposed in the Azure portal. | ❌ No tool support for pushing data via the portal. |
The comparison between the pull and push models shows that the pull model excels in areas like AI enrichment, built-in security, monitoring, and ease of reindexing, thanks to its integration with Azure AI Search's capabilities. However, it falls short in indexing performance and flexibility, being slower and limited to certain data sources.
Feeding data into your search index, whether through push or pull, significantly impacts the overall time taken for ingestion. As seen from the comparison table above, using the push approach can make the process up to three times faster. These results come from the standard configuration, when using a single Azure Search Indexer and a single thread pushing batches of documents to the search index.
Using a dataset with 10,000 documents, we evaluated the performance of ingesting data with the pull and push models under various configurations.
We'll have a look at the impact of each individual configuration first, and then combine configurations for further optimizations. To ensure consistency in testing conditions and to mitigate the effects of throughput and latency, all data ingestions were conducted within the same Azure Region, originating from an Azure-hosted machine. The results presented are the average of five test runs.
Configuration | Indexer/Pull (mm:ss) | Push (mm:ss) |
---|---|---|
S1 1 partition | 3:10 | 1:05 |
S2 1 partition | 2:51 | 0:57 |
S1 4 partitions | 3:22 | 1:10 |
As shown in the results above, the push method is almost three times faster than using an indexer. Interestingly, the introduction of additional partitions and upgrading to an S2 search instance had a negligible effect on performance in these tests. These results suggest that both the addition of more partitions and upgrading to an S2 search instance primarily enhance capacity and query performance, rather than directly improving the rate of data ingestion.
To speed things up, we also investigated parallel indexing. For the pull model we can use multiple indexers that write to the same index, and with the push model we can asynchronously push multiple batches simultaneously. Here, we also vary the degree of parallelism to see how that affects the results.
Configuration | Indexer/Pull (mm:ss) | Push (mm:ss) |
---|---|---|
S1 1 parallel | 3:10 | 1:05 |
S1 10 parallel | 0:20 | 0:59 |
S1 20 parallel | 0:11 | 0:49 |
S1 40 parallel | 0:06 | 0:34 |
The performance of the indexer scaled almost linearly with parallelization, showing significant improvements. Specifically, using 10 indexers was approximately 9.5 times faster, 20 indexers around 17.3 times faster, and 40 indexers about 31.2 times faster compared to a single indexer. Although asynchronously pushing batches also enhanced performance, the improvement was roughly 2 times better when using 40 parallel threads for pushing. Increasing parallel threads further had a negative impact and decreased performance for both push and pull.
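To give an idea of what pushing batches in parallel could look like, here is a minimal sketch using a thread pool; the service name, index name, key and batch preparation are placeholders, and in practice you would add retry handling for throttled requests:

from concurrent.futures import ThreadPoolExecutor
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<api-key>"),
)

def push_batch(batch):
    # Each batch holds up to 1,000 documents
    return search_client.upload_documents(documents=batch)

def push_all(batches, parallelism=40):
    # Push all prepared batches concurrently from a single process
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(push_batch, batches))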
Choosing between the push and pull models for populating the Azure Search Index should be based on project-specific requirements, including indexing speed, security, source flexibility, AI enrichment potential and the need for re-indexing over time. While the pull model integrates closely with Azure AI Search's advanced features, offering built-in enrichment and easy re-indexing, it lags behind the push model in terms of speed and flexibility.
Using standard configurations, the push model outperforms pull and is three times faster. However, you can set up multiple indexers to linearly improve ingestion speed for each indexer added, minding the maximum limit of indexers. This does require a bit of orchestration, where you need to split your data over multiple data sources (or folders) and track the progress of multiple indexers.
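On the pull side, that orchestration could look roughly like the sketch below, assuming the data has already been split over several pre-created data sources; all names are placeholders:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import SearchIndexer

indexer_client = SearchIndexerClient(
    endpoint="https://<service-name>.search.windows.net",
    credential=AzureKeyCredential("<api-key>"),
)

# One indexer per pre-split data source, all writing into the same index
for i in range(10):
    indexer = SearchIndexer(
        name=f"docs-indexer-{i}",
        data_source_name=f"docs-datasource-{i}",
        target_index_name="<index-name>",
    )
    indexer_client.create_or_update_indexer(indexer)
    indexer_client.run_indexer(indexer.name)

# Track the progress of each indexer
for i in range(10):
    status = indexer_client.get_indexer_status(f"docs-indexer-{i}")
    print(f"docs-indexer-{i}:", status.last_result.status if status.last_result else "pending")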
This blog post shows various code snippets on how to achieve these different steps using Python. The first bit is to set up and populate the Search Index, and the last sections show how to query the search index and enrich the data using GPT.
Please keep in mind that these snippets are a great starting point, but have been kept as small as possible to fit the format of this blog. Comprehensive processing such as chunking, text splitting, enrichment, embedding, semantic configuration and semantic reranking has consciously been kept out of the scope of this blog post, but is required for creating an effective search index.
The first step is to create an Azure AI Search Index. This can be done through the Azure Portal, the REST API or with the Python SDK:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndex, SearchField, SearchFieldDataType,
    SemanticSettings, SemanticConfiguration, PrioritizedFields, SemanticField,
    VectorSearch, VectorSearchAlgorithmConfiguration, HnswParameters,
)

SEARCH_SERVICE = "your-azure-search-resource"
SEARCH_INDEX = "your-search-index"
SEARCH_KEY = "your-secret-azure-search-key"
SEARCH_CREDS = AzureKeyCredential(SEARCH_KEY)
SEARCH_CLIENT = SearchIndexerClient(endpoint=f"https://{SEARCH_SERVICE}.search.windows.net/", credential=SEARCH_CREDS)
def create_index():
    client = SearchIndexClient(endpoint=f"https://{SEARCH_SERVICE}.search.windows.net/", credential=SEARCH_CREDS)
# Define the index
index_definition = SearchIndex(
name=SEARCH_INDEX,
fields=[
SearchField(name="id", type=SearchFieldDataType.String, key=True),
SearchField(name="content", type=SearchFieldDataType.String, filterable=True, sortable=True),
SearchField(name="sourcefile", type=SearchFieldDataType.String, filterable=True, facetable=True),
SearchField(
name="embedding",
type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
hidden=False,
searchable=True,
filterable=False,
sortable=False,
facetable=False,
vector_search_dimensions=1536,
vector_search_configuration="default",
)
],
semantic_settings=SemanticSettings(
configurations=[
SemanticConfiguration(
name='default',
prioritized_fields=PrioritizedFields(
title_field=None, prioritized_content_fields=[SemanticField(field_name='content')]
)
)
]
),
vector_search=VectorSearch(
algorithm_configurations=[
VectorSearchAlgorithmConfiguration(
name="default",
kind="hnsw",
hnsw_parameters=HnswParameters(metric="cosine")
)
]
)
)
# Create the index
client.create_index(index=index_definition)
This sets up an Azure AI Search Index with these fields:

- id: ID of the document
- content: plain text content of your document
- sourcefile: PDF file used, including page number of this document
- embedding: vectorized embedding of your plain text content

Since we're using a vector embedding field we configure vector_search, and we also set up a default semantic_configuration and define which fields to use for our (non-vector) content (content in our case).
With the index in place, we need to process the PDFs: split them into pages, extract the text (including tables) with Form Recognizer, split the text into chunks with an appropriate chunk and overlap size, and enrich the content.
First, we split the PDFs into single page documents:
import io
from PyPDF2 import PdfFileReader, PdfFileWriter
def split_pdf_to_pages(pdf_path):
"""
Splits a PDF file into individual pages and returns a list of byte streams,
each representing a single page.
"""
pages = []
with open(pdf_path, 'rb') as file:
reader = PdfFileReader(file)
for i in range(reader.getNumPages()):
writer = PdfFileWriter()
writer.addPage(reader.getPage(i))
page_stream = io.BytesIO()
writer.write(page_stream)
page_stream.seek(0)
pages.append(page_stream)
return pages
# Example usage
pdf_path = 'path/to/your/pdf/file.pdf'
pages = split_pdf_to_pages(pdf_path)
We then use the contents of these single-page documents with Azure Form Recognizer (also known as Azure Document Intelligence) to extract the text from the document:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

FORM_RECOGNIZER_SERVICE = "your-fr-resource"
FORM_RECOGNIZER_KEY = "SECRET_FR_KEY"
FORM_RECOGNIZER_CREDS = AzureKeyCredential(FORM_RECOGNIZER_KEY)
def get_document_text_from_content(blob_content):
offset = 0
page_map = []
form_recognizer_client = DocumentAnalysisClient(
endpoint=f"https://{FORM_RECOGNIZER_SERVICE}.cognitiveservices.azure.com/",
credential=FORM_RECOGNIZER_CREDS,
headers={"x-ms-useragent": "azure-search-sample/1.0.0"}
)
poller = form_recognizer_client.begin_analyze_document("prebuilt-layout", document=blob_content)
form_recognizer_results = poller.result()
    for page_num, page in enumerate(form_recognizer_results.pages):
        # DocumentPage objects don't expose the page text directly; slice it out
        # of the full result content using the page's span
        span = page.spans[0]
        page_text = form_recognizer_results.content[span.offset : span.offset + span.length]
        page_map.append((page_num, offset, page_text))
        offset += len(page_text)
return page_map
With the page contents extracted from the PDF, per page, we can split the text into chunks. An easy way to do this is:
def split_text(page_map, max_section_length):
"""
Splits the text from page_map into sections of a specified maximum length.
:param page_map: List of tuples containing page text.
:param max_section_length: Maximum length of each text section.
:return: Generator yielding text sections.
"""
all_text = "".join(p[2] for p in page_map) # Concatenate all text
start = 0
length = len(all_text)
while start < length:
end = min(start + max_section_length, length)
section_text = all_text[start:end]
yield section_text
start = end
# Example usage
max_section_length = 1000 # For example, 1000 characters per section
sections = split_text(page_map, max_section_length)
for section in sections:
print(section) # Process each section as needed
Note: the snippet above is a very simplistic way of splitting text. In production you'd want to take into account sentence_endings, overlap, word_breaks, tables, cross-page sections, etc.
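As a slightly more realistic, but still simplified, sketch of what such splitting could look like, the variant below prefers to break on sentence endings and overlaps consecutive sections; the boundary characters and sizes are just example values:

SENTENCE_ENDINGS = {".", "!", "?"}

def split_text_with_overlap(all_text, max_section_length=1000, overlap=100):
    """Yield sections of roughly max_section_length characters, preferring to
    break on a sentence ending and overlapping consecutive sections."""
    start = 0
    length = len(all_text)
    while start < length:
        end = min(start + max_section_length, length)
        if end < length:
            # Walk back to the closest sentence ending, if one is reasonably near
            boundary = end
            while boundary > start + max_section_length // 2 and all_text[boundary - 1] not in SENTENCE_ENDINGS:
                boundary -= 1
            if boundary > start + max_section_length // 2:
                end = boundary
        yield all_text[start:end]
        if end >= length:
            break
        start = end - overlap  # overlap the next section with the previous one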
We can use the page_map from get_document_text_from_content to call the split_text function and set up the sections for our index:
def create_sections(filename, page_map, max_section_length=1000):
    for i, content in enumerate(split_text(page_map, max_section_length)):
section = {
"id": f"{filename}-page-{i}",
"content": content,
"sourcefile": filename
}
section["embedding"] = compute_embedding(content)
yield section
We can generate embeddings using Azure OpenAI's text-embedding-ada-002 model:
import openai

# Configurations
OPENAI_SERVICE = "your-azure-openai-resource"
OPENAI_DEPLOYMENT = "embedding"
OPENAI_KEY = "your-secret-openai-key"
# OpenAI setup
openai.api_type = "azure"
openai.api_key = OPENAI_KEY
openai.api_base = f"https://{OPENAI_SERVICE}.openai.azure.com"
openai.api_version = "2022-12-01"
def compute_embedding(text):
return openai.Embedding.create(engine=OPENAI_DEPLOYMENT, input=text)["data"][0]["embedding"]
With the computed sections from create_sections we can batch-upload them (in batches of 1,000 documents) into our Search Index:
from azure.search.documents import SearchClient

def index_sections(filename, sections):
"""
Indexes sections from a file into a search index.
:param filename: The name of the file being indexed.
:param sections: The sections of text to index.
"""
search_client = SearchClient(endpoint=f"https://{SEARCH_SERVICE}.search.windows.net/",
index_name=SEARCH_INDEX,
credential=SEARCH_CREDS)
batch = []
for i, section in enumerate(sections, 1):
batch.append(section)
if i % 1000 == 0:
search_client.upload_documents(documents=batch)
batch = []
if batch:
search_client.upload_documents(documents=batch)
# filename and sections from previous steps
index_sections(filename, sections)
The result of this last ingestion step is a fully populated Search Index, ready to be consumed.
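Putting the pieces together, the ingestion steps above can be chained roughly as follows; this is a sketch that reuses the functions defined earlier and assumes a 1,000-character section length:

pdf_path = "path/to/your/pdf/file.pdf"
filename = "file.pdf"

page_map = []
for page_num, page_stream in enumerate(split_pdf_to_pages(pdf_path)):
    # Analyze each single-page document; keep our own page numbering since
    # Form Recognizer only sees one page at a time
    for _, offset, page_text in get_document_text_from_content(page_stream.getvalue()):
        page_map.append((page_num, offset, page_text))

# Chunk the extracted text, compute embeddings and upload to the index
sections = create_sections(filename, page_map, max_section_length=1000)
index_sections(filename, sections)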
Now we can query our Search Index endpoint, using:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
def search_index(query, endpoint, index_name, api_key):
"""
Searches the indexed data in Azure Search.
:param query: The search query string.
:param endpoint: Azure Search service endpoint URL.
:param index_name: Name of the Azure Search index.
:param api_key: Azure Search API key.
:return: Search results.
"""
credential = AzureKeyCredential(api_key)
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)
results = search_client.search(query)
return [result for result in results]
# Example usage
endpoint = 'https://[service-name].search.windows.net' # Replace with your service endpoint
index_name = 'your-index-name' # Replace with your index name
api_key = 'your-api-key' # Replace with your API key
search_query = 'example search text'
search_results = search_index(search_query, endpoint, index_name, api_key)
for result in search_results:
print(result) # Process each search result as needed
After retrieval, we can enrich the search results with GPT-4:
import openai
def enrich_with_gpt(result, openai_api_key):
"""
Enriches the search result with additional information generated by OpenAI GPT-4.
:param result: The search result item to enrich.
:param openai_api_key: OpenAI API key.
:return: Enriched information.
"""
openai.api_key = openai_api_key
# Construct a prompt based on the result for GPT-4
prompt = f"Based on the following search result: {result}, generate additional insights."
    # Call GPT-4 to generate additional information. GPT-4 is a chat model, so we
    # use the chat completions API (on Azure this requires an api_version that
    # supports it, e.g. 2023-05-15); "gpt4-32k" is the deployment name.
    response = openai.ChatCompletion.create(
        engine="gpt4-32k",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150
    )
    return response.choices[0].message.content.strip()
# Example usage
openai_api_key = 'your-openai-api-key' # Replace with your OpenAI API key
enriched_results = []
for result in search_results:
enriched_info = enrich_with_gpt(result, openai_api_key)
enriched_results.append((result, enriched_info))
for result, enriched_info in enriched_results:
print("Original Result:", result)
print("Enriched Information:", enriched_info)
print("-----")
As mentioned at the beginning of this blog post, these code snippets can serve as the basis of your RAG ingestion & consumption pipeline. But to increase the effectiveness of your RAG implementation, there are a lot of improvements that can be made.
In the rapidly evolving world of data analytics and AI, new frameworks are popping up every day to enhance the way we interact with and understand our data. One of the most exciting developments is the integration of tailor-made Large Language Models (LLM) into business processes. These models, like Azure OpenAI's GPT-4, are transforming how companies get insights from their data. In this post, we'll dive into how you can leverage Retrieval-Augmented Generation (RAG) using Azure OpenAI and Azure Cognitive Search to create a CoPilot experience for your data.
Retrieval-Augmented Generation is an architecture that combines the best of two worlds: the knowledge and natural language understanding of LLMs and the precision of an effective search index. While there are other alternatives available, RAG stands out for its ease of use and effectiveness.
As shown in the diagram above, RAG works by first retrieving relevant information from, e.g., a search index, and then using that specific knowledge from the data sources to generate more informed and accurate responses using GPT. This makes it particularly valuable for businesses looking to extract insights from their existing data, while having the ability to form a proper natural-language response.
Implementing RAG involves two main components: setting up an ingestion pipeline that indexes all your business-specific data, and setting up a RAG consumption pipeline that retrieves the most relevant indexed content for a user's question and passes it to the LLM to formulate an answer.
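To make the consumption pipeline concrete, here is a minimal sketch that retrieves the top matching chunks from the index and asks GPT to answer using only those chunks; resource names, deployment names and keys are placeholders:

import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

openai.api_type = "azure"
openai.api_base = "https://<your-openai-resource>.openai.azure.com"
openai.api_version = "2023-05-15"
openai.api_key = "<your-openai-key>"

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-search-key>"),
)

def answer(question, top=3):
    # 1. Retrieve the most relevant chunks from the search index
    results = search_client.search(question, top=top)
    context = "\n\n".join(doc["content"] for doc in results)
    # 2. Ask GPT to answer grounded in those chunks only
    response = openai.ChatCompletion.create(
        engine="<your-gpt-deployment>",
        messages=[
            {"role": "system", "content": "Answer using only the provided sources."},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What does our travel policy say about rental cars?"))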
An essential part of implementing RAG is testing and optimizing the performance of your Azure Cognitive Search Index. You can use an LLM like GPT-4 to generate question-answer pairs relevant to your data. This approach allows you to test how well your index performs, giving you insights into the effectiveness of your data indexing strategies, chunking, overlap size, and data enrichment processes. A simple approach using synthetic QA generation is described here.
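A rough sketch of what synthetic question-answer generation and a simple retrieval check could look like, reusing the clients from the sketch above (the deployment name is a placeholder):

def generate_qa_pairs(chunk, n=3):
    # Ask GPT to produce question/answer pairs grounded in one indexed chunk
    prompt = (
        f"Generate {n} question and answer pairs about the following text. "
        f"Return each pair on its own line as 'Q: ... A: ...'.\n\n{chunk}"
    )
    response = openai.ChatCompletion.create(
        engine="<your-gpt-deployment>",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.splitlines()

def recall_at_k(question, expected_chunk_id, k=3):
    # Does the source chunk come back in the top-k results for its own question?
    results = search_client.search(question, top=k)
    return any(doc["id"] == expected_chunk_id for doc in results)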
Adding RAG to your data pipeline using Azure OpenAI and Azure Cognitive Search is not just about keeping up with the latest tech trends. It's about unlocking new levels of understanding and insights from your data, tailored specifically to your company's data. With the steps outlined above, you're well on your way to implementing a powerful CoPilot for your data.
It’s 2023, everyone is playing around with ChatGPT on OpenAI and a new and improved version of GPT is being released every couple of months with lots of new improvements which make you even more effective.
The problem with these new models is that they’re only available for customers with OpenAI plus subscriptions, setting you back $20 each month (excluding tax). Instead of using OpenAI’s version, you can also set up your own gpt-4 instance on Azure OpenAI.
Luckily there are lots of open source solutions that have you covered. All very much inspired by ChatGPT’s interface, showing a prompt history, word per word output and nice formatting.
A personal favorite of mine is chatbot-ui. I run this as a docker container on a raspberry pi in my local network, accessible only within my local network or with a proper VPN connection:
docker run \
-e OPENAI_API_KEY=YOURKEY \
-e AZURE_DEPLOYMENT_ID=YOURDEPLOYMENTNAME \
-e OPENAI_API_HOST=https://YOURENDPOINT.openai.azure.com \
-e OPENAI_API_TYPE=azure \
-e DEFAULT_MODEL=gpt-4-32k \
-p 3000:3000 \
bartmsft/chatbot-ui:main
*don't ever run UIs like this, with your OpenAI GPT keys, on the public internet
Simply replace OPENAI_API_KEY, AZURE_DEPLOYMENT_ID and OPENAI_API_HOST with respectively your primary key, deployment name and Azure OpenAI endpoint. Your UI will be available on http://localhost:3000.
Update May 2024: As of earlier this year, the original maintainer of chatbot-ui moved to a new interface, adding more features but also lots of extra dependencies. Therefore I'm using a personal fork of the old interface, published on Docker Hub under bartmsft.
Before we dive in, let's quickly define what a managed identity is. In Azure, a managed identity is a service principal that's automatically managed by Azure. It provides an identity for applications to use when connecting to resources that support Azure Active Directory (Azure AD) authentication.
While Role-Based Access Control (RBAC) is a common practice in many Azure services to manage access, when it comes to Databricks, there are specific reasons to opt for Databricks' own access control management:
In summary, while RBAC is a valuable tool for many Azure services, when working within Databricks, leveraging its custom access control mechanisms can offer more precise, flexible, and streamlined management of permissions and access.
In order to programmatically set up these permissions, we can leverage the SCIM API to register a managed identity as a service principal and assign it to a group. These groups can have specific entitlements for specific workspaces. One way to do this is shown in the diagram below:
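For illustration, here is a rough sketch of what that registration could look like with plain REST calls from Python; the endpoint paths and payload shapes are based on the Databricks SCIM API and, together with the admin token, should be checked against the current documentation:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"
ADMIN_TOKEN = "<admin-pat-or-aad-token>"      # token of a workspace admin
APP_ID = "<managed-identity-application-id>"  # Application (client) ID of the managed identity
GROUP_ID = "<existing-databricks-group-id>"

headers = {
    "Authorization": f"Bearer {ADMIN_TOKEN}",
    "Content-Type": "application/scim+json",
}

# 1. Register the managed identity as a service principal in the workspace
sp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers=headers,
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "applicationId": APP_ID,
        "displayName": "appsvc-id",
    },
).json()

# 2. Add the service principal to a group that carries the required entitlements
requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/Groups/{GROUP_ID}",
    headers=headers,
    json={
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [{"op": "add", "value": {"members": [{"value": sp["id"]}]}}],
    },
)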
To set up a managed identity for your existing Web App, enable the system-assigned managed identity in the Azure portal (under the Web App's Identity blade) and copy the Application ID of the service principal it creates.
After creating the managed identity, you'll need to assign it to your Databricks workspace:
1. In the Azure portal, go to your Databricks workspace.
2. In the Admin settings (right top), go to the "Service principals" section.
3. Click on Add Service Principal, add the Application ID you copied earlier and tick "Allow workspace access" to ensure the Service Principal has sufficient privileges to run notebooks.
Now that we have a managed identity set up and assigned to Databricks, we can use it to run Databricks notebooks from a web application. Here's a sample code snippet in Python that shows how to do this:
import requests
from azure.identity import DefaultAzureCredential

# Get a token for the managed identity, scoped to the Azure Databricks resource
credential = DefaultAzureCredential()
token = credential.get_token('2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default')

# Call the Databricks REST API directly with the AAD token
databricks_url = 'https://<databricks-instance>'
headers = {'Authorization': f'Bearer {token.token}'}

# Create a new job that runs the notebook (the cluster id is a placeholder)
job = requests.post(f'{databricks_url}/api/2.0/jobs/create', headers=headers, json={
    'name': 'notebook-job',
    'existing_cluster_id': '<cluster-id>',
    'notebook_task': {'notebook_path': '/path/to/notebook'}
}).json()

# Start the job
requests.post(f'{databricks_url}/api/2.0/jobs/run-now', headers=headers, json={'job_id': job['job_id']})
With this setup, you can securely run Databricks notebooks from your web application using a managed identity. The same applies for other services that support managed identity (e.g. Azure Functions, Azure Container Instances, AKS, etc) and the same also applies for User Assigned Managed Identities instead of System Assigned Managed Identities.
Instead of configuring the managed identity manually, we can leverage the azurerm and databricks Terraform providers to automate the creation of the managed identity and its registration in the Databricks workspace. The databricks provider uses the SCIM API under the hood, and here's an example of how to tie everything together in Terraform:
provider "azurerm" {
features {}
}
data "azurerm_databricks_workspace" "workspace" {
name = "your-databricks-workspace-name"
resource_group_name = "your-databricks-rg-name"
}
provider "databricks" {
host = data.azurerm_databricks_workspace.workspace.workspace_url
azure_workspace_resource_id = data.azurerm_databricks_workspace.workspace.id
}
resource "azurerm_resource_group" "example" {
name = "example-resources"
location = "West Europe"
}
resource "azurerm_service_plan" "example" {
name = "example"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_resource_group.example.location
os_type = "Linux"
sku_name = "P1v2"
}
resource "azurerm_linux_web_app" "example" {
name = "example"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_service_plan.example.location
service_plan_id = azurerm_service_plan.example.id
site_config {}
identity {
type = "SystemAssigned"
}
}
/*
The code below adds the Managed Identity service principal to the databricks
admin group using the Databricks Terraform Provider.
This is required to allow the WebApp to access the Databricks REST API.
*/
data "databricks_group" "admin" {
display_name = "admin" // existing admin group in databricks workspace
}
resource "databricks_service_principal" "sp" {
application_id = azurerm_linux_web_app.example.identity[0].client_id
display_name = "appsvc-id"
}
resource "databricks_group_member" "admin_group" {
group_id = data.databricks_group.admin.id
member_id = databricks_service_principal.sp.id
}
Managed identities offer a secure and simple way to authenticate to services like Databricks. By using a managed identity, you can eliminate the need to store sensitive credentials in your code and simplify your authentication process. Assigning the managed identity to Databricks allows us to have granular control over what the identity (i.e. the Web App) can access, and this will allow you to e.g. trigger Databricks workloads from other resources in Azure.
Azure Kubernetes Service (AKS) comes with a Cluster Autoscaler (CA) that can automatically add nodes to the node pool based on the load of the cluster (based on CPU/memory usage). KEDA is active on the pod-level and uses Horizontal Pod Autoscaling (HPA) to dynamically add additional pods based on the configured scaler. CA and KEDA therefore go hand-in-hand when managing dynamic workloads on an AKS cluster since they scale on different dimensions, based on different rules, as shown below:
An overview of KEDA that scales an App based on the Topic Queue size of Azure Service Bus is shown in this diagram:
The app is deployed together with a KEDA-backed ScaledObject. This ScaledObject supports minReplicaCount and maxReplicaCount, which define the range of concurrent replicas that can exist for the app. Furthermore, a scale trigger object is defined inside the ScaledObject that defines the scaling criteria and conditions for scaling up and down.
Although optional, the diagram above also uses Pod Identity. Similar to how secrets are fetched from Azure Key Vault inside containers, Pod Identity is used with KEDA to directly subscribe to, e.g., an Azure Service Bus Topic and scale the pods without passing any connection strings, by specifying its AzureIdentityBinding.
The KEDA Helm Chart needs to be installed on the AKS cluster and configured to use AzureIdentityBinding to access resources in Azure. This Azure AD Identity needs to have sufficient RBAC permissions to directly access the required resources in Azure.
The ScaledObject is defined as follows and is deployed along with the application deployment specified under scaleTargetRef. This needs to match the deployment name and must be deployed in the same Kubernetes namespace.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: msg-processor-scaler
spec:
scaleTargetRef:
name: msg-processor # must be in the same namespace as the ScaledObject
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: azure-servicebus
metadata:
namespace: SERVICE_BUS_NAMESPACE
topicName: SERVICE_BUS_TOPIC
subscriptionName: SERVICE_BUS_TOPIC_SUBSCRIPTION
messageCount: "5"
authenticationRef:
name: trigger-auth-service-bus-msgs
It defines the type of scale trigger we would like to use and the scaling criteria specified under the metadata object. Lots of different KEDA scalers are available out of the box and details can be found by going through the KEDA documentation.
In the example above, we use an azure-servicebus scaler and would like to scale out if there are more than 5 unprocessed messages on the topic subscription SERVICE_BUS_TOPIC_SUBSCRIPTION on the SERVICE_BUS_TOPIC topic in the SERVICE_BUS_NAMESPACE Service Bus namespace. Scaling will go up to a maximum of 10 concurrent replicas, defined via maxReplicaCount, and there will always be a 1 pod minimum as defined by minReplicaCount.
Since we are using Pod Identity, we also specify the authenticationRef for the ScaledObject to trigger-auth-service-bus-msgs. This is a TriggerAuthentication resource that defines how KEDA should authenticate to get the metrics.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: trigger-auth-service-bus-msgs
spec:
podIdentity:
provider: azure
In this case, we are telling KEDA to use Azure as a Pod Identity provider which uses Azure AD Pod Identity. Alternatively, a full connection string can also be used without specifying a TriggerAuthentication resource.
By using a TriggerAuthentication you can easily re-use this authentication resource, but it also allows you to separate the permissions for KEDA and other resources inside your kubernetes cluster by binding them to different Azure AD Identities with different RBAC permissions.
The example above shows how to configure KEDA for autoscaling using Azure Service Bus Topics, but lots of other scalers are supported out of the box and more information can be found in KEDA Documentation - Scalers.
Note: when adding additional triggers, also ensure that Pod Identity can read from these resources by adding the corresponding RBAC permissions.
If you have deployed a fully configured AKS cluster to Azure and you are also using the accompanying Umbrella Helm chart for easy deployment of your apps then you're in luck, because it also allows you to easily add KEDA support for the applications you deploy to the AKS cluster.
Most of the documentation to get started is available on GitHub, but here's a sample on how you would deploy the application described above using this Helm Chart.
helm-app:
app:
name: app-service-name
container:
image: hello-world:latest
port: 80
keda:
enabled: true
name: app-service-name-keda
authRefName: auth-trigger-app-service-name
scaleTargetRef: app-service-name
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: azure-servicebus
metadata:
topicName: sbtopic-app-example-service
subscriptionName: sbsub-app-example-service
namespace: servicebus-app-example-ns
messageCount: 5
Using this Helm Chart, you can easily deploy your Service, Deployment, Scaled Object, AuthenticationTriggers and optionally all other resources (e.g. ingress, secretstore) your deployment requires.
Setting up an Azure Kubernetes Service (AKS) using Terraform is fairly easy. Setting up a full-fledged AKS cluster that can read images from Azure Container Registry (ACR), fetch secrets from Azure Key Vault using Pod Identity while all traffic is routed via an AKS managed Application Gateway is much harder.
To save others from all the trouble I encountered while creating this one-click-deployment, I've published a GitHub repository that serves as a boilerplate for the scenario described above, and fully deploys and configures your Azure Kubernetes Service in the cloud using a single terraform deployment.
This blog post goes into the details of the different resources that are deployed, why these specific resources are chosen and how they tie into each other. A future blog post will build upon this and explain how Helm can be used to automate your container deployments to the AKS cluster.
Azure Kubernetes Service (AKS) makes it simple to deploy a managed Kubernetes cluster in Azure. Kubernetes is an open-source container orchestration platform that automates many of the manual processes involved in deploying, managing, and scaling containerized applications. Having this cluster is ideal when you want to run multiple containerized services and don't want to worry about managing and scaling them.
Azure Container Registry (ACR) is a managed Docker registry service, and it allows you to store and manage images for all types of container deployments. Every service can be pushed to its own repository in Azure Container Registry and every codebase change in a specific service can trigger a pipeline that pushes a new version for that container to ACR with a unique tag.
AKS and ACR integration is set up during the deployment of the AKS cluster with Terraform. This allows the AKS cluster to interact with ACR, using an Azure Active Directory service principal. The Terraform deployment automatically configures RBAC permissions for the ACR resources with an appropriate ACRPull role for the service principal.
With this integration in place, AKS pods can fetch any of the Docker images that are pushed to ACR, even though ACR is set up as a private Docker registry. Don't forget to add the azurecr.io prefix for the container and specify a tag. It is best practice to not use the :latest tag, since this image always points to the latest image pushed to your repository and might introduce unwanted changes. Always pin the container to a specific version and update that version in your yaml file when you want to upgrade.
A simple example for a pod that's running a container from the youracrname.azurecr.io container registry, the test-container repository with tag 20210301, is shown below:
---
apiVersion: v1
kind: Pod
metadata:
name: test-container
spec:
containers:
- name: test-container
image: youracrname.azurecr.io/test-container:20210301
AAD Pod Identity enables Kubernetes applications to access cloud resources securely with Azure Active Directory. It's best practice to not use fixed credentials within pods or container images, as they are at risk of exposure or abuse. Instead, we're using pod identities to request access using Azure AD.
When a pod needs access to other Azure services, such as Cosmos DB, Key Vault, or Blob Storage, the pod needs access credentials. You don't manually define credentials for pods, instead they request an access token in real time, and can use it to only access their assigned services that are defined for that identity.
Pod Identity is fully configured on the AKS cluster when the Terraform script is deployed, and pods inside the AKS cluster can use the preconfigured pod identity by specifying the corresponding aadpodidbinding pod label.
Once the identity binding is deployed, any pod in the AKS cluster can use it by matching the pod label as follows:
apiVersion: v1
kind: Pod
metadata:
name: demo
labels:
aadpodidbinding: $IDENTITY_NAME
spec:
containers:
- name: demo
image: mcr.microsoft.com/oss/azure/aad-pod-identity/demo:latest
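Inside such a pod, application code can then request tokens at runtime without any stored secrets. A minimal sketch using azure-identity, with a placeholder storage account, could look like this:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# With aad-pod-identity bound via the pod label, DefaultAzureCredential picks up
# the pod's identity; no connection strings or keys are mounted into the pod
credential = DefaultAzureCredential()
blob_service = BlobServiceClient(
    account_url="https://<storageaccount>.blob.core.windows.net",
    credential=credential,
)

for container in blob_service.list_containers():
    print(container.name)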
Some of the services in the AKS cluster connect to external services. The connection strings and other secret values that are needed by the pods are stored in Azure Key Vault. By storing these variables in Key Vault, we ensure that these secrets are not versioned in the git repository as code, and not accessible to anyone that has access to the AKS cluster.
To securely mount these connection strings, pod identity is used to mount these secrets in the pods and make them available to the container as environment variables. The flow for a pod fetching these variables is shown in the diagram below:
Luckily, there's no need to worry about all these different data flows, and we can just deploy the provided Azure Key Vault Provider for Secrets Store CSI Driver on the AKS cluster. This provider leverages pod identity on the cluster and provides a SecretProviderClass with all the secrets that you want to fetch from Azure Key Vault.
A basic example for setting up a SecretStore from KeyVault kvname, getting secret secret1 and exposing it as a SecretStore named kv-secrets:
apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
name: kv-secrets
spec:
provider: azure
parameters:
    usePodIdentity: "true" # set to "true" to enable Pod Identity
keyvaultName: "kvname" # the name of the KeyVault
objects: |
array:
- |
objectName: secret1 # name of the secret in KeyVault
objectType: secret
tenantId: "" # tenant ID of the KeyVault
To use the secret secret1 from kv-secrets in a pod and make it available in the nginx container as the environment variable SECRET_ENV:
spec:
containers:
- image: nginx
name: nginx
env:
- name: SECRET_ENV
valueFrom:
secretKeyRef:
name: kv-secrets
key: secret1
All traffic that accesses the AKS cluster is routed via an Azure Application Gateway. The Application Gateway acts as a Load Balancer and routes the incoming traffic to the corresponding services in AKS.
Specifically, Application Gateway Ingress Controller (AGIC) is used. This Ingress Controller is deployed on the AKS Cluster on its own pod. AGIC monitors the Kubernetes cluster it is hosted on and continuously updates an Application Gateway, so that selected services are exposed to the Internet on the specified URL paths & ports straight from the Ingress rules defined in AKS. This flow is shown in the diagram below:
As shown in the overview diagram, the client reaches a Public IP endpoint before the request is forwarded to the Application Gateway. This Public IP is deployed as part of the Terraform deployment on an Azure-based FQDN (e.g. your-app.westeurope.cloudapp.azure.com).
That's a lot of detail on the inner workings of these resources. Fortunately, all of these configurations happen for you and all you need to do is set up the tfvars variables for your environment. Keep an eye out for the upcoming blog post on setting up Helm to automate your container deployments.
What are you waiting for? Clone the repo and start deploying your fully configured AKS cluster.
Last week, we teamed up with 20 different developers for a hackathon to create a chat bot for the Eurosonic Noorderslag festival. Eurosonic Noorderslag is an annual festival for new and upcoming artists with over 40,000 attendees. The purpose of our bot is to provide the attendees with near real-time answers to everything related to the festival. Together with these developers we built a fully functioning bot using the Azure Bot Framework.
In this blog I'll briefly introduce the technology we have used for the bot, and I'll go over our approach for getting the chat bot ready in two days without any prior experience with creating bots.
The foundation of our bot is the Azure Bot Framework. This framework comes with several advantages:
New bots can be either created via botframework.com or in the Azure Portal. When you create a bot an easy walkthrough is provided to pick a programming language (C# or Node.JS), set up a basic template for your desired functionality, connect to the appropriate chat platforms and provide connectors to enabled external providers (e.g. LUIS).
The biggest reasons why bots are so powerful today is because of the newly trained language interpretation systems behind the bots. Microsoft's flavor for this is known as Language Understanding Intelligent Service (LUIS) and it serves as the interpretation system between the text input and the data processing used to provide an output that makes sense. The different capabilities are explained in the remainder of this section and illustrated in the figure below:
Whenever LUIS is queried with a sentence, it first tries to determine its intent. For example, if you have a GetFood intent to return food places, you want this functionality to trigger with a variety of sentences, like 'I am hungry', 'I want food', 'I am starving' etc. Completely different sentences as you can see, but because of the built-in language interpretation, LUIS knows there's a relation between starving/hungry/want food and LUIS will trigger the GetFood intent for all three sentences.
Another building block which LUIS provides are entities. Whenever a sentence is put through LUIS, it can extract certain keywords in these sentences and pass them to the BotFramework as separate entities. Think of the earlier example to get food. When a user for example asks 'I want pizza' or 'I am looking for a restaurant that serves orange juice', LUIS can be trained to detect food entities and extract relatively pizza and orange juice from these two sentences.
Lastly, phrase lists in LUIS are ideal to link certain keywords to each other. Different food entities, as described in the previous section, can be easily linked together with phrase lists so you don't have to train LUIS to detect every type of food. Instead, you can train LUIS to recognize e.g. pizza, and add a comma-separated food phrase list with all the different types of food you wish to detect.
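To make this concrete, here is a rough sketch of querying a LUIS app over its classic v2 prediction REST endpoint from Python; the region, app ID and key are placeholders, and the exact endpoint shape depends on the LUIS version you use:

import requests

LUIS_ENDPOINT = "https://westeurope.api.cognitive.microsoft.com/luis/v2.0/apps"
LUIS_APP_ID = "your-luis-app-id"
LUIS_KEY = "your-luis-key"

def interpret(utterance):
    response = requests.get(
        f"{LUIS_ENDPOINT}/{LUIS_APP_ID}",
        params={"subscription-key": LUIS_KEY, "q": utterance},
    )
    result = response.json()
    # topScoringIntent holds the intent LUIS matched (e.g. GetFood),
    # entities holds the extracted keywords (e.g. "orange juice")
    return result["topScoringIntent"]["intent"], result["entities"]

print(interpret("I am looking for a restaurant that serves orange juice"))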
Since we only had two days to finish the bot, we didn't organize a traditional hackathon where each team competes with each other for the best idea. We decided to collaborate with each other and to split the group up into different teams for different features. We also had an additional team which went out on the street to ask the festival attendees what functionalities they would be looking for when talking with the festival bot.
Even though every team worked on different functionalities, every functionality got merged into the same GitHub repository. After 30 hours, +300 commits and a lot of last minute bug fixing we managed to create our chat bot: Sonic.
To experience the bot yourself, you can try out the bot on Facebook Messenger. Or have a look at the video below showing most of its functionalities:
Before joining Microsoft a little over three months ago, I was not familiar with the Azure Cloud whatsoever and I replaced my Windows machine with an Apple MacBook several years ago without ever looking back. I have been a software engineer for the last 10 years, but when working in the cloud I have always developed in Amazon's AWS Cloud while working at various startups.
When I signed up to become a Technical Evangelist at Microsoft I knew this was about to change. I spent the last month and a half onboarding to Microsoft's Windows OS and Azure Cloud. Five years ago, I wouldn't have even considered joining MSFT, but due to Microsoft's new vision, focus on Cloud productivity and innovation I'm proud to call myself a Technical Evangelist at Microsoft.
In this blog I will be sharing my Azure onboarding experiences while trying to stay unbiased ;)
I think my ongoing journey can be described in a simple excitement-over-time graph shown below which I will try to explain in further detail in the remainder of this blog:
When I first opened the Azure Portal, the first thing I noticed is the extensive amount of options, items and resources I could pick from which felt really overwhelming.
AWS focuses on getting everything done via the Command Line Interface (CLI), whereas Azure tends to put more focus on its UI-based Web Portal. Even though the latter is nice for less experienced developers, it takes some getting used to.
Luckily Azure fully functions via the Command Line Interface as well (azure-cli), but does not seem to promote it in the way that Amazon does. Every tutorial/webinar/blog seems to use the Portal-approach and guide you through their code with various Portal screenshots. This is why it actually took me a couple of days before I found out there was a CLI :)
Yes, Azure has an equivalent for pretty much every AWS service. It just takes some getting used to since they both use different naming conventions. A useful chart that helped me out a lot can be found here. AWS' EC2 and S3 can be found in Azure under respectively Virtual Machines and Azure Storage. And even more specific functionality, e.g. serverless computing where AWS uses Lambda, can be found in Azure under Function Apps.
The biggest concern I had when I joined Microsoft was that I had no .NET/C# experience whatsoever. A language and environment that have been going hand-in-hand with the Microsoft ecosystem. Fortunately, this does not hold me back in my productivity and programming capabilities at all in these days at Microsoft. The time where you could only run your code in sandboxed Windows Machines supporting solely ASP.NET with SQL server is over. Microsoft embraces every kind of programming and honestly does not care whether you are running a .NET application with SQL or a NodeJS application with MongoDB as a backend.
Historically, the path that Microsoft took is rather surprising. Especially when we look at a quote of Microsoft's former CEO Steve Ballmer in 2001:
"Linux is a cancer that attaches itself in an intellectual property sense to everything it touches"
Times sure have changed and even Ballmer seems to agree. At the time, Microsoft was fighting against the open source Linux community but as of lately actually embraces the whole OSS scene. Being the largest open source contributor on GitHub, SQL server running on Linux, Ubuntu running on Windows 10 and the list goes on. And it does result in some crazy setups which were unthinkable a couple of years ago:
Of course there are still scenarios where you wish you grew up in the MSFT ecosystem. One of the interesting companies that Microsoft acquired is Xamarin. Xamarin allows you to build cross platform mobile applications written in one programming language (instead of 3 different ones). Unfortunately for me, this language is C# ;)
One thing that I was used to on Amazon's AWS cloud was setting up Virtual Machines to host my web services. Even though AWS fully supports PaaS services, I have never experimented with this. With Azure, the focus truly lies on PaaS and even SaaS and it's currently trying to win ground for IaaS solutions. That said, a frictionless migration is possible for all your VMs. Furthermore, Azure offers a lot of PaaS services, such as the App Service which automagically maintains, scales up/down and provides lots of insight for your web services.
Another great example of easy-to-use technology, are Microsoft Cognitive Services. This is an attempt to democratize the otherwise extremely complex AI and Machine Learning possibilities. Cognitive Services exposes simple APIs which developers can leverage to use extremely well-trained models to interpret & analyze images, audio and video. The nice thing about this, is that it does not require a hardcore developer to use these services. Anyone with little programming experience can leverage these models.
One of the things that bothered me a lot in the beginning is that the Azure Portal tends to be slow at times. I however later found out that this is because I am using an internal Microsoft subscriptions which is known to be less fast. Not just the interface felt slow, but I also feel that deploying new instances in Azure takes a bit longer than necessary imho. Deploying a cluster of Azure Container Services easily takes up to 20mins which I don't fully understand because all they need to do is copy over a bunch of images, right?
Another thing which I noticed is that Azure does suffer from occasional outages. Even though these outages usually only affect specific regions, I believe this isn't something that should still happen anno 2016. With all that said, I do certainly think that the pros outweigh these two cons. With the PaaS/SaaS solutions, the openness in which you can use any language/toolset and the large amount of capabilities that Azure brings, I can definitely say I'm hooked.
One big aspect which I do not consider in this blog is the money aspect of it all, which can be a really decisive argument for choosing a Cloud to build on.
Even more important is the actual cloud performance you are getting. Even though certain performance indicators are advertised, I have not analyzed whether both AWS and Azure live up to these standards.