Distributing your Search Ingest pipeline using Dapr

In the ever-evolving landscape of cloud-native applications, the need for scalable and resilient architectures has never been more critical. Traditional monolithic systems often fall short when it comes to handling large volumes of data and complex workflows. Enter Dapr: Distributed Application Runtime, a powerful framework that helps you build highly scalable and resilient microservices architecture. In this blog post, we'll explore how you can leverage Dapr to distribute your search ingest pipeline, enhancing scalability, resiliency, and maintainability.

Why Dapr?

Traditional monolithic architectures often struggle with the demands of processing large volumes of documents for search indexing. Challenges include scalability, resiliency, and efficient orchestration of the various services involved in extracting, processing, and enriching document content. All these services have their own rate limits, which need to be managed carefully with proper back-off and retry strategies optimized for that specific service. Enter Dapr, with its microservices-oriented approach, offering a compelling solution to these challenges.

Comparison

Aspect	Monolithic Approach	Dapr-based Solution
Scalability	❌ Limited; scaling the entire application can be inefficient	✅ High; individual components can be scaled independently
Resiliency	❌ Retry policies and error handling can be complex	✅ Improved; easier to manage with Dapr's built-in mechanisms
Kubernetes Integration	❌ May require additional configuration	✅ Kubernetes-native; designed to work seamlessly with k8s
Monitoring	❌ Custom setup required for metrics and tracing	✅ Built-in support for Application Insights and other monitoring tools
Componentization	❌ All logic is within a single application	✅ Logic is distributed across multiple Dapr applications
Complexity	✅ Single application to manage	❌ Multiple applications increase management complexity
Asynchronous Processing	❌ Can be challenging to implement and track	✅ Native support for async operations, but tracking can be complex
Overhead	✅ Potentially lower as there's a single runtime	❌ Additional overhead due to Dapr sidecars and messaging/statestore components

Dapr shows significant improvements, adding just a bit of complexity and overhead

How does it work?

Dapr facilitates the development of distributed applications by providing building blocks that abstract away the complexities of direct communication between microservices, state management, and resource binding. By leveraging Dapr in a document processing pipeline, each stage of the process becomes a separate Dapr application, capable of independently scaling and recovering from failures.

Typical workflow

Consider a workflow designed to ingest, process, and index documents for search. The process involves several stages, from extracting text from documents to enriching the content with metadata, generating embeddings, and ultimately indexing the documents for search. With Dapr, each of these stages can be implemented as a separate microservice, communicating through Dapr's pub/sub messaging and utilizing shared state management and resource bindings.

The workflow typically includes the following stages:

Batcher: Initiates the process by listing documents from a storage service and triggering the document processing pipeline.
Process Document: Extracts text from documents using OCR or other text extraction tools and splits the content into manageable chunks.
Generate Embeddings: Converts text chunks into vector representations using machine learning models, facilitating efficient search and similarity comparisons.
Generate Keyphrases and Summaries: Enriches the content with key phrases and summaries, enhancing the searchability and relevance of the documents.
Indexing: Once all enrichments are complete, the processed documents are added to a search index, making them searchable.

Typical Dapr-approach for distributing search ingest

To summarize the flow:

Batcher is triggered by an HTTP request, retrieving every single document in the given blob path and adding these to a queue.
ProcessDocument is subscribed to this queue, pulling the raw document from blob and extracting its content using Form Recognizer/Document Intelligence, splitting it up into multiple manageable sections. Each section is added to 3 enrichment queues (GenerateEmbeddings, GenerateKeyphrases, GenerateSummary).
These 3 enrichment queues are processed in parallel, triggered with a reference to the section that's pulled from the statestore (e.g. Redis), enriched using an external service (e.g. OpenAI embedding, Azure Language API) with the enrichment stored back into the statestore and triggering an EnrichmentComplete .
Once all enrichments are captured for a single section, the section is stored into Blob for indexing and DocumentCompleted is triggered to notify the section is finished.
Similarly, once all sections for a single Document are processed, BatchCompleted is triggered to notify the Document is fully processed.
Once BatchCompleted has been triggered for all Documents that needed processing in the pipeline, Azure Search Indexer is started, pulling all sections from Blob to populate the search index.

This GitHub repository can serve as inspiration for implementing this flow, including scripts to deploy the infrastructure, local development and deployment scripts for deploying these services to Kubernetes.

Conclusion

Adopting Dapr for your search ingest pipeline can be a game-changer. It offers significant advantages in scalability, resiliency, and maintainability, making it a strategic investment for future-proofing your applications. While it introduces some complexity and overhead, the benefits of a microservices-oriented architecture, particularly in a Kubernetes environment, far outweigh these trade-offs.

Splitting each service into its own Dapr container provides several key advantages:

Granular Control: You can set specific rate limits for each external service, ensuring that no single service becomes a bottleneck.
Retry Mechanisms: By breaking down the ingestion flow into smaller, independent pieces, you can easily retry only the failed service without having to reprocess the entire workflow. This makes the system more efficient and resilient to errors.

For more details and to access the deployment scripts, visit the search-ingest-dapr GitHub repository.