Many organizations still publish critical business data through reports available on websites — often as a combination of HTML content and attachments such as PDFs, scanned paper documents, XML files, or DOCX documents. The challenge begins when this data needs to be automatically collected, classified, and transformed into a structured format suitable for analytics and downstream systems.
In one of our GCP-based Data/ML projects, we built a pipeline designed to automatically ingest and process reports published on a client’s website. A key component of the solution was the use of Gemini models available in Gemini Enterprise Agent Platform, previously known as Vertex AI.
Project Goal
The primary objective of the pipeline was to automate the end-to-end processing of reports published on the client’s platform. The solution needed to retrieve website content, analyze attached documents, extract predefined business fields, classify report types, validate extracted data, and finally load the structured results into BigQuery tables. One of the more challenging requirements was support for both digital documents and scanned paper documents stored as PDFs.
Solution Architecture
The solution was built entirely within the Google Cloud ecosystem using services such as BigQuery Studio, Gemini Enterprise Agent Platform, BigQuery, and Cloud Storage. Most of the orchestration and parsing logic was implemented directly in Python notebooks running inside BigQuery Studio, which allowed for rapid experimentation and close integration with analytical datasets.

The ingestion layer was responsible for analyzing HTML pages, extracting metadata, and identifying downloadable attachments linked to reports. The pipeline processed a wide range of document types, including PDFs, XML files, and DOCX documents. In many cases, the webpage itself contained only a brief summary, while the actual business-critical information was embedded within attached files.
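As an illustration of the attachment-discovery step, the sketch below scans a report page for links to supported file types using only the Python standard library. The names `AttachmentLinkParser`, `find_attachments`, and `ATTACHMENT_EXTENSIONS` are hypothetical, and the real pipeline may well have used a different HTML parser or selection logic.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Extensions treated as report attachments (assumed from the formats named in the text).
ATTACHMENT_EXTENSIONS = {".pdf", ".xml", ".docx"}


class AttachmentLinkParser(HTMLParser):
    """Collects absolute URLs of <a> links that point at supported attachment types."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.attachments: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Look only at the path's extension so query strings do not interfere.
        extension = posixpath.splitext(urlparse(absolute).path)[1].lower()
        if extension in ATTACHMENT_EXTENSIONS and absolute not in self.attachments:
            self.attachments.append(absolute)


def find_attachments(html: str, base_url: str) -> list[str]:
    """Return attachment URLs found in the page, in document order, without duplicates."""
    parser = AttachmentLinkParser(base_url)
    parser.feed(html)
    return parser.attachments
```

Calling `find_attachments(page_html, page_url)` returns the list of candidate files to download, which a later step can filter before anything is sent to the model.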
Processing PDFs and Scanned Documents
One of the most technically demanding aspects of the project involved handling scanned paper documents stored as PDFs. Traditional document-processing architectures often require separate OCR pipelines, image preprocessing stages, text extraction services, and custom parsing heuristics. Maintaining such systems can quickly become complex, especially when document quality varies significantly.
Instead of building a dedicated OCR workflow, we leveraged the multimodal capabilities of Gemini models available through the Gemini API. This allowed the same model to process both text-based documents and image-based scans, significantly simplifying the architecture and reducing operational complexity.
Data Extraction with Gemini
The core logic of the pipeline was implemented around Gemini-driven document understanding. The Python notebook retrieved webpage content, selected the most relevant attachments, and passed both the textual context and documents to the Gemini API. The model was instructed not only to analyze the content itself but also to classify the report type and extract predefined business attributes.
To improve reliability and consistency, the prompts included explicit extraction instructions together with structured output requirements. Rather than returning free-form text, Gemini was expected to generate structured responses aligned with predefined schemas.
An important part of the solution was the use of Pydantic models for defining the target data structure. The schemas described fields such as report type, publication date, and optional metadata attributes. In practice, this acted as a strict contract between the application layer and the LLM, significantly improving output consistency and simplifying validation.
The combination of Gemini and Pydantic proved especially effective because it reduced ambiguity in the extraction process while enabling automatic type validation and downstream error handling.
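To make the idea of a schema-as-contract concrete, here is a minimal sketch of what such a model could look like, assuming Pydantic v2. Only the report type and publication date come from the description above; `issuer_name`, the example category values, and the wording of the descriptions are hypothetical. The nullable fields are declared without defaults, so they must be present in the response but may be `null`.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ReportModel(BaseModel):
    """Illustrative target schema; not the project's actual field list."""

    report_type: str = Field(
        description="Business category of the report, for example 'annual' or 'quarterly'."
    )
    publication_date: Optional[date] = Field(
        description="Publication date in ISO 8601 format (YYYY-MM-DD); null if not stated."
    )
    # Hypothetical optional metadata attribute.
    issuer_name: Optional[str] = Field(
        description="Name of the entity that issued the report; null if not stated."
    )


def parse_report(raw_json: str) -> ReportModel:
    """Parse a JSON string and validate it against the schema (raises ValidationError)."""
    return ReportModel.model_validate_json(raw_json)
```

With this shape, a response such as `{"report_type": "annual", "publication_date": "2024-03-31", "issuer_name": null}` parses into typed Python values, while a malformed date or a missing field raises `ValidationError` instead of silently reaching BigQuery.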
Report Classification and Validation
Beyond simple extraction, the pipeline also performed semantic classification of reports. Gemini analyzed the document content and determined the business category of each report, identified relevant sections, and evaluated document relevance. This enabled the system to route extracted records into appropriate BigQuery tables and downstream analytical workflows.
One of the major engineering concerns when working with LLMs in production environments is hallucination handling. Since the extracted values were later consumed by analytical systems, maintaining data quality was critical. Several safeguards were introduced to minimize unreliable outputs:
- enforcing structured JSON responses,
- validating outputs against Pydantic schemas,
- applying additional business validation rules,
- explicitly instructing the model not to invent missing values and to return `null` when information was not present in the source document,
- setting the model temperature parameter to `0.0` to reduce output variability.
This layered validation approach significantly improved the reliability of extracted data.
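The layering of these safeguards can be sketched roughly as follows, assuming Pydantic v2. The model `ExtractedReport`, the set `ALLOWED_REPORT_TYPES`, and both business rules are hypothetical stand-ins; the project's actual rules are not described here.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError

# Hypothetical set of categories the downstream tables accept.
ALLOWED_REPORT_TYPES = {"annual", "quarterly", "current"}


class ExtractedReport(BaseModel):
    """Simplified schema used only for this illustration."""

    report_type: str
    publication_date: Optional[date]


def validate_extraction(
    raw_json: str, today: date
) -> tuple[Optional[ExtractedReport], list[str]]:
    """Return the validated record (or None) plus a list of human-readable problems."""
    # Layers 1 and 2: the payload must be valid JSON that matches the schema.
    try:
        record = ExtractedReport.model_validate_json(raw_json)
    except ValidationError as exc:
        return None, [f"schema: {err['msg']}" for err in exc.errors()]

    # Layer 3: business rules that type validation alone cannot express.
    problems = []
    if record.report_type not in ALLOWED_REPORT_TYPES:
        problems.append(f"unknown report_type: {record.report_type!r}")
    if record.publication_date is not None and record.publication_date > today:
        problems.append("publication_date is in the future")
    return (record if not problems else None), problems
```

Returning the problems as data rather than raising makes it straightforward to log rejected documents for review instead of dropping them silently.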
The extraction process was implemented using the Gemini API with structured JSON responses and schema validation based on Pydantic models:
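    response = gemini_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=contents,
        config={
            # Ask for JSON that conforms to the Pydantic schema.
            "response_mime_type": "application/json",
            "response_schema": ReportModel,
            # Low temperature to reduce output variability.
            "temperature": 0.0,
        },
    )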
The `response_schema` parameter constrained Gemini to return data matching the predefined Pydantic schema, while setting the temperature to `0.0` reduced output variability during document extraction (it does not guarantee fully deterministic results). The Pydantic model `ReportModel` also let us define both the expected data types and detailed field descriptions for the extracted attributes, so the schema itself stated what information should be extracted from the document. These descriptive field definitions significantly improved the quality and consistency of Gemini's extraction results, especially for ambiguous or loosely structured documents.
Loading Data into BigQuery
After validation, the parsed records were loaded into BigQuery tables. The implementation relied primarily on append-based ingestion combined with partitioning strategies based on publication dates and deduplication logic using report identifiers. This structure enabled efficient querying, dashboarding, and integration with downstream ML or reporting systems.
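One way to picture the deduplication step before an append is the sketch below, written in plain Python. The field name `report_id` and the function `select_new_rows` are hypothetical, and it assumes the set of already-loaded identifiers has been fetched beforehand; the same effect could also be achieved inside BigQuery, for example with a `MERGE` statement.

```python
def select_new_rows(records: list[dict], existing_ids: set[str]) -> list[dict]:
    """Keep only records whose report_id is not yet loaded.

    Duplicates inside the batch are also dropped; the first occurrence wins.
    """
    seen = set(existing_ids)  # copy so the caller's set is not modified
    rows = []
    for record in records:
        report_id = record["report_id"]
        if report_id in seen:
            continue
        seen.add(report_id)
        rows.append(record)
    return rows
```

The returned rows are the ones that would then be appended to the date-partitioned table.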
Using BigQuery Studio as the primary development environment substantially accelerated the delivery process. The notebook-driven workflow made it easy to iterate on prompts, experiment with different Gemini models, debug extraction issues, and directly inspect analytical results without switching environments. For Data and ML engineers, this approach was considerably more efficient than building a fully productionized backend platform from the very beginning.
Challenges and Optimization
One of the main difficulties of the project was the heterogeneity of source documents. Reports varied significantly in layout, formatting, terminology, and overall quality. Some files contained structured tables, while others were effectively scanned images of printed documents. A traditional regex-based extraction approach would have been difficult to maintain and highly sensitive to formatting changes.
Another important aspect involved token usage and inference cost optimization. Large attachments could quickly exceed practical context limits or generate unnecessary inference costs. To address this, the pipeline implemented selective page extraction, document chunking, attachment filtering, and response caching strategies. These optimizations reduced token consumption while maintaining extraction quality.
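Two of these optimizations, chunking and response caching, can be sketched as follows. This is an illustration under stated assumptions: character counts stand in as a rough proxy for tokens, the cache lives in memory, and `chunk_pages`, `ResponseCache`, and the `extract` callable are hypothetical names, not the project's code.

```python
import hashlib
from typing import Callable


def chunk_pages(pages: list[str], max_chars: int) -> list[list[str]]:
    """Group consecutive pages into chunks of at most max_chars characters.

    A single page longer than max_chars is kept as its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for page in pages:
        # Start a new chunk when adding this page would exceed the budget.
        if current and size + len(page) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(page)
        size += len(page)
    if current:
        chunks.append(current)
    return chunks


class ResponseCache:
    """Caches extraction results by a SHA-256 hash of the document bytes."""

    def __init__(self, extract: Callable[[bytes], str]) -> None:
        self._extract = extract
        self._store: dict[str, str] = {}

    def get(self, document: bytes) -> str:
        key = hashlib.sha256(document).hexdigest()
        # Call the (expensive) extraction only for unseen content.
        if key not in self._store:
            self._store[key] = self._extract(document)
        return self._store[key]
```

Keying the cache on content rather than on the URL means a report that is re-published unchanged under a new link does not trigger a second model call.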
Conclusion
Multimodal models such as Gemini are fundamentally changing how document-processing pipelines are designed within modern Data/ML platforms. Instead of maintaining separate OCR systems, PDF parsers, rule-based extractors, and custom heuristics, organizations can increasingly rely on unified AI-driven architectures capable of handling multiple document formats within a single workflow.
The combination of Gemini, Python notebooks in BigQuery Studio, structured validation with Pydantic, and BigQuery as the analytical backend enabled the rapid development of a scalable and flexible document-processing platform. The solution significantly reduced manual processing effort, accelerated onboarding of new report types, and created a much more maintainable approach to extracting structured business data from unstructured documents.