What this is
- EUCAIM aims to address the fragmentation of medical imaging data for AI-driven oncology research in Europe.
- It provides a centralized and federated infrastructure to enable standardized data sharing while ensuring compliance with and other regulations.
- The publication details the integration of imaging and clinical data from a reference hospital into the EUCAIM framework, highlighting procedural and ethical challenges.
Essence
- EUCAIM establishes a comprehensive infrastructure for the integration and sharing of oncological imaging data across Europe, enhancing AI-driven research while ensuring regulatory compliance.
Key takeaways
- EUCAIM integrates over 12,000 medical imaging studies from various projects, enhancing the availability of standardized datasets for AI applications in oncology.
- The achieved a processing efficiency of 98.6%, ensuring minimal data loss and compliance with , which is crucial for maintaining clinical relevance.
- The project emphasizes the importance of ethical and legal compliance, with a multi-step anonymization process that safeguards patient data throughout the integration process.
Caveats
- Challenges remain in achieving interoperability across diverse imaging formats and legacy systems, which may hinder data integration efforts.
- Ethics approvals often require significant revisions, complicating the data sharing process and potentially delaying project timelines.
Definitions
- GDPR: General Data Protection Regulation, a legal framework that sets guidelines for the collection and processing of personal information in the EU.
- ETL pipeline: Extraction, Transformation, and Loading pipeline, a process used to prepare data for analysis by extracting it from sources, transforming it into a suitable format, and loading it into a target database.
AI simplified
Introduction
Access to secondary use of medical imaging data for AI-driven research are hindered by fragmentation, lack of interoperability, and complex regulations across Europe regarding privacy and security issues. Although many research initiatives have created imaging repositories, these are often project-specific, temporally limited, and constrained in their potential for reuse. This is especially challenging in oncology, where AI tools for precise diagnosis, prognosis and treatment estimations require large, standardized and annotated datasets to reach full clinical applicability. – 1 3 , 4 5
Multiple initiatives have previously generated valuable imaging repositories, examples include some Horizon Europe projects providing temporal available datasets such as Predictive In-silico Multiscale Analytics to support cancer personalized diagnosis and prognosis, empowered by imaging biomarkers - PRIMAGE, Accelerating the lab to market transition of AI tools for cancer management - CHAIMELEON, Novel pan-European imaging platform for artificial intelligence advances in oncology - EuCanImage, An AI Platform integrating imaging data and models, supporting precision care through prostate cancer's continuum - ProCancer-I, and A multimodal AI-based toolbox and an interoperable health imaging repository for the empowerment of imaging analysis related to the diagnosis, prediction and follow-up of cancer - INCISIVE. And also some existing repositories such as(TCIA), which provides curated datasets but largely centred on disease-specific collections;, where datasets are created to support competitions and are rarely updated or expanded once each challenge is completed; and domain-focused initiatives such asin radiotherapy, which target specific clinical use cases and remain confined to their original scope. These projects have demonstrated the scientific value of shared imaging data, but they lack the harmonised governance, long-term sustainability mechanisms, and structured clinical context that are required for large-scale, reproducible AI research. EUCAIM complements these existing resources since it provides a hybrid federated–centralised infrastructure with common standards for anonymisation, metadata, clinical linkage and data quality, enabling continuous expansion of the atlas and supporting cross-border and GDPR compliant analysis at scale The Cancer Imaging Archive GrandChallenges DESIRE
Theaddresses these challenges by developing a large-scale, secure, and federated digital infrastructure for medical imaging and related clinical data. EUCAIM accelerates AI research in personalized medicine, offering tools that support clinical decision-making in complex settings. EUCAIM builds on thenetwork, which includes five major Horizon Europe projects (PRIMAGE, CHAIMELEON, EuCanImage, ProCancer-I, and INCISIVE). With over 90 partners and more than 180 stakeholders, EUCAIM is creating a GDPR-compliant, sustainable ecosystem for AI-based cancer research. European Federation for Cancer Images (EUCAIM) project AI for Health Imaging (AI4HI) 6 , 7 8
While EUCAIM offers a unique opportunity to harmonize and share annotated oncological imaging data, its success depends on overcoming technical barriers (standardization, transformation, anonymization, curation), strict compliance with the General Data Protection Regulation (), the European Health Data Space Regulation (), and the(AI Act), and with a clear governance frameworks. The EUCAIM platform aims to host over 60 million images from more than 100,000 patients, transforming medical data use in personalized actionable care. GDPR EHDSR Artificial Intelligence Act 9 10
EUCAIM's hybrid model allows Data Providers to transfer data to a reference node or set up their own federated node. This paper analyses the use case of real-world data transfer from a referral hospital, detailing workflows and practical solutions to secure legal compliant integration into EU-wide repositories.
Material and methods
Data integration in EUCAIM follows either a(DSA), where data stays within the institution via a federated node; or a(DTA), where data is physically moved to a reference node which comprises 10 Gigabyte R283-ZF0-AAL1 nodes, each equipped with two AMD EPYC 9474F 48-core processors (a total of 960 cores), 7.68 TB of RAM, 15 NVIDIA A30 and 10 NVIDIA L40S GPU accelerators with 48 GB of RAM each, as well as an additional storage server with 16 TB of NVMe SSD disks connected to the nodes via dual 25 GbE links. In the experiments performed in previous works, this capacity is sufficient to deal with a workload of over 30 concurrent users. Data Sharing Agreement Data Transfer Agreement 11
For the retrospective imaging data included in this study, no informed consent was required, an exemption of informed consent was submitted to and approved by the Ethics Committee. All these nodes are core to EUCAIM's infrastructure.
The choice between a DSA and a DTA reflects the heterogeneous technical capabilities and governance preferences of participating hospitals. Some institutions require that data processing remains fully under their control; for these cases, EUCAIM enables the deployment of a federated node within the hospital infrastructure, hosted on local servers and integrated with the EUCAIM platform through secure, encrypted channels. This environment operates as a Secure Processing Environment under the EHDS framework, allowing authorised users to run analytics and train models locally without any image data leaving the institution. EUCAIM provides technical specifications, containerised services, and remote support to assist hospitals in the installation and configuration of these nodes, although the hardware is typically procured and maintained by the institution according to its internal policies. Other hospitals opt for a DTA, transferring data to the EUCAIM reference node where harmonisation, storage and computation are centrally managed. While models trained within this environment may be exported subject to Access Committee approval, raw imaging data are never downloadable to external servers. This dual approach offers flexibility for centres with different resources and legal constraints while ensuring that all data, whether local or centralised, can be incorporated into cross-site analyses through a unified and privacy-preserving architecture.
Data extraction and exposure pipeline
To ensure standardized, de-identified, and compliant data integration into the EUCAIM infrastructure, a structured Extraction, Transformation, and Loading (ETL) pipeline was defined, incorporating best practices in data governance, security, and regulatory adherence. Imaging data were extracted from the hospital Picture Archiving and Communication System (PACS) through specific nodes in Digital Imaging and Communication in Medicine (DICOM) format according to each project's inclusion criteria. Corresponding clinical data were retrieved from the Electronic Health Record (EHR) system to maintain proper alignment.
The data preparation workflow includes a two-stage privacy-preserving process:
1. Local pseudonymisation:
Pseudonymisation is performed on a restricted-access Virtual Machine by authorized personnel of the Experimental Radiology and Imaging Biomarkers Platform (PREBI) using an in-house tool developed by the Biomedical Imaging Research Group (GIBI230). The process includes: replacement of PatientID, PatientName and AccessionNumber with specific pseudonym and hashes using Blake2b, renaming of folder structures to remove personally identifiable information (PII), removal of public and private DICOM metadata that may contain PII, exclusion of screenshots (ImageType = SCREEN SAVE) and manual review of secondary captures (DERIVED or SECONDARY). This step ensures consistent pseudonymised identifiers for repeated submissions of the same patient at the hospital level. The mapping between original and pseudonymised identifiers is maintained only temporarily by the hospital IT service and is not accessible externally.
2. EUCAIM anonymization:
Once pseudonymised data are transferred to the Pseudonymised Medical Imaging Repository, theis applied. This pipeline enforces theand generates a unique hash per patient, linked to the project and site, with no retained traceability. Sensitive DICOM headers are removed, and OCR-based pixel-level text detection can be applied to eliminate "burned-in" personal information.can be performed to prevent repeated inclusion of the same images across different projects. EUCAIM anonymization pipeline EUCAIM DICOM Anonymization Profile File integrity checks and pixel-level duplicate detection
This two-layer anonymization approach mitigates re-identification risks, aligning with procedures recommended by theand(). Spanish Data Protection Agency best practices from the Singaporean authority Figure 1
Additionally, after transferring it to Pseudonymised Medical Imaging Repository and anonymising the datasets, the DICOM File Integrity Checker developed by our group was used to detect corrupted or missing files. Data were finally ingested into EUCAIM by transferring them to the Reference Node through the, or by sharing them through a federated node. For data standardization, all DICOM files andwere validated for compliance with EUCAIM's structure and. QP-Insights API metadata interoperability standards
The data preparation process and the tools involved are defined in the. Additionally, the metadata of all datasets were registered in the, which follows the Health DCAT-AP standard and ensures compliance with FAIR principles at the dataset level. An additional layer of dataset discoverability is provided to EUCAIM Data Users through the, which enables them to perform queries based on specific criteria and retrieve the number of cases (as aggregated numerical results) that meet those criteria EUCAIM Handbook EUCAIM Public Catalogue Federated Query tool
Theprovides an open-source, fully integrated Data Lake, Registry, and Secure Virtual Research Environment, backed by dedicated computing resources. Its API enables the ingestion while logging all transfers for traceability and auditability. Federated nodes are fully integrated within the EUCAIM platform and meet the EHDSR requirements for Secure Processing Environments (SPEs). SPEs ensure that data processing stays local, under the health data holder's continuous control, significantly reducing transfer risks. Specialized mediation software enforces strict anonymization rules, restricts user actions, and logs all activities to prevent misuse and ensure full accountability. EUCAIM Federated Node
By leveraging SPEs and federated nodes, EUCAIM provides a robust, compliant, and auditable framework for the secure secondary use of health data. This rigorous ETL pipeline ensures scalability, data quality through the access to tools and workflows within EUCAIM, and legal compliance, positioning EUCAIM's reference and federated nodes as the backbone for integrating data from research projects, clinical trials, and future initiatives, in line with European regulations such as theand. This strategy guarantees EUCAIM's long-term sustainability and impact on AI research in medicine. 12 13 Data Act Data Governance Act

Steps for anonymization. Source: Adapted from AEPD. Guide to basic anonymization. Prepared by the National Data Protection Authority of Singapore (PDPC - Personal Data Protection Commission Singapore.
Documentation framework for data integration to EUCAIM
Ensuring compliant data integration within EUCAIM requires strict adherence to clear documentation protocols. At its current stage, decisions on the final integration of a dataset and/or federated node are verified by an Access Committee (AC). During this phase, it is mandatory to provide documentary evidence that guarantees accountability in EUCAIM's operation and demonstrates the application reliability.
For data transfers to EUCAIM, the following documents are considered to safeguard data integrity and regulatory compliance previous to the signature of the DTA:
For Data Sharing Agreements (DSAs) via federated nodes, the same requirements apply, with an additional Security Report or certification (e.g., ISO 27001) document verifying that data sharing meets EUCAIM's security standards for encryption, user authentication, and access control.
All documentation must be verifiable and may undergo audits. Federated nodes must show secure operations and technical interoperability with the EUCAIM platform. Likewise, datasets claimed to be anonymized will be checked: any re-identification risk means the data will be returned for further processing until fully compliant. This framework guarantees that all partners contribute data responsibly, supporting secure, transparent, and efficient data sharing within the EHDSR.
The relevant legal documentation can be summarized in Figure 2

Summary of the relevant legal documentation in EUCAIM.
Use case of data transfer from completed research projects
The PRIMAGE project (GA: 826494) demonstrated how historical datasets could be successfully transferred and validated within the EUCAIM reference node. The main steps included:
Use case of data transfer from oncology clinical trials
The previous approach was also applied to ongoing clinical trials, expanding EUCAIM's reference node cases. Additional challenges emerged, particularly regarding data ownership and sponsor agreements. Key strategies included:
Use case of data sharing from ongoing research projects
The CHAIMELEON project (GA: 952172) illustrated a continuous dataset sharing process without disrupting ongoing research, serving as a model for federated nodes. For these projects, a more dynamic workflow was required due to evolving data collection and regulatory requirements:
CHAIMELEON reached a high maturity level by preparing a robust DSA, anonymizing data, implementing a secure node, and completing a DPIA, setting a strong precedent for future compliant transfers.
Ethical and legal considerations
Each use case prioritizes full compliance with data protection regulations, especially GDPR. All data transferring and sharing activities were approved by the relevant ethics committees, ensuring confidentiality, integrity, and security. This rigorous approach has enabled the secure transfer and sharing of large volume of anonymized data, contributing to advanced AI research. Within EUCAIM, data use is strictly limited to users with approved research projects that meet institutional ethics standards and follow the Access Committee's protocols.
The recent approval of the EHDSR marks a significant milestone for EUCAIM and similar initiatives, providing a harmonized framework for secure access, sharing, and secondary use of large-scale medical datasets. This framework sets clear rules for research and innovation, supported by strict requirements for security, privacy, and informed consent, fostering an interoperable and sustainable data ecosystem. EUCAIM's activities have been progressively aligned with this evolving landscape, positioning as a pilot platform under. Its formal integration into the EHDS ecosystem will strengthen data accessibility and protection, reinforcing the project's long-term sustainability and scientific value, as data volume and diversity grow. HealthData@EU
EUCAIM addresses sustainability through the establishment of a European Digital Infrastructure (EDIC) to enable multi-country investments in large-scale projects. The proposed model combines national and node contributions with European funding to ensure long-term operation of the Central Hub and integration of new partners.
Results
The implementation of a dedicated, harmonized pipeline for data transfer and sharing enabled the successful ingestion of cancer imaging data into the EUCAIM infrastructure. This process established a replicable and scalable workflow suitable for future large-scale federated data integration.
In total, 12,484 medical imaging studies at our hospital and research institute were identified, reviewed, and prepared for integration into EUCAIM, covering both research datasets and real-world clinical trial data. Of these, 10,892 studies (87.24%) from 6,105 patients have been fully processed and validated for upload through either a reference or federated node approach. Our institution has contributed to several initiatives, including the completed PRIMAGE project with 878 studies, the ongoing CHAIMELEON project with 5,346 studies, and a series of Oncology Clinical Trials comprising 4,668 studies. The datasets include multiple modalities ((Computed Tomography - CT, Magnetic Resonance - MR, mammography, Positron Emission Tomography-Computed Tomography - PET-CT)) and represent a broad spectrum of cancer cases and patient populations (). Table 1
The ETL pipeline achieved a processing efficiency of 98.6%, with minimal data loss. Automated DICOM anonymization, combined with manual checks, ensured GDPR compliance while maintaining clinical relevance. PRIMAGE delivered one of EUCAIM's largest structured paediatric oncology datasets, setting a benchmark for rare disease integration. CHAIMELEON validated a stepwise model for harmonized data sharing, and the inclusion of ongoing clinical trial data demonstrated the feasibility of incorporating prospective datasets into the federated infrastructure.
Several challenges emerged that required refinements to optimize efficiency, scalability, and compliance. A key technical challenge was achieving interoperability across diverse imaging formats and legacy PACS metadata, which often required extensive transformation and custom mapping. Variability in clinical data mappings also highlighted the need for more automated and standardized processes.
Administrative and legal aspects added complexity. Ethics approvals often needed significant revisions depending on the project type, and requests for informed consent exemptions required careful justification and additional documentation. These barriers underscored the lesson learned: the EHDS will require organisations to renew their ethical and legal governance models to ensure predictable, efficient data sharing, including data holders, research infrastructures, and users.
Scalability was initially hindered by reliance on manual anonymization checks, slowing processing and introducing inconsistencies. Introducing automated validation tools and batch workflows significantly improved efficiency, though manual review remains necessary for nuanced cases like secondary capture images.
These insights guided refinements to our protocols. Automated metadata validation minimized errors, while standardized ethics templates helped accelerate approval timelines. Experience gained across projects emphasized the importance of early alignment with EUCAIM requirements, ensuring technical, compliance, and governance aspects are clear from the outset.
Despite initial hurdles, this process strengthened our capacity to securely transfer oncological imaging data with the highest standards of security, privacy, and interoperability, providing a solid foundation for future data-sharing efforts within EUCAIM and similar large-scale initiatives.
| Acronym | Project status | Number of identifiedimaging studies | Number of preparedimaging studies | Node |
|---|---|---|---|---|
| PRIMAGE | Completed | 878 | 878 | Reference |
| CHAIMELEON | Ongoing | 5,346 | 5,346 | Federated |
| Clinical Trials | Completed andongoing | 6,260 | 4,668 | Reference |
Discussion
EUCAIM is established as a landmark initiative in oncological imaging, offering Europe's first large-scale hybrid infrastructure for the secondary use of imaging, clinical, and molecular data. By integrating datasets from completed research, ongoing projects, and clinical trials, EUCAIM ensures alignment with the EHDSR and GDPR, thereby supporting the full potential of artificial intelligence in advancing predictive and personalized cancer care. Nevertheless, significant challenges remain, particularly data fragmentation, legacy system integration, and cross-border regulatory complexity.
A key barrier to building a unified repository has been the variability in data formats, metadata, and institutional governance practices. EUCAIM addresses this challenge through structured documentation frameworks and standardized ETL pipelines, which enhance data governance and promote interoperability across systems. Even so, harmonizing legacy PACS data and oncological metadata will continue to demand refinements and shared best practices.
From the outset, ethical and legal considerations have been central. Multi-step anonymization, robust audit logs, and clear governance models ensure compliance and traceability. However, achieving consistent ethics approvals and seamless cross-border data sharing still requires effort. The EHDS framework will be instrumental in easing these processes, but only if health data holders, research infrastructures, and users renew and align their governance models.
The lessons learned from large-scale data integration reinforce the value of early legal and technical alignment. PRIMAGE, CHAIMELEON and ongoing clinical trials each illustrate how proactive regulatory planning and clear data management frameworks enable scalable, compliant data sharing.
Long-term sustainability will rest on strict adherence to Findable, Accessible, Interoperable and Reusable (FAIR) principles, supported by reference and federated nodes, FAIR Data APIs, and persistent identifiers that guarantee data findability and reusability. Automating compliance checks and deploying secure cloud-based environments will be essential to handle ever-growing data volumes, while standardized metadata schemas will further support seamless reuse.
Through the continuously refinement of its governance strategies and proactive alignment with the evolving regulatory landscape under the EHDS, EUCAIM is positioned to become a cornerstone for AI-driven oncology research. Its robust infrastructure not only accelerates the clinical translation of AI tools but also strengthens a culture of trustworthy, collaborative data sharing across Europe, reinforcing EUCAIM's pivotal role in shaping the future of cancer imaging and precision medicine. Central to this effort is the essential contribution of data providers, whose engagement ensures the availability, diversity, and quality of datasets reinforcing EUCAIM in shaping the future of precision medicine.
Ethical approval statement
This manuscript has been approved by the Research Ethics Committee on Medicinal Products of La Fe University and Polytechnic Hospital under registration number 2022-437-1. For the retrospective imaging data included in this study, no informed consent was required, an exemption of informed consent was submitted to and approved by the Ethics Committee.