Data platform requirements and expectations
By David WORMS
Mar 23, 2023
A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources.
It is composed of several components that work together in a secured and governed platform. As such, a big data platform must meet a variety of requirements to ensure that it can handle the diverse and evolving needs of the organization.
Note that, due to the extensive nature of the domain, it is not feasible to provide a comprehensive and exhaustive list of requirements. We invite you to contact us to share additional enhancements.
Data ingestion
This area covers the ingestion of data from various sources, its processing, and its storage in a suitable format.
-
Data sources
Ability to consume data from various sources including databases, file systems, APIs, and data streams.
-
Ingestion mode
Ability to consume data in both batch and streaming modes, as illustrated in the sketch after this list.
-
Data format
Support for reading and writing file formats and table formats such as JSON, CSV, XML, Avro, Parquet, Delta Lake and Iceberg.
-
Data quality
Definition of the quality requirements for the data, such as completeness, accuracy, and consistency, ensuring that the ingestion pipeline can validate and cleanse the data as needed.
-
Data transformation
Determine whether the data needs to be transformed or enriched before it can be stored or analyzed.
-
Data Availability
Ensure that the ingestion pipeline can handle failures or outages of the data sources or the ingestion pipeline itself, and can recover and resume ingestion without data loss.
-
Volume
Provide solutions capable of addressing expected volume and throughput variations.
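To make these requirements more concrete, here is a minimal PySpark sketch of the two ingestion modes; the landing directory, Kafka broker, topic name, and lake paths are assumptions for the example, not prescriptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch ingestion: read a daily CSV drop from a (hypothetical) landing
# directory and persist it as Parquet in the raw zone of the lake.
batch_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/landing/sales/2023-03-23/")
)
batch_df.write.mode("append").parquet("/lake/raw/sales/")

# Streaming ingestion: consume the same business events from a Kafka topic
# and append them continuously, tracking progress in a checkpoint directory
# so that ingestion can resume without data loss after a failure.
stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "sales-events")
    .load()
)
query = (
    stream_df.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream
    .format("parquet")
    .option("path", "/lake/raw/sales_stream/")
    .option("checkpointLocation", "/lake/checkpoints/sales_stream/")
    .start()
)
```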
Data storage
This area includes the storage, management, and retrieval of large volumes of data.
-
Availability
The ability to access the data reliably and with minimal downtime, ensuring high availability of the data.
-
Durability
The ability to ensure data is not lost due to hardware failures or other errors, with data replication and backup strategies in place.
-
Performance
The ability to store and retrieve data quickly and efficiently, with low latency and high throughput.
-
Elasticity
Storage and management of growing volumes of data, with the ability to scale up and down as needed by acquiring and releasing additional resources.
-
Data lifecycle
Management of the data lifecycle, with the ability to apply changes, backfill missing data, and revert to a previous version of a dataset, as in the sketch after this list.
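As an illustration of this last point, here is a short sketch of lifecycle management with Delta Lake time travel; the table path, schema, and session configuration are assumptions for the example.

```python
from pyspark.sql import SparkSession

# Requires the delta-spark package; the two settings below enable the
# Delta Lake extensions on the session.
spark = (
    SparkSession.builder.appName("lifecycle-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/lake/silver/customers"

# Version 0: initial load of the table.
spark.createDataFrame(
    [(1, "alice", "FR"), (2, "bob", "BE")], ["id", "name", "country"]
).write.format("delta").mode("overwrite").save(path)

# Version 1: a later batch appends newly received records.
spark.createDataFrame(
    [(3, "carol", "LU")], ["id", "name", "country"]
).write.format("delta").mode("append").save(path)

# Read the latest snapshot of the table...
current = spark.read.format("delta").load(path)

# ...or go back to a previous version for audit or rollback purposes.
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
```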
Data processing in the data lake
This area includes the processes for preparing and exposing the data for further analysis.
-
Flexibility
Ability to support multiple data types and formats and ability to integrate with various distributed data processing and analysis tools.
-
Data cleaning
Cleanse the data to remove or correct errors, inconsistencies, and missing values.
-
Data integration
Combine and integrate multiple data sources into a single dataset, resolving any schema or format differences.
-
Data transformation
Transform the data to prepare it for downstream processing or analysis, such as aggregating, filtering, sorting, or pivoting.
-
Data enrichment
Enhance the data with additional information to provide more context and insights.
-
Data reduction
Reduce the volume of data by summarizing or sampling it, while preserving the essential characteristics and insights.
-
Data normalization and denormalization
Normalize the data to remove redundancies and inconsistencies, ensuring that it is stored in a consistent format, and denormalize it where needed to improve performance. A combined sketch of these processing steps follows this list.
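Below is a minimal pandas sketch combining several of these steps (cleaning, integration, denormalization, reduction) on two illustrative extracts; the column names and values are invented for the example.

```python
import pandas as pd

# Illustrative raw extracts; in a real pipeline these would come from the lake.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": [10, 11, 11, 12],
    "amount": [120.0, None, 80.0, 35.5],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "country": ["FR", "BE", "fr"],
})

# Data cleaning: drop duplicated orders and fill missing amounts.
orders = orders.drop_duplicates(subset="order_id", keep="last")
orders["amount"] = orders["amount"].fillna(0.0)

# Data normalization: harmonize country codes to a single representation.
customers["country"] = customers["country"].str.upper()

# Data integration and denormalization: join the two sources into a single
# dataset that is easier to query, at the cost of some redundancy.
enriched = orders.merge(customers, on="customer_id", how="left")

# Data reduction: aggregate to a per-country summary for downstream analysis.
summary = enriched.groupby("country", as_index=False)["amount"].sum()
print(summary)
```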
Data observability
This area is the practice of monitoring and managing the quality, integrity, and performance of data as it flows through the platform.
-
Data validation
Ensuring that the data is valid, accurate, and consistent, and meets the expected format and schema, as in the sketch after this list.
-
Data lineage
Tracking the path of data as it flows through the system to identify any issues or anomalies.
-
Data quality monitoring
Continuously monitoring the quality of data and raising alerts when anomalies or errors are detected.
-
Performance monitoring
Monitoring the performance of the system, including latency, throughput, and resource utilization, to ensure that the system is performing optimally.
-
Metadata management
Managing the metadata associated with the data, including data schema, data dictionaries, and data catalog, to ensure that it is accurate and up-to-date.
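As a simple illustration of data validation and quality monitoring, here is a plain pandas sketch; the expected schema, thresholds, and input path are assumptions for the example, and a real platform would typically load them from a data contract or a catalog entry.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return the list of data quality issues found in an orders extract."""
    issues = []

    # Schema check: required columns and expected types.
    expected = {"order_id": "int64", "amount": "float64", "country": "object"}
    for column, dtype in expected.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            issues.append(f"unexpected type for {column}: {df[column].dtype}")

    # Completeness check: no more than 1% missing values per column.
    for column in df.columns:
        ratio = df[column].isna().mean()
        if ratio > 0.01:
            issues.append(f"{column}: {ratio:.1%} missing values")

    # Consistency checks: unique keys and non-negative amounts.
    if "order_id" in df.columns and df["order_id"].duplicated().any():
        issues.append("duplicated order_id values")
    if "amount" in df.columns and (df["amount"] < 0).any():
        issues.append("negative amounts")

    return issues

# In an orchestrated pipeline, a non-empty report would raise an alert
# instead of silently letting the data flow downstream.
report = validate(pd.read_parquet("/lake/raw/orders/"))
```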
Data usage
This area includes the requirements to access, transfer, analyze and visualize the data to extract insights and actionable information.
-
User interfaces
CLI environments and graphical interfaces available to users for data processing and visualization.
-
Communication Interfaces
Provision of data access via REST, RPC and JDBC/ODBC communication protocols.
-
Data mining
Perform exploratory data analysis to understand data characteristics and quality, extract patterns, relationships, or insights from the data, using statistical or machine learning algorithms.
-
Data access
Ensure that the data is secure and protected from unauthorized access or breaches, by implementing appropriate security controls and protocols.
-
Data Visualization
Visualize the data to communicate insights and findings to stakeholders, using charts, graphs, or other visualizations, as in the sketch after this list.
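To give an idea of how these requirements translate into practice, here is a small Python sketch that pulls an aggregate through a hypothetical REST endpoint and renders it as a chart; the URL, authentication scheme, and payload shape are all assumptions.

```python
import requests
import matplotlib.pyplot as plt

# Hypothetical REST endpoint exposed by the platform's query service.
response = requests.get(
    "https://data-platform.example.com/api/v1/sales/summary",
    headers={"Authorization": "Bearer <token>"},
    params={"group_by": "country"},
    timeout=30,
)
response.raise_for_status()
rows = response.json()  # e.g. [{"country": "FR", "amount": 1200.5}, ...]

# Simple visualization of the result for stakeholders.
countries = [row["country"] for row in rows]
amounts = [row["amount"] for row in rows]
plt.bar(countries, amounts)
plt.title("Sales by country")
plt.ylabel("Amount")
plt.show()
```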
Platform Security and Operation
This area covers the security and the management of a big data platform.
-
Data regulation and compliance
The ability to ensure compliance with data governance policies and regulations, such as data privacy laws, data usage practices, data retention policies, and data access controls.
-
Fine-grained access control
Ability to control access and data sharing across all exposed services, with management policies that take into account the characteristics and specificities of each service.
-
Data filtering and masking
Filtering of data by row and by column, and application of masks on sensitive data, as in the sketch after this list.
-
Encryption
Encryption at rest and in transit with SSL/TLS.
-
Integration into the information system
Integration of users and user groups with the corporate directory.
-
Security perimeter
Isolation of the platform in the network and centralization of access through a single entry point.
-
Admin interface
Provision of a graphical interface for the configuration and monitoring of services, the management of data access controls and the governance of the platform.
-
Monitoring and alerts
Exposing metrics and alerts that monitor and ensure the health and performance of the various services and applications.
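The following sketch shows the intent behind row filtering and column masking in plain Python on pandas; the roles and policies are invented for the example, and a production platform would enforce such policies centrally (for instance with a tool like Apache Ranger) rather than in application code.

```python
import pandas as pd

# Illustrative access policies: which rows a role may see and which columns
# must be masked for it.
ROW_POLICIES = {"analyst_fr": lambda df: df[df["country"] == "FR"]}
MASKED_COLUMNS = {"analyst_fr": ["email"]}

def apply_policies(df: pd.DataFrame, role: str) -> pd.DataFrame:
    """Return a view of the data filtered by row and masked by column."""
    # Default deny: an unknown role sees no rows at all.
    view = ROW_POLICIES.get(role, lambda d: d.iloc[0:0])(df).copy()
    for column in MASKED_COLUMNS.get(role, []):
        if column in view.columns:
            # Replace sensitive values with a fixed mask.
            view[column] = "****"
    return view

customers = pd.DataFrame({
    "name": ["alice", "bob"],
    "email": ["alice@example.com", "bob@example.com"],
    "country": ["FR", "BE"],
})
print(apply_policies(customers, "analyst_fr"))
```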
Hardware and maintenance
This area covers the acquisition of new resources as well as the maintenance requirements.
-
Targeted infrastructure
Selection between a cloud or an on-premises infrastructure, taking into account that the cloud offers flexible and scalable storage and processing of large datasets with cost efficiencies, while an on-premises deployment provides greater control, security, and compliance over data but requires a significant upfront investment and ongoing maintenance costs.
-
Asymmetrical architecture
Separation of the resources dedicated to storage from those dedicated to processing and, in some circumstances, colocation of processing with the data.
-
Storage
Provision of a storage infrastructure in line with the stated data volumes.
-
Compute
Provision of a computing infrastructure capable of evolving with future usages brought by projects and users in the fields of data engineering, data analysis and data science.
-
Cost-effectiveness
The ability to store and manage data cost-effectively, with consideration of the cost of storage and the cost of managing and operating the storage solution.
-
Cost management and total cost of ownership (TCO)
Control and calculation of the total cost of the solution, taking into account all the factors and specificities of the platform such as infrastructure, staff, license acquisition, deadlines, usage, team turnover, technical debt, … An illustrative TCO calculation follows this list.
-
User support
Support for platform users with the aim of ensuring the acquisition of new skills for the teams, the validation of the architecture choices, the deployment of patches and features, and the proper use of the available resources.
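As a closing illustration, here is a back-of-the-envelope TCO comparison; every figure is a placeholder meant to show the structure of the calculation, not a real price or a recommendation.

```python
# Illustrative three-year TCO comparison between a cloud and an on-premises
# deployment. All amounts are placeholders.
YEARS = 3

cloud = {
    "compute_and_storage_per_year": 180_000,
    "managed_services_per_year": 40_000,
    "staff_per_year": 150_000,
}
on_premises = {
    "hardware_and_licenses_upfront": 350_000,
    "datacenter_and_support_per_year": 60_000,
    "staff_per_year": 220_000,
}

cloud_tco = YEARS * sum(cloud.values())
on_premises_tco = (
    on_premises["hardware_and_licenses_upfront"]
    + YEARS * (on_premises["datacenter_and_support_per_year"]
               + on_premises["staff_per_year"])
)

print(f"Cloud TCO over {YEARS} years: {cloud_tco:,}")
print(f"On-premises TCO over {YEARS} years: {on_premises_tco:,}")
```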
Conclusion
Overall, a big data platform must be able to handle the diverse and evolving needs of the organization, while ensuring that the solution is highly flexible, resilient, and performant, that data is secure, compliant, and of high quality, that insights and findings are communicated effectively across the various stakeholders, and that it remains cost-effective to operate over time.