Design and Practice of Long-Term Sequential Grid Data Management Platform
Abstract:
With the rapid development of numerical weather prediction services, the resolution and forecast lead time of meteorological models have improved significantly, leading to exponential growth in the volume of forecast data output. As a national meteorological model research and operational centre, the CMA Earth System Modeling and Prediction Center (CEMC) currently produces 0.76 TB of gridded data per day, with annual output reaching 155.12 TB. Given these data volumes, researchers' preferences for data access are evolving. Wagemann predicts that future scientific users will increasingly prefer cloud platforms or other interfaces for data access rather than relying solely on downloads. To address these issues, this paper proposes a lightweight distributed parallel processing framework for gridded data management that aims to streamline data management and increase data access speed. The core design idea is to use search engine technology for rapid metadata retrieval and gridded data decoding techniques for efficient data acquisition. To avoid the performance penalty of repeated decoding, the framework decodes each gridded data file once and then supports multiple retrievals and extractions, which significantly accelerates data access. It also supports cross-platform data access, making data acquisition easier for researchers. The framework adopts a three-tier architecture: the data layer stores data, the algorithm layer implements the core search and cataloguing algorithms, and the business layer interfaces directly with user needs. The framework implements key functions such as gridded data cataloguing, extraction, and clipping. During cataloguing, users invoke the cataloguing interface with input parameters (e.g., original data file paths, index names, index types), and the system automatically parses the file metadata and generates indexes.
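The cataloguing step described above can be illustrated with a minimal sketch. It assumes that each decoded field is described by a small metadata dictionary and that the index is an inverted map from (field, value) pairs to record ids; the function names `catalogue` and `search` and the metadata keys `element`, `level`, and `lead_hour` are hypothetical and are not taken from the framework itself.

```python
from collections import defaultdict


def catalogue(records):
    """Build an inverted index mapping (field, value) -> set of record ids.

    `records` is a list of metadata dicts; the list position is the record id.
    """
    index = defaultdict(set)
    for rec_id, meta in enumerate(records):
        for field, value in meta.items():
            index[(field, value)].add(rec_id)
    return index


def search(index, **criteria):
    """Return the sorted ids of records that match every field=value criterion."""
    id_sets = [index.get((field, value), set()) for field, value in criteria.items()]
    if not id_sets:
        return []
    # A record matches only if it appears in the posting set of every criterion.
    return sorted(set.intersection(*id_sets))


# Hypothetical metadata for three gridded fields (illustrative values only).
records = [
    {"element": "t2m", "level": 0, "lead_hour": 24},
    {"element": "t2m", "level": 0, "lead_hour": 48},
    {"element": "u", "level": 500, "lead_hour": 24},
]
index = catalogue(records)
hits = search(index, element="t2m", lead_hour=24)
```

Because the index is built once at cataloguing time, each later query reduces to a few set intersections rather than a scan of the original files.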
For data extraction, users call the retrieval interface with specific parameters to obtain the designated data. The framework also supports precise extraction of data within specified latitude and longitude ranges by configuring cropping parameters. It reduces decoding time by creating indexes based on binary storage characteristics, uses an inverted index value-id model to locate data rapidly, improves processing performance through GlusterFS shared storage and Celery distributed message queues, and ensures efficient and stable data transmission by using gRPC for client/server (C/S) communication. Practical tests and applications demonstrate the framework's strong performance in handling massive meteorological data. Notably, it successfully processed petabyte-scale gridded data during the meteorological support services for the Beijing Winter Olympics, significantly improving data access efficiency. The framework also supports flexible processing of, and scalable upgrades for, various file formats to meet diverse user needs. By integrating search engine technology, gridded data decoding methods, and a distributed cluster framework, the platform enables rapid data retrieval and efficient access while also meeting researchers' demand for cross-platform data access. As meteorological data volumes continue to grow, the platform has the potential to provide more robust data support for weather forecasting, scientific research, and operational applications.
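The latitude and longitude clipping mentioned above can be sketched as follows. This is a minimal illustration, not the framework's implementation: it assumes an already decoded field stored as a list of rows on a regular grid whose latitude and longitude axes both ascend, and the names `axis_slice` and `clip` are hypothetical.

```python
import math


def axis_slice(start, step, count, lo, hi):
    """Return the index range [i0, i1) of axis points that lie within [lo, hi].

    Assumes a regular ascending axis: point i is at start + i * step, step > 0.
    """
    # Rounding guards against tiny floating-point errors in the division.
    i0 = max(0, math.ceil(round((lo - start) / step, 9)))
    i1 = min(count, math.floor(round((hi - start) / step, 9)) + 1)
    return i0, max(i0, i1)


def clip(grid, lat0, dlat, lon0, dlon, lat_range, lon_range):
    """Extract the sub-grid covering the requested latitude/longitude box."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    r0, r1 = axis_slice(lat0, dlat, n_rows, *lat_range)
    c0, c1 = axis_slice(lon0, dlon, n_cols, *lon_range)
    return [row[c0:c1] for row in grid[r0:r1]]


# Illustrative 4 x 5 field: rows at 30..33 degrees N, columns at 100..104 degrees E.
grid = [[r * 10 + c for c in range(5)] for r in range(4)]
subset = clip(grid, 30.0, 1.0, 100.0, 1.0, (31.0, 32.0), (101.0, 103.0))
```

Only index arithmetic is needed to locate the requested box, so the clipping cost depends on the size of the returned subset rather than on the full field. Grids stored with descending latitudes would need the row order reversed before applying this sketch.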