
LLM training data (Macao Social Comprehensive Dataset, 870 GB)
HK$39,999.00
Macao Social Comprehensive Dataset (1850–2024)
Overview:
The dataset covers many aspects of Macao society: local news, industry figures, current social systems and laws, academic research, culture and the humanities, and financial-center data. It includes text, images, audio, and video. The time span is 1850 to 2024. It is intended for training large language models (LLMs) and other AI algorithms, and supports natural language processing tasks such as text generation, knowledge question answering, and sentiment analysis.
Data format:
- Text files: Provided in .txt, .csv, and .json formats, covering structured and unstructured text for easy import into LLM training frameworks.
- Metadata: Detailed metadata, such as source, time, and author, is provided in .csv and .json formats.
- Annotated data: Some subsets include annotations, such as entity recognition and text classification labels, in .json or .xml format.
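As an illustration of how the text and metadata files might be combined, the following is a minimal Python sketch. The listing does not publish a schema, so the field names (`id`, `text`, `source`, `time`, `author`), the assumption that a .json file holds an array of records, and the helper names are all hypothetical; the actual package layout may differ.

```python
import csv
import io
import json

def load_json_records(fp):
    """Read a JSON array of record objects from an open file (assumed layout)."""
    records = json.load(fp)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records

def load_csv_metadata(fp, key="id"):
    """Index CSV metadata rows by a key column (hypothetical 'id' column)."""
    return {row[key]: row for row in csv.DictReader(fp)}

def attach_metadata(records, metadata, key="id"):
    """Return new records with any matching metadata fields merged in."""
    merged = []
    for rec in records:
        meta = metadata.get(str(rec.get(key)), {})
        # Fields in the record itself win over metadata fields of the same name.
        merged.append({**meta, **rec})
    return merged

# Tiny in-memory stand-ins for real .json and .csv files.
sample_json = io.StringIO('[{"id": "1", "text": "Sample article text."}]')
sample_csv = io.StringIO("id,source,time,author\n1,Example Daily,1999-12-20,Unknown\n")

records = attach_metadata(load_json_records(sample_json), load_csv_metadata(sample_csv))
print(records[0]["source"], records[0]["time"])
```

With real files, the `io.StringIO` objects would be replaced by `open(path, encoding="utf-8")` handles (and `newline=""` for the CSV file).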
Data collection and source:
The data is sourced from a range of authoritative resources in Macao, including:
- News archives: A collection of local Macao newspapers and news reports from 1850 to 2024, covering major political, social, and economic events.
- Industry figures: Biographical data on notable people in Macao from all walks of life, including important figures in finance, culture, and politics.
- Legal documents: Current laws and regulations, government announcements, and social systems in Macao, supporting legal and social research.
- Academic literature: Academic papers and research results from Macao, covering multiple disciplines.
- Cultural and humanities data: Material on Macao's cultural heritage, art reviews, and social changes, reflecting Macao's distinctive cultural landscape.
- Financial data: Data on the Macao financial center, such as economic reports and market indices, as a basis for financial research.
Data preprocessing and training methods:
- Preprocessing: The dataset is standardized through text cleaning, deduplication, sensitive-information filtering, and other steps to support data quality and compliance.
- Training methods: Prepared for mainstream Transformer-based LLMs such as GPT-style models. The data package includes a fine-tuning guide for specific applications such as chatbots and summarization.
- Data augmentation: The dataset is augmented with techniques such as paraphrasing, synonym replacement, and random sentence reordering to increase diversity in the training data.
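The listing does not describe how these steps are implemented. As a rough sketch only, text cleaning, exact-match deduplication, and random sentence reordering could look like the following in Python. The function names are hypothetical, the sentence splitter is deliberately naive, and sensitive-information filtering, paraphrasing, and synonym replacement are not shown.

```python
import hashlib
import random
import re
import unicodedata

def clean_text(text):
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(texts):
    """Drop empty texts and exact duplicates (after cleaning), keeping first occurrences."""
    seen = set()
    unique = []
    for text in texts:
        cleaned = clean_text(text)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if cleaned and digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique

def shuffle_sentences(text, seed=0):
    """Augment a passage by randomly reordering its sentences.

    Naive splitter: breaks after ., !, ? and their full-width forms,
    so it will also split decimals and abbreviations.
    """
    sentences = [s for s in re.split(r"(?<=[.!?。！？])\s*", text) if s]
    rng = random.Random(seed)  # seeded for reproducible augmentation
    rng.shuffle(sentences)
    return " ".join(sentences)

docs = ["Hello   world.", "Hello world.", "  ", "Other text."]
print(deduplicate(docs))
print(shuffle_sentences("One. Two. Three.", seed=1))
```

Hashing the cleaned text only catches exact duplicates; near-duplicate detection (for example, MinHash) would need a different approach.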
Update:
- Data updated through 2024: The dataset contains data up to 2024, so that models trained on it can reflect recent social, legal, and economic developments in Macao.
- Continuous update support: The dataset is updated regularly. Purchasers can obtain the latest patch packages by subscription to keep the data current.
Delivery process:
- Purchase: Users select and purchase data packages on the platform.
- Payment: After completing the payment, the user will receive a download link or data transfer instructions.
- Data delivery: Users can download data to a local storage device to complete data acquisition.
Release date:
September 19, 2024
March 19, 2025 (updated to the latest 870 GB version)
Update Package:
- Version control: Dataset versions are clearly tracked, and incremental update packages are provided as new data becomes available.
- Update frequency: Twice a year, or a customized update schedule based on user needs.