
LLM Training Data (Hong Kong Social Comprehensive Category, 900 GB)
HK$39,999.00
Product Name:
Hong Kong Social Comprehensive Dataset (1850–2024)
Overview:
This dataset is a carefully curated collection covering multiple areas of Hong Kong society, including local news, industry figures, the legal system, academic research, the humanities, and finance. It comprises text, images, audio, video, and other data types, and spans more than 170 years (1850–2024). It provides rich resources for training large language models (LLMs) and other AI algorithms, and is suitable for tasks such as text generation, sentiment analysis, and knowledge retrieval.
Data format:
- Text files: Structured and unstructured text in .txt, .csv and .json formats for easy integration into LLM training frameworks.
- Metadata: Publication date, author information, and source details in .csv and .json formats.
- Annotations: Pre-annotated datasets for natural language processing tasks including entity recognition and topic classification (in .json or .xml format).
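As a rough illustration of how the metadata and annotation files described above might be combined, here is a minimal Python sketch. The listing does not document the actual schema, so every field name (`doc_id`, `publication_date`, `topic`, `entities`) and both sample records are assumptions made up for this example, not the vendor's format.

```python
import csv
import io
import json

# Hypothetical samples: the real schema is not documented in this listing,
# so all field names and values below are assumptions for illustration only.
METADATA_CSV = """doc_id,publication_date,author,source
a001,1997-07-01,Unknown,Example Daily
a002,2024-03-15,Staff Writer,Example Post
"""

ANNOTATIONS_JSON = """[
  {"doc_id": "a001", "topic": "politics",
   "entities": [{"text": "Hong Kong", "label": "LOC"}]},
  {"doc_id": "a002", "topic": "finance",
   "entities": [{"text": "Hang Seng Index", "label": "MISC"}]}
]"""


def load_metadata(csv_text):
    """Parse metadata CSV text into a dict keyed by doc_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["doc_id"]: row for row in reader}


def load_annotations(json_text):
    """Parse annotation JSON text into a dict keyed by doc_id."""
    return {rec["doc_id"]: rec for rec in json.loads(json_text)}


def join_records(metadata, annotations):
    """Merge metadata and annotations for documents present in both."""
    merged = []
    for doc_id in sorted(metadata.keys() & annotations.keys()):
        merged.append({**metadata[doc_id], **annotations[doc_id]})
    return merged


records = join_records(load_metadata(METADATA_CSV),
                       load_annotations(ANNOTATIONS_JSON))
for rec in records:
    print(rec["doc_id"], rec["publication_date"], rec["topic"],
          len(rec["entities"]))
```

With real files, the in-memory strings would be replaced by file reads. Joining on a shared identifier is only one plausible layout; the delivered files may link records differently.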
Data collection and source:
This dataset is collected from authoritative sources including:
- News Archive: Local newspapers and media covering political, social and economic events from 1850 to 2024.
- Industry Figures: Biographical data on key figures in various industries in Hong Kong, including business, finance and politics.
- Legal Documents: The latest Hong Kong laws, regulations and government announcements, providing legal and social background information.
- Academic Collection: Academic articles and research reports from Hong Kong universities and think tanks.
- Humanities and cultural data: Humanities texts, art reviews and social trends reflecting the cultural development of Hong Kong.
- Financial Data: Historical and real-time data from Hong Kong's financial sector, including stock market indices and economic reports.
Data preprocessing and training methods:
- Pre-processing: Data is rigorously cleansed, normalized, and tokenized. Sensitive information is filtered out to comply with privacy regulations.
- Training methods: Optimized for current transformer-based LLM architectures such as GPT. The dataset includes fine-tuning instructions for specific use cases such as chatbot development, summarization, and sentiment analysis.
- Augmentation Techniques: To improve robustness, the dataset also applies augmentation techniques such as paraphrasing, synonym replacement, and sentence rearrangement.
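To make the pre-processing and augmentation steps above more concrete, the following Python sketch shows one simple way such steps could look: Unicode and whitespace normalization, masking of email addresses as an example of sensitive-information filtering, and sentence rearrangement. It is not the pipeline used to build this dataset. The function names, the `[EMAIL]` placeholder, and the naive sentence splitter are all assumptions chosen for illustration.

```python
import random
import re
import unicodedata

# Simple email pattern; real sensitive-data filtering would cover far more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(text):
    """Apply NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def mask_emails(text):
    """Replace email addresses with a placeholder token (assumed format)."""
    return EMAIL_RE.sub("[EMAIL]", text)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def rearrange_sentences(text, seed=0):
    """Return the text with its sentences shuffled, reproducibly via seed."""
    sentences = split_sentences(text)
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)


sample = "Markets opened higher.  Contact a.b@example.com for details. Trading was calm."
cleaned = mask_emails(normalize(sample))
print(cleaned)
print(rearrange_sentences(cleaned, seed=1))
```

A fixed seed keeps the shuffled output reproducible, which helps when comparing training runs. Paraphrasing and synonym replacement usually need a language model or a thesaurus, so they are left out of this sketch.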
Update:
- 2024 Update: The dataset contains the latest data from 2024, ensuring that the models trained with this dataset can reflect the latest legal, economic, and social environment in Hong Kong.
- Continuous Update Support: Regular updates are provided to ensure that the dataset keeps pace with the evolving social landscape of Hong Kong. Updates are available to purchasers through subscription or direct download.
Delivery process:
- Purchase: Users can select this dataset on the platform.
- Payment: Complete the transaction through a secure payment process.
- Delivery: After payment is confirmed, the user will receive a download link or data transfer instructions. The delivery method will be tailored to the user's storage device.
Release date:
September 19, 2024
Update Package:
- Version control: Releases are versioned, and update packages containing new data are provided.
- Update frequency: Update packages will be released every six months, or upon request from premium subscribers.