Richard Wong 1f3970459f Chore: re-organized train folders to have standardized naming schemes Feat: introduced BERT-based binary classification		2024-11-20 15:07:47 +09:00
abbreviations	Chore: re-organized train folders to have standardized naming schemes	2024-11-20 15:07:47 +09:00
check_data	Feat: added abbreviation expansion rules	2024-11-10 20:28:47 +09:00
exports	Chore: changed ipynb to py files in the data_preprocess folder	2024-10-29 22:55:22 +09:00
no_preprocess	Chore: changed ipynb to py files in the data_preprocess folder	2024-10-29 22:55:22 +09:00
rule_base_replacement	Chore: changed ipynb to py files in the data_preprocess folder	2024-10-29 22:55:22 +09:00
.gitignore	Chore: changed ipynb to py files in the data_preprocess folder	2024-10-29 22:55:22 +09:00
README.md	Chore: changed ipynb to py files in the data_preprocess folder	2024-10-29 22:55:22 +09:00
split_data.py	Feat: added train and test directories	2024-10-31 15:58:20 +09:00

README.md

Data Preprocess

What is this folder

This folder contains the code for data pre-processing.

Each pre-processing method lives in its own folder. Keeping the methods modular makes it easier to test them independently and reduces coupling between stages.

Instructions

First, apply a pre-processing method by running the code in its folder.

Using the no_preprocess directory as an example:

1. cd no_preprocess
2. Follow the instructions in that sub-directory.
3. After the code has run, the processed file is written to exports/preprocessed_data.csv

Next, run the data split script to create the k-fold splits.

1. cd back to the data_preprocess directory
2. python split_data.py

The datasets are now available in exports/dataset/group_{1,2,3,4,5}
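
For readers unfamiliar with k-fold splitting, the sketch below shows one way a five-group layout like the one above could be produced from exports/preprocessed_data.csv. It is an illustration under assumptions, not the contents of split_data.py: the function names kfold_indices and write_groups, the file names train.csv and test.csv, the shuffling seed, and the presence of a header row in the CSV are all hypothetical.

```python
import csv
import random
from pathlib import Path


def kfold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs for k shuffled folds.

    Hypothetical helper; the real split_data.py may split differently.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one row.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # The training set is every fold except the held-out one.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def write_groups(input_csv, output_dir, k=5, seed=42):
    """Write group_1 .. group_k, each with an assumed train.csv and test.csv."""
    with open(input_csv, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes the first row is a header
        rows = list(reader)

    splits = kfold_indices(len(rows), k, seed)
    for group_no, (train_idx, test_idx) in enumerate(splits, start=1):
        group_dir = Path(output_dir) / f"group_{group_no}"
        group_dir.mkdir(parents=True, exist_ok=True)
        for name, idx in (("train.csv", train_idx), ("test.csv", test_idx)):
            with open(group_dir / name, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(header)
                writer.writerows(rows[j] for j in idx)
```

Under these assumptions, calling write_groups("exports/preprocessed_data.csv", "exports/dataset") from the data_preprocess directory would create group_1 through group_5, with every row appearing in exactly one group's test file.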