Dealing with the Hugging Face datasets cache

Good documentation is essential.

Problem I was facing

I was hitting OSError: [Errno 28] No space left on device when running my custom tokenize function over a ~18 MB JSON dataset, using Accelerate on 8 A100 (40GB) GPUs.

Here’s some demo code for it:

from datasets import load_dataset

# `cache_dir` is the path to my raw ~18 MB JSON file
raw_datasets = load_dataset('json', data_files=cache_dir)
if "validation" not in raw_datasets.keys():
    raw_datasets["validation"] = load_dataset(
        'json',
        data_files=cache_dir,
        split=f"train[:{args.validation_split_percentage}%]",
    )
    raw_datasets["train"] = load_dataset(
        'json',
        data_files=cache_dir,
        split=f"train[{args.validation_split_percentage}%:]",
    )

# after loading the tokenizer
def my_own_tokenize_function(examples):
    # placeholder for the actual tokenization logic
    pass

# `accelerator` is the Accelerate object created earlier in the script
with accelerator.main_process_first():
    train_dataset = raw_datasets["train"]
    eval_dataset = raw_datasets["validation"]
    train_dataset = train_dataset.map(  # BOOOOOOM!!!!!
        my_own_tokenize_function,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        load_from_cache_file=not args.overwrite_cache,
        desc="Running tokenizer on train dataset",
        batch_size=1,
        writer_batch_size=1,
    )
    eval_dataset = eval_dataset.map(
        my_own_tokenize_function,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        load_from_cache_file=not args.overwrite_cache,
        desc="Running tokenizer on eval dataset",
        batch_size=1,
        writer_batch_size=1,
    )

How I found the solution

It took me some time to find out where exactly Hugging Face datasets stores its cache.

At first, I thought simply setting the environment variable HF_DATASETS_CACHE to a path of my choice would solve this. But it didn’t: this variable only controls where the raw JSON file (the dataset before my tokenize function) gets cached.
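For the record, what I tried looked roughly like this (the path /data/hf_cache is just an example):

import os

# HF_DATASETS_CACHE must be set before `datasets` is imported to take effect
os.environ["HF_DATASETS_CACHE"] = "/data/hf_cache"  # example path

from datasets import load_dataset

# The raw JSON cache now lands under /data/hf_cache,
# but the .map() results still go elsewhere
raw_datasets = load_dataset('json', data_files=cache_dir)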

And after checking /tmp on my server, I realized that datasets stores the processed dataset as .arrow files under /tmp/hf_datasets-* when cache_file_name=None, and those files were far too big for my server’s /tmp partition.
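You can see where the Arrow files for a split actually live by printing its cache_files attribute (output below is just an illustration):

# each entry is a dict with a 'filename' key pointing at an .arrow file
print(raw_datasets["train"].cache_files)
# e.g. [{'filename': '/tmp/hf_datasets-xxxxxx/cache-xxxxxx.arrow'}]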

Then I tried adding keep_in_memory=True, but that didn’t work either: I was running 8 processes, so each one held its own copy of the tokenized dataset in RAM, filling up the server’s memory and erroring out.
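For completeness, that failed attempt was roughly:

# keep_in_memory=True skips the on-disk cache, but every Accelerate process
# then holds its own full copy of the tokenized dataset in RAM
train_dataset = train_dataset.map(
    my_own_tokenize_function,
    batched=True,
    num_proc=args.preprocessing_num_workers,
    keep_in_memory=True,
)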

Finally, I checked the documentation of datasets and found this for the .map() function:

  • cache_file_name (str, optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.

Now I knew how to tackle the problem.

Final solution

So, my solution was to set cache_file_name to a path on a disk with more free space, and to drop keep_in_memory=True. As a bonus, the cached results can be reused to skip preprocessing next time.
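A minimal sketch of the fix (the cache paths are just examples; point them at whatever disk has room):

with accelerator.main_process_first():
    train_dataset = raw_datasets["train"].map(
        my_own_tokenize_function,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        load_from_cache_file=not args.overwrite_cache,
        # write the Arrow cache to a disk with enough space instead of /tmp
        cache_file_name="/data/hf_cache/tokenized_train.arrow",
        desc="Running tokenizer on train dataset",
    )
    eval_dataset = raw_datasets["validation"].map(
        my_own_tokenize_function,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        load_from_cache_file=not args.overwrite_cache,
        cache_file_name="/data/hf_cache/tokenized_eval.arrow",
        desc="Running tokenizer on eval dataset",
    )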

Just remember to update (or delete) the cache files whenever the raw dataset changes, since an existing cache file will be reused as-is.