Skip to main content

Dataset Detection on File Systems

Introduction

The purpose of this document is to detail the rules applied in File System-type Zeenea connector to detect Datasets.

The algorithm checks all the objects from the root path and identifies if they are datasets. Once a folder is identified as a dataset, the algorithm stops its treatment and goes to the next folder.

Folder Containing Only Files

Rule 1

A folder is a dataset when it only contains files and when at least one file has a supported extension.

Supported extensions are: csv, parquet, orc, xml, json, avro

* The "Client" folder is a dataset (rule 1).
* The folder contains only files, and at least one of them has a supported extension.
* Zeenea will extract the Schema from the most recent file (in this case, Client20190827.csv).

* The "Project" folder is considered a unique dataset (rule 1).
* The folder contains only files, and at least one of them has a supported extension.
* Zeenea will extract the schema from the most recent file.
* If the files are not homogenous, the documented schema may then change upon the analysis treatment.

Folder with Subfolders

Rule 2

When a folder contains a sub-folder, whose name doesn't follow partition naming conventions, that folder is not a dataset.

Rule 3

A file may be a dataset when it is within a folder containing sub-folders. The file must however possess a supported extension.

* The "Client" folder is not a dataset (rule 2).
* It contains subfolders PP and PM, whose names do not follow dataset naming convention.
* Sub-folders PP and PM are, however, datasets (rule 1). They only contain files.


* The "Client" folder is not a dataset (rule 2).
* It contains subfolder PP, whose name does not follow dataset naming convention.
* The "PP" folder is a dataset (rule 1). It only contains files.
Files "Client20190225.csv" and "Client20190226.csv" are datasets (rule 3).
* Files with a supported extension.

Folder with Partitions

Rule 4

A folder is a dataset when:

  • All its subfolders' names follow the partition naming convention.
  • At least one of its subfolders would be considered as a dataset had it been isolated.
* The "Client" folder is a dataset (rule 4).
* "Client" contains subfolders "2019" and "2018", whose names follow the dataset naming convention.
* "2019" would have been a dataset, had it not been contained within "Client" (rule 3).
* "2019" contains subfolders "05" and "08", whose names follow the dataset naming convention.
* "08" would have been a dataset, had it not been contained within "2019"
* "08" only contains files, of which at least has a supported extension.
* The "Client" Folder is not a dataset (rule 2).
* It contains subfolders "PP" and "2019", whose names do not follow the dataset naming convention.
* Sub-Folder 2019 is a dataset (rule 4). Its subfolder would be considered a dataset had it been isolated.
* Sub-folder "PP" is a dataset (rule 1). It only contains files.





Partition Naming Convention

Subfolders are considered partitions if their names match this regular expression:

"(.*=.*)", "[0-9]{8}", "[0-9]{4}", "[0-9]{2}", "0?[1-9]|1[012]", "0?[1-9]|1[0-9]|2[0-9]|3[0-1]"