Generate arbitrary directories / files based on file count and depth

I would like to profile some VCS software, and for that I want to generate a set of random files spread over randomly nested directories. I'm writing a script in Python, but my question is brief: How do I create a random directory tree with a given average number of subdirectories per directory and a reasonable spread of files across the directories?

Clarification: I am not comparing different VCS repo formats (e.g. SVN vs Git vs Hg) but profiling software that handles SVN (and eventually other) working copies and repositories.

The limits I need to specify are the total number of files (call it "N", probably ~10k-100k) and the maximum depth of the directory structure ("L", probably 2-10). I don't care how many directories are created at each level, but I don't want to end up with 1 file per directory, or all 100k in one directory.

The spread is something I'm not sure about, since I don't know whether a VCS (SVN in particular) performs better or worse with a very uniform structure than with a very skewed one. However, it would be nice to have an algorithm that doesn't "flatten out" for large numbers.

My first thoughts were: generate a directory tree using some method and then populate the tree evenly with files (treating each directory equally, regardless of nesting). I calculated that if there are "L" levels, with "D" subdirectories per directory, and about sqrt(N) files per directory, then there are roughly D^L directories, so N ≈ sqrt(N) * D^L, which gives D ≈ N^(1/(2L)). So now I have an approximate value for "D"; how can I generate the tree? How do I fill in the files?

I would appreciate some pointers to good resources on algorithms I could use; my searches have only turned up applets and Flash demos.
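To make that concrete, here is a minimal sketch of what I have in mind (the names build_tree, "dir%d" and "file%d.txt" are just placeholders I made up): it computes D from N and L, builds a full D-ary tree, and then spreads the files round-robin over all directories.

import os

def build_tree(basedir, n_files, max_depth):
    # D ~ N^(1/(2L)): subdirectories to create per directory
    d = max(1, round(n_files ** (1.0 / (2 * max_depth))))

    # Build a full tree of depth max_depth with d subdirectories per level.
    dirs = [basedir]
    level = [basedir]
    for _ in range(max_depth):
        nxt = []
        for parent in level:
            for i in range(d):
                sub = os.path.join(parent, "dir%d" % i)
                os.makedirs(sub, exist_ok=True)
                nxt.append(sub)
        dirs.extend(nxt)
        level = nxt

    # Spread the files evenly (round-robin) over all directories.
    for i in range(n_files):
        target = dirs[i % len(dirs)]
        open(os.path.join(target, "file%d.txt" % i), "w").close()

    return dirs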



3 answers


Why not download some real open source repositories and use them?



Have you thought about what goes into the files? Is that random data as well?



Your question is quite long and involved, but I think it boils down to asking for a random number generator with certain statistical properties.

If you don't like Python's built-in random number generator, you can look at some of the other statistical packages on PyPI, or, if you want something a little heavier, perhaps the Python bindings for the GNU Scientific Library:



http://sourceforge.net/projects/pygsl/

http://www.gnu.org/software/gsl/
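For many cases the standard library already has enough distributions to shape the spread. As a rough illustration (not tied to either package above), here is how a uniform versus a skewed (log-normal) draw of per-directory file counts compare:

import random

random.seed(0)

# 1000 directories with uniformly drawn file counts vs. log-normally drawn ones.
uniform_counts = [random.randint(1, 20) for _ in range(1000)]
skewed_counts = [int(random.lognormvariate(2.0, 1.0)) for _ in range(1000)]

print(max(uniform_counts), sum(uniform_counts) / len(uniform_counts))
print(max(skewed_counts), sum(skewed_counts) / len(skewed_counts))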



I recently wrote a small Python package, randomfiletree, that generates a random file/directory structure. The code and documentation are at https://github.com/klieret/randomfiletree .

The algorithm walks the existing file tree and, in each subfolder, creates a number of files and directories drawn from a Gaussian with a given mean and width. This process is then repeated.

It basically does something like this (random_string here is a simplified stand-in for the package's name-generating helper):

import os
import random
import string
from pathlib import Path


def random_string(length=8):
    # Simplified stand-in for the package's helper: a random file/folder name.
    return "".join(random.choices(string.ascii_lowercase, k=length))


def create_random_tree(basedir, nfiles=2, nfolders=1, repeat=1,
                       maxdepth=None, sigma_folders=1, sigma_files=1):
    """
    Create a random set of files and folders by repeatedly walking through the
    current tree and creating random files or subfolders (the number of files
    and folders created is chosen from a Gaussian distribution).

    Args:
        basedir: Directory to create files and folders in
        nfiles: Average number of files to create
        nfolders: Average number of folders to create
        repeat: Walk this often through the directory tree to create new
            subdirectories and files
        maxdepth: Maximum depth to descend into current file tree. If None,
            infinity.
        sigma_folders: Spread of number of folders
        sigma_files: Spread of number of files
    Returns:
       (List of dirs, List of files), all as pathlib.Path objects.
    """
    alldirs = []
    allfiles = []
    for i in range(repeat):
        for root, dirs, files in os.walk(str(basedir)):
            for _ in range(int(random.gauss(nfolders, sigma_folders))):
                p = Path(root) / random_string()
                p.mkdir(exist_ok=True)
                alldirs.append(p)
            for _ in range(int(random.gauss(nfiles, sigma_files))):
                p = Path(root) / random_string()
                p.touch(exist_ok=True)
                allfiles.append(p)
            depth = os.path.relpath(root, str(basedir)).count(os.sep)
            if maxdepth and depth >= maxdepth - 1:
                # Stop os.walk from descending any deeper.
                del dirs[:]
    alldirs = list(set(alldirs))
    allfiles = list(set(allfiles))
    return alldirs, allfiles


      

This is a pretty quick and dirty approach, but I'm happy to develop the module further if there is interest.
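Usage is then something like the following sketch (the temporary target directory and the parameter values are just examples):

import tempfile

# Create a scratch directory and grow a random tree inside it.
target = tempfile.mkdtemp()
dirs, files = create_random_tree(target, nfiles=5, nfolders=2,
                                 repeat=3, maxdepth=4)
print(len(dirs), "directories,", len(files), "files created under", target)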







