A table showing the base statistics for four datasets: ATB/no dustbin, no unknown; ATB/Dustbin; ATB/Salmonella enterica; and ATB/Escherichia coli. For each dataset, the table gives the size in GB, the number of references, and the number of k-mers. The ATB/no dustbin, no unknown dataset is the largest at 130.51 GB with over 1.8 million references, while the ATB/Dustbin dataset has the highest k-mer count at over 55 billion.
[Construction] Using an experimental pipeline that builds directly to m-Fulgor, we indexed 98% of the AllTheBacteria v0.2 dataset (all species, excluding unknown and dustbin) in 4 days. The resulting index takes just 130 GB for 1.8M genomes.
Figure: basic stats for some indexes. (4/6)