= GenericIO =

GenericIO is a write-optimized library for writing self-describing scientific data files on large-scale parallel file systems.

== References ==

Habib, et al., HACC: Simulating Future Sky Surveys on State-of-the-Art Supercomputing Architectures, New Astronomy, 2015
[http://arxiv.org/abs/1410.2805].

== Source Code ==

A source archive is available here: [http://www.mcs.anl.gov/~turam/genericio/genericio-20190417.tar.gz genericio-20190417.tar.gz] (previous releases: [http://www.alcf.anl.gov/~hfinkel/genericio/genericio-20170925.tar.gz genericio-20170925.tar.gz], [http://www.alcf.anl.gov/~hfinkel/genericio/genericio-20160829.tar.gz genericio-20160829.tar.gz], [http://www.alcf.anl.gov/~hfinkel/genericio/genericio-20150608.tar.gz genericio-20160412.tar.gz], [http://www.alcf.anl.gov/~hfinkel/genericio/genericio-20150608.tar.gz genericio-20150608.tar.gz]), or from git:

{{{
git clone http://git.mcs.anl.gov/genericio.git
}}}

== Output file partitions (subfiles) ==

If you're running on an IBM BG/Q supercomputer, the number of subfiles (partitions) is chosen automatically based on the I/O nodes. Otherwise, by default, the GenericIO library picks the number of subfiles using a fairly naive hostname-based hashing scheme. This works reasonably well on small clusters, but not on larger systems. On a larger system, you might want to set these environment variables:

{{{
GENERICIO_PARTITIONS_USE_NAME=0
GENERICIO_RANK_PARTITIONS=256
}}}

Here the number of partitions (256 above) determines the number of subfiles used. If you're using a Lustre file system, for example, an optimal number of files satisfies:

number of files * stripe count ~ number of OSTs

On Titan, for example, there are 1008 OSTs and a default stripe count of 4 (1008 / 4 = 252), so we use approximately 256 files.

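The rule of thumb above can be turned into a small calculation. The following is a minimal sketch, not part of GenericIO: the helper name `partition_env` is hypothetical, and it assumes you already know the OST count and stripe count of your file system. It uses only the two environment variables named above.

```python
def partition_env(num_osts, stripe_count):
    """Return GenericIO environment settings for a Lustre file system.

    Applies the rule: number of files * stripe count ~ number of OSTs.
    (Hypothetical helper for illustration; not part of GenericIO.)
    """
    if num_osts < 1 or stripe_count < 1:
        raise ValueError("num_osts and stripe_count must be positive")
    # Always use at least one subfile, even when stripe_count > num_osts.
    num_files = max(1, round(num_osts / stripe_count))
    return {
        "GENERICIO_PARTITIONS_USE_NAME": "0",
        "GENERICIO_RANK_PARTITIONS": str(num_files),
    }


if __name__ == "__main__":
    # Titan figures from the text: 1008 OSTs, default stripe count of 4.
    settings = partition_env(1008, 4)
    for name, value in settings.items():
        print(f"export {name}={value}")
    # To launch a job with these settings from Python, merge them into the
    # environment, e.g. subprocess.run(cmd, env={**os.environ, **settings}).
```

For the Titan figures this yields 252 partitions, which the text rounds to approximately 256.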
== Benchmarks ==

Once you build the library and associated programs (using make), you can run, for example:

{{{
$ mpirun -np 8 ./mpi/GenericIOBenchmarkWrite /tmp/out.gio 123456 2
Wrote 9 variables to /tmp/out (4691036 bytes) in 0.2361s: 18.9484 MB/s
}}}

{{{
$ mpirun -np 8 ./mpi/GenericIOBenchmarkRead /tmp/out.gio
Read 9 variables from /tmp/out (4688028 bytes) in 0.223067s: 20.0426 MB/s [excluding header read]
}}}

The read benchmark always reads all of the input data. The write benchmark takes two numerical parameters: the first is the number of data rows to write, and the second is a random seed (which slightly perturbs the per-rank output sizes). Each row is 36 bytes for these benchmarks.

The write benchmark can be passed the -c parameter to enable output compression. Both benchmarks take an optional -a parameter to request that homogeneous aggregates (i.e. "float4") be used instead of separate arrays for each position/velocity component.

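The throughput figures printed by the benchmarks can be cross-checked from the byte count and the elapsed time. The sketch below is an illustration rather than GenericIO code, and the function name `throughput_mib_per_s` is hypothetical. It assumes that the reported "MB/s" means 2^20 bytes per second, an assumption that is consistent with both sample outputs above.

```python
def throughput_mib_per_s(num_bytes, seconds):
    """Convert a byte count and elapsed seconds to units of 2**20 bytes/s.

    (Hypothetical helper for illustration; not part of GenericIO.)
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / (1024 * 1024) / seconds


if __name__ == "__main__":
    # Byte counts and times copied from the sample benchmark output above.
    write_rate = throughput_mib_per_s(4691036, 0.2361)
    read_rate = throughput_mib_per_s(4688028, 0.223067)
    print(f"write: {write_rate:.2f}")
    print(f"read:  {read_rate:.2f}")
```

Both results agree with the reported 18.9484 and 20.0426 to within the rounding of the printed times. The read byte count is 3008 bytes smaller than the write byte count, presumably because the read figure excludes the header.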
== Python module ==

The repository includes a genericio Python module that reads GenericIO-formatted files and returns numpy arrays. The module is part of the standard build. Once you've built genericio, you can read GenericIO data as follows:

{{{
$ export PYTHONPATH=${GENERICIO_DIR}/python
$ python
>>> import genericio
>>> genericio.gio_inspect('m000-99.fofproperties')
Number of Elements: 1691
[data type] Variable name
---------------------------------------------
[i 32] fof_halo_count
[i 64] fof_halo_tag
[f 32] fof_halo_mass
[f 32] fof_halo_mean_x
[f 32] fof_halo_mean_y
[f 32] fof_halo_mean_z
[f 32] fof_halo_mean_vx
[f 32] fof_halo_mean_vy
[f 32] fof_halo_mean_vz
[f 32] fof_halo_vel_disp

(i=integer,f=floating point, number bits size)
>>> genericio.gio_read('m000-99.fofproperties','fof_halo_mass')
array([[ 4.58575588e+13],
       [ 5.00464689e+13],
       [ 5.07078771e+12],
       ...,
       [ 1.35221006e+13],
       [ 5.29125710e+12],
       [ 7.12849857e+12]], dtype=float32)
}}}
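In the session above, `gio_read` returns one variable as an array with a single column. When several variables are needed, it can be convenient to collect them into a dictionary of flat sequences. The following is a minimal sketch under stated assumptions: `read_columns` and its `reader` parameter are hypothetical names, not part of the genericio module, and the only genericio call used is `gio_read(path, name)` as shown above.

```python
def read_columns(path, names, reader=None):
    """Read several variables from one GenericIO file into a dict.

    `reader` is any callable taking (path, name); it defaults to
    genericio.gio_read. (Hypothetical helper; not part of genericio.)
    """
    if reader is None:
        # Requires PYTHONPATH to point at the built python directory.
        import genericio
        reader = genericio.gio_read

    columns = {}
    for name in names:
        data = reader(path, name)
        if hasattr(data, "reshape"):
            # numpy array of shape (N, 1): flatten to shape (N,).
            columns[name] = data.reshape(-1)
        else:
            # Plain nested sequences: take the single entry of each row.
            columns[name] = [row[0] for row in data]
    return columns


# Example call, assuming the file from the session above exists:
# halos = read_columns('m000-99.fofproperties',
#                      ['fof_halo_tag', 'fof_halo_mass'])
```

Passing the reader as a parameter keeps the helper usable with a stand-in function on machines where genericio has not been built.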