Storage Cluster Test

Raspberry Pi Setup

Arch Linux is my preferred distro for the Raspberry Pi and is used in this guide.

Arch Linux ARM: https://archlinuxarm.org/

Repeat the following for all Pis in the cluster.

Fixed IP

Disable networkd in favor of netctl:
systemctl disable systemd-networkd.service
Copy static IP example into place:
cd /etc/netctl
cp examples/ethernet-static .
Edit ethernet-static to have appropriate fixed IP:
Description='A basic static ethernet connection'
Interface=eth0
Connection=ethernet
IP=static
Address=('192.168.0.71/24')
#Routes=('192.168.0.0/24 via 192.168.1.2')
Gateway='192.168.0.1'
DNS=('192.168.0.1')
Enable ethernet-static config:
netctl enable ethernet-static
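
Since the same profile is needed on every Pi with only the address changing, the file can be stamped out with a small script. This is a minimal sketch; the make_profile helper name is my own, not part of netctl, and the gateway/DNS values are hard-coded to match the example above:

```shell
#!/bin/sh
# Sketch: generate the netctl static-IP profile for one node.
# make_profile is our own helper name, not a netctl command.
make_profile() {
    addr="$1"   # e.g. 192.168.0.71/24
    out="$2"    # e.g. /etc/netctl/ethernet-static
    cat > "$out" <<EOF
Description='A basic static ethernet connection'
Interface=eth0
Connection=ethernet
IP=static
Address=('$addr')
Gateway='192.168.0.1'
DNS=('192.168.0.1')
EOF
}

# Writing to /tmp for illustration; on the Pi use /etc/netctl/ethernet-static.
make_profile '192.168.0.71/24' /tmp/ethernet-static
```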

Set Hostname

Edit /etc/hostname:
pi01

Overclock the slower Pi B+

Note: holding down the shift key during boot up will disable the overclock for that boot, allowing you to select a lower level.

Edit /boot/config.txt:
arm_freq=1000
sdram_freq=500
core_freq=500
over_voltage=6
temp_limit=75
boot_delay=0
disable_splash=1

gpu_mem=16

Performance Utilities

Read the temperature:
/opt/vc/bin/vcgencmd measure_temp
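
For logging, the numeric value can be stripped out of the temp=48.3'C format that vcgencmd prints; parse_temp is my own helper name, not part of the firmware tools:

```shell
# Sketch: extract the numeric temperature from vcgencmd's output format.
# parse_temp is our own helper name.
parse_temp() {
    echo "$1" | sed "s/temp=\([0-9.]*\).*/\1/"
}

# On a real Pi: parse_temp "$(/opt/vc/bin/vcgencmd measure_temp)"
parse_temp "temp=48.3'C"    # prints 48.3
```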

SD Write Speed Test:
[root@pi01 ~]#  sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 68.0134 s, 7.7 MB/s

real   1m8.033s
user   0m0.001s
sys   0m10.100s

real   0m11.875s
user   0m0.001s
sys   0m0.013s

SD Read Speed Test:
[root@pi01 ~]# dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 27.4617 s, 19.1 MB/s
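
The same dd invocation gets repeated for every medium below, so it can be wrapped in a reusable function. This is a sketch; speed_test is my own helper name, and the real tests here use count=1024 for the full 500 MiB:

```shell
# Sketch: reusable write-speed test around dd, matching the invocations below.
# speed_test is our own helper name; it prints dd's throughput summary line.
speed_test() {
    target="$1"
    count="${2:-1024}"          # 1024 x 500K = 500 MiB, as in the tests here
    sync
    dd if=/dev/zero of="$target" bs=500K count="$count" 2>&1 | tail -n1
    sync
    rm -f "$target"             # clean up the test file
}

# Small run for illustration; point target at the SD card, USB stick, or array.
speed_test /tmp/test.tmp 8
```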

USB Write Speed Test:
mkdir USB
mount /dev/sda1 USB

[root@pi01 ~]# sync; time dd if=/dev/zero of=~/USB/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 93.4846 s, 5.6 MB/s

real   1m33.499s
user   0m0.013s
sys   0m12.158s

real   0m16.322s
user   0m0.013s
sys   0m0.001s

USB Read Speed Test:
[root@pi01 ~]# dd if=~/USB/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 38.2419 s, 13.7 MB/s

Storage Setup

All 3 Pis will have 4 USB sticks merged and mounted at /storage. The plan is to use a different storage method on each Pi.

Setup Storage on pi01 with mdadm

For this we will try using mdadm to create a software raid 10.

Source: https://www.digitalocean.com/community/tutorials/how-to-create-raid-arrays-with-mdadm-on-ubuntu-16-04
Source: http://www.ducea.com/2009/03/08/mdadm-cheat-sheet/

Delete partitions on each USB drive (repeat for /dev/sda through /dev/sdd):
fdisk /dev/sda
o
w

Identify the Component Devices:
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
Output:
[root@pi01 ~]# lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
NAME         SIZE FSTYPE TYPE MOUNTPOINT
sda         14.8G        disk
sdb         14.8G        disk
sdc         14.8G        disk
sdd         14.8G        disk
mmcblk0      7.4G        disk
|-mmcblk0p1  200M vfat   part /boot
`-mmcblk0p2  7.2G ext4   part /

Create the Array:
mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
Output:
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: partition table exists on /dev/sda
mdadm: partition table exists on /dev/sda but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdb
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdc
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdd
mdadm: partition table exists on /dev/sdd but will be lost or
       meaningless after creating array
mdadm: size set to 15454208K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

The mdadm tool will start to configure the array (it actually uses the recovery process to build the array for performance reasons). This can take some time to complete, but the array can be used during this time. You can monitor the progress of the mirroring by checking the /proc/mdstat file:
cat /proc/mdstat
Output:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md0 : active raid10 sdd[3] sdc[2] sdb[1] sda[0]
      30908416 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [>....................]  resync =  3.0% (949376/30908416) finish=62.5min speed=7987K/sec

unused devices: <none>
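
Rather than eyeballing /proc/mdstat, the resync percentage can be pulled out with grep for polling from a script; resync_pct is my own helper name:

```shell
# Sketch: extract the resync percentage from /proc/mdstat for scripted polling.
# resync_pct is our own helper name.
resync_pct() {
    grep -o 'resync = *[0-9.]*%' "$1" | grep -o '[0-9.]*%'
}

# On the Pi: resync_pct /proc/mdstat
```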

Create and Mount the Filesystem

Create a filesystem on the array:
mkfs.ext4 -F /dev/md0
Create mount point and mount filesystem:
mkdir /storage
mount /dev/md0 /storage
Show available space:
df -h -x devtmpfs -x tmpfs
Output:
Filesystem      Size  Used Avail Use% Mounted on
/dev/mmcblk0p2  7.0G  1.4G  5.3G  21% /
/dev/mmcblk0p1  200M   25M  176M  13% /boot
/dev/md0         29G   45M   28G   1% /storage

Save the Array Layout

To make sure that the array is reassembled automatically at boot, we will have to adjust the /etc/mdadm.conf file. We can automatically scan the active array and append the file by typing:
mdadm --detail --scan | tee -a /etc/mdadm.conf

Afterwards, you could update the initramfs, or initial RAM file system, so that the array is available during the early boot process. On Arch this would mean regenerating the initramfs with mkinitcpio, but I don't think we need it for our purposes since we don't boot from the array.

Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
echo '/dev/md0 /storage ext4 defaults,nofail,discard 0 0' | tee -a /etc/fstab

Your RAID 10 array should now automatically be assembled and mounted each boot.

Speed tests with mdadm

RAID 10 Write Speed Test (Looks CPU bound?):
[root@pi01 storage]# time dd if=/dev/zero of=/storage/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 253.617 s, 2.1 MB/s

real   4m14.632s
user   0m0.001s
sys   0m10.434s

real   0m44.117s
user   0m0.000s
sys   0m0.013s

RAID 10 Read Speed Test:
[root@pi01 storage]# dd if=/storage/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 15.0556 s, 34.8 MB/s

Try to break mdadm array

Source: https://raid.wiki.kernel.org/index.php/Detecting,_querying_and_testing

I just plucked out the middle USB stick of a 3-drive RAID 10 and the system continued to function as normal.

Check status of array:
mdadm --detail /dev/md0
Output:
/dev/md0:
        Version : 1.2
  Creation Time : Wed Feb 21 22:03:50 2018
     Raid Level : raid10
     Array Size : 23181312 (22.11 GiB 23.74 GB)
  Used Dev Size : 15454208 (14.74 GiB 15.83 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Feb 21 23:34:48 2018
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : manjaro-borx01:0  (local to host manjaro-borx01)
           UUID : a4ec3c18:da3ba91b:1e3fb75f:d1e61894
         Events : 53

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       -       0        0        1      removed
       2       8       64        2      active sync   /dev/sde

Drive shows as "removed" so I'll skip the remove step. I plugged the old drive back in and ran the following:
[root@manjaro-borx01 ~]# mdadm /dev/md0 -a /dev/sdd
mdadm: added /dev/sdd
Recovery looks like it will take almost as long as the original RAID build.

If I needed to remove the drive first I would do the following:
mdadm /dev/md0 -r /dev/sdd
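
A cron job could watch for exactly this kind of degradation by counting removed or faulty slots in the --detail output. This is a sketch; degraded_slots is my own helper name:

```shell
# Sketch: count removed/faulty device slots in `mdadm --detail` output so a
# cron job can alert on a degraded array. degraded_slots is our own helper
# name; it reads the detail text on stdin.
degraded_slots() {
    grep -cE 'removed|faulty'
}

# On the Pi: mdadm --detail /dev/md0 | degraded_slots
```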

Setup Storage on pi02 with btrfs

Source: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
Source: https://wiki.archlinux.org/index.php/Btrfs
Source: https://www.howtoforge.com/a-beginners-guide-to-btrfs

Install btrfs user space utilities:
pacman -S btrfs-progs

Use raid10 for both data and metadata:
mkfs.btrfs -m raid10 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd -f
Output:
btrfs-progs v4.15
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               77fecf7f-a8d3-43ba-89a1-9448baa463a9
Node size:          16384
Sector size:        4096
Filesystem size:    58.98GiB
Block group profiles:
  Data:             RAID10            2.00GiB
  Metadata:         RAID10            2.00GiB
  System:           RAID10           16.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  4
Devices:
   ID        SIZE  PATH
    1    14.75GiB  /dev/sda
    2    14.75GiB  /dev/sdb
    3    14.75GiB  /dev/sdc
    4    14.75GiB  /dev/sdd

Mount to /storage:
mkdir /storage

Once you create a multi-device filesystem, you can use any device in the FS for the mount command:

mount /dev/sda /storage

Add the following to /etc/fstab:
/dev/sda      /storage  btrfs   defaults        0       1

Maintenance

Don't forget to periodically scrub:
btrfs scrub start /storage

Check status of scrub:
btrfs scrub status /storage

scrub status for 77fecf7f-a8d3-43ba-89a1-9448baa463a9
   scrub started at Sat Mar  3 03:26:43 2018 and finished after 00:00:51
   total bytes scrubbed: 1006.91MiB with 0 errors
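
The summary line can also be checked from a script, so a periodic job can flag a scrub that found errors. This is a sketch; scrub_errors is my own helper name:

```shell
# Sketch: pull the error count out of `btrfs scrub status` output.
# scrub_errors is our own helper name; it reads the status text on stdin.
scrub_errors() {
    grep -o '[0-9]* errors' | grep -o '^[0-9]*'
}

# On the Pi: btrfs scrub status /storage | scrub_errors
```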

Speed tests with btrfs

RAID 10 Write Speed Test:
[root@pi02 ~]# time dd if=/dev/zero of=/storage/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 60.8095 s, 8.6 MB/s

real   1m0.850s
user   0m0.001s
sys   0m4.429s

real   0m25.362s
user   0m0.008s
sys   0m0.103s

RAID 10 Read Speed Test:
[root@pi02 storage]# dd if=/storage/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 23.4445 s, 22.4 MB/s

Try to break btrfs array

I pulled one drive while writing a 1GB file with no negative impact on the array.

Check status (See device 2 is missing):
[root@manjaro-borx01 pi]# btrfs filesystem show
Label: none  uuid: d0f8e192-573b-4f4c-b81e-caa229a2c06c
   Total devices 4 FS bytes used 2.15GiB
   devid    1 size 14.75GiB used 3.01GiB path /dev/sdc
   devid    3 size 14.75GiB used 3.01GiB path /dev/sde
   devid    4 size 14.75GiB used 3.01GiB path /dev/sdf
   *** Some devices missing
Replace missing drive:
btrfs replace start -f 2 /dev/sdg /storage
Show status:
[root@manjaro-borx01 pi]# btrfs filesystem show
Label: none  uuid: d0f8e192-573b-4f4c-b81e-caa229a2c06c
   Total devices 4 FS bytes used 2.15GiB
   devid    1 size 14.75GiB used 3.06GiB path /dev/sdc
   devid    2 size 14.75GiB used 3.06GiB path /dev/sdg
   devid    3 size 14.75GiB used 3.06GiB path /dev/sde
   devid    4 size 14.75GiB used 3.06GiB path /dev/sdf

Setup Storage on pi03 with zfs

Source: https://project.altservice.com/issues/521
Source: https://wiki.archlinux.org/index.php/ZFS

Prepare the Arch environment:
pacman -Syu

pacman -S base-devel cmake linux-headers

Uncomment the following in your sudoers file:
%wheel ALL=(ALL) ALL

Make sure your normal user is in the wheel group.

Now manually download spl-dkms and zfs-dkms from the AUR and extract them in a regular user's home directory:
wget https://aur.archlinux.org/cgit/aur.git/snapshot/spl-dkms.tar.gz
wget https://aur.archlinux.org/cgit/aur.git/snapshot/zfs-dkms.tar.gz

tar xvzf spl-dkms.tar.gz
tar xvzf zfs-dkms.tar.gz

Now two things need to be done in PKGBUILD for both packages:
  • Change the arch line to look like the following:
    arch=("i686" "x86_64" "armv6h")
    
  • When I did the install, 0.7.5 was the version in AUR but it had issues building and 0.7.6 was available. As a result I needed to change pkgver to:
    pkgver=0.7.6
    

Note: The following takes forever to build on a Raspberry Pi 1. Be patient; it is working. It can take well over an hour.

Now install spl-dkms:
cd spl-dkms
makepkg -csi

Now install zfs-dkms:
cd ../zfs-dkms
makepkg -csi

The above takes so long that you may not be around when it is ready to actually install with pacman. If you miss the prompt you can run the following:
sudo pacman -U zfs-dkms-0.7.6-1-armv6h.pkg.tar.xz zfs-utils-0.7.6-1-armv6h.pkg.tar.xz

Now reboot to make sure you are running the proper kernel.

Install the zfs kernel module:
depmod -a
modprobe zfs
Check that the zfs modules were loaded:
lsmod

zfs                  1229845  0 
zunicode              322454  1 zfs
zavl                    5993  1 zfs
zcommon                43765  1 zfs
znvpair                80689  2 zfs,zcommon
spl                   165409  5 zfs,zavl,zunicode,zcommon,znvpair

Future kernel updates will break this install and require manual intervention. To gain control over this, block upgrades to the kernel; this way you can choose when to upgrade. Edit /etc/pacman.conf and add the following line:
# Pacman won't upgrade packages listed in IgnorePkg and members of IgnoreGroup
IgnorePkg   = linux*

Let's mount some disks

Enable zfs service:
systemctl enable zfs.target

The ZFS on Linux developers recommend using device ids when creating ZFS storage pools of fewer than 10 devices. To find the ids, simply:
ls -lh /dev/disk/by-id/
The ids should look similar to the following:
lrwxrwxrwx 1 root root 13 Feb 24 01:38 mmc-00000_0x89f9628d -> ../../mmcblk0
lrwxrwxrwx 1 root root 15 Feb 24 01:38 mmc-00000_0x89f9628d-part1 -> ../../mmcblk0p1
lrwxrwxrwx 1 root root 15 Feb 24 01:38 mmc-00000_0x89f9628d-part2 -> ../../mmcblk0p2
lrwxrwxrwx 1 root root  9 Feb 24 20:40 usb-Flash_USB_Disk_37270114F12D098819575-0:0 -> ../../sda
lrwxrwxrwx 1 root root  9 Feb 24 20:40 usb-Flash_USB_Disk_37270324074E902919147-0:0 -> ../../sdd
lrwxrwxrwx 1 root root  9 Feb 24 20:40 usb-Flash_USB_Disk_3727064C62C1975323333-0:0 -> ../../sdc
lrwxrwxrwx 1 root root  9 Feb 24 20:40 usb-Flash_USB_Disk_37270929A6E8169419149-0:0 -> ../../sdb
Create a ZFS pool named storage which will mount at /storage:
zpool create -f storage raidz usb-Flash_USB_Disk_37270114F12D098819575-0:0 usb-Flash_USB_Disk_37270324074E902919147-0:0 usb-Flash_USB_Disk_3727064C62C1975323333-0:0 usb-Flash_USB_Disk_37270929A6E8169419149-0:0
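
Typing those long ids out is error-prone, so they can be collected from /dev/disk/by-id instead. This is a sketch; list_usb_ids is my own helper name, and it assumes all four sticks show up with the usb- prefix as above:

```shell
# Sketch: collect the usb-* device ids (skipping partition links) so they can
# be fed to zpool create. list_usb_ids is our own helper name.
list_usb_ids() {
    ls "$1" | grep '^usb-' | grep -v 'part'
}

# On the Pi: zpool create -f storage raidz $(list_usb_ids /dev/disk/by-id)
```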

To automatically mount a pool at boot time execute:
zpool set cachefile=/etc/zfs/zpool.cache storage

In order to mount zfs pools automatically on boot you need to enable the following services and targets:
systemctl enable zfs-import-cache
systemctl enable zfs-mount
systemctl enable zfs-import.target
Reboot to test.

Maintenance

Don't forget to periodically scrub the zfs pool:
zpool scrub storage
Check status of scrub:
zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Fri Mar  2 22:20:44 2018
config:

   NAME                                              STATE     READ WRITE CKSUM
   storage                                           ONLINE       0     0     0
     raidz1-0                                        ONLINE       0     0     0
       usb-Flash_USB_Disk_3727073D57B5E50068851-0:0  ONLINE       0     0     0
       usb-Flash_USB_Disk_37270159CFECB74014160-0:0  ONLINE       0     0     0
       usb-SanDisk_Ultra_4C531001540408112233-0:0    ONLINE       0     0     0
       usb-SanDisk_Ultra_4C531001600408109430-0:0    ONLINE       0     0     0

errors: No known data errors

Speed tests with zfs

First speed tests without tweaking

raidz (RAID5) Write Speed Test:
[root@pi03 ~]# time dd if=/dev/zero of=/storage/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 159.19 s, 3.3 MB/s

real    2m39.212s
user    0m0.014s
sys     0m5.790s

real    0m7.041s
user    0m0.008s
sys     0m0.005s

raidz (RAID5) Read Speed Test:
[root@pi03 ~]# dd if=/storage/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 21.7062 s, 24.2 MB/s

Try to break zfs raidz array

First check status of working raidz array:
[root@manjaro-borx01 storage]# zpool status

  pool: storage
 state: ONLINE
  scan: none requested
config:

NAME                                              STATE     READ WRITE CKSUM
storage                                           ONLINE       0     0     0
  raidz1-0                                        ONLINE       0     0     0
    usb-Flash_USB_Disk_3727073D57B5E50068851-0:0  ONLINE       0     0     0
    usb-Flash_USB_Disk_37271220BA47887714528-0:0  ONLINE       0     0     0
    usb-SanDisk_Ultra_4C531001540408112233-0:0    ONLINE       0     0     0
    usb-SanDisk_Ultra_4C531001600408109430-0:0    ONLINE       0     0     0

errors: No known data errors

Now while writing a large file to the array, pluck out a random drive and check status:
[root@manjaro-borx01 storage]# zpool status

  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

NAME                                              STATE     READ WRITE CKSUM
storage                                           DEGRADED     0     0     0
  raidz1-0                                        DEGRADED     0     0     0
    usb-Flash_USB_Disk_3727073D57B5E50068851-0:0  ONLINE       0     0     0
    usb-Flash_USB_Disk_37271220BA47887714528-0:0  UNAVAIL      0     0     0
    usb-SanDisk_Ultra_4C531001540408112233-0:0    ONLINE       0     0     0
    usb-SanDisk_Ultra_4C531001600408109430-0:0    ONLINE       0     0     0

errors: No known data errors

Insert a new drive and replace UNAVAIL disk:
zpool replace storage usb-Flash_USB_Disk_37271220BA47887714528-0:0 usb-Flash_USB_Disk_37270159CFECB74014160-0:0

Check status:
[root@manjaro-borx01 storage]# zpool status

  pool: storage
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Mar  1 23:31:45 2018
1.17G scanned out of 4.66G at 4.77M/s, 0h12m to go
300M resilvered, 25.16% done
config:

NAME                                                STATE     READ WRITE CKSUM
storage                                             DEGRADED     0     0     0
  raidz1-0                                          DEGRADED     0     0     0
    usb-Flash_USB_Disk_3727073D57B5E50068851-0:0    ONLINE       0     0     0
    replacing-1                                     DEGRADED     0     0     0
      usb-Flash_USB_Disk_37271220BA47887714528-0:0  UNAVAIL      0     0     0
      usb-Flash_USB_Disk_37270159CFECB74014160-0:0  ONLINE       0     0     0  (resilvering)
    usb-SanDisk_Ultra_4C531001540408112233-0:0      ONLINE       0     0     0
    usb-SanDisk_Ultra_4C531001600408109430-0:0      ONLINE       0     0     0

Other Performance Tests

Raspberry Pi 3 with same USB Sticks

USB Write Speed Test:
mkdir USB
mount /dev/sda1 USB

[root@pi01 ~]# sync; time dd if=/dev/zero of=~/USB/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 45.4632 s, 11.5 MB/s

real    0m45.473s
user    0m0.001s
sys     0m4.613s

real    0m11.998s
user    0m0.006s
sys     0m0.000s

USB Read Speed Test:
[root@pi01 ~]# dd if=~/USB/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 22.6531 s, 23.1 MB/s

Load average was 3.78 to 4.2 with very minimal CPU usage while initializing the RAID
  • So not CPU bound?
  • Slow USB controller?
Load average was 5+ with very minimal CPU usage while running the speed test on the RAID 10

RAID 10 Write Speed Test:
[root@pi01 storage]# time dd if=/dev/zero of=/storage/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 156.956 s, 3.3 MB/s

real    2m36.963s
user    0m0.001s
sys     0m3.202s

real    0m51.538s
user    0m0.005s
sys     0m0.001s

RAID 10 Read Speed Test:
[root@pi01 storage]# dd if=/storage/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 5.77702 s, 90.8 MB/s

Core i7 with same USB Sticks

Not a great test on modern hardware

USB Write Speed Test:
mkdir USB
mount /dev/sda1 USB

[root@pi01 ~]# sync; time dd if=/dev/zero of=~/USB/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.416458 s, 1.3 GB/s

real   0m0.436s
user   0m0.000s
sys   0m0.270s

real   0m52.598s
user   0m0.000s
sys   0m0.000s

USB Read Speed Test:
[root@pi01 ~]# dd if=~/USB/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.126265 s, 4.2 GB/s

Creating and resyncing the array is only slightly faster than on the Pi 3
  • Load average 3.3+

RAID 10 Write Speed Test:
[root@manjaro-borx01 storage]# time dd if=/dev/zero of=/storage/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.228207 s, 2.3 GB/s

real   0m0.231s
user   0m0.007s
sys   0m0.227s

real   1m11.674s
user   0m0.003s
sys   0m0.000s

RAID 10 Read Speed Test:
[root@manjaro-borx01 storage]# dd if=/storage/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.0977792 s, 5.4 GB/s

GlusterFS Cluster Setup

Source: https://sysadmins.co.za/setup-a-3-node-replicated-storage-volume-with-glusterfs/
Source: https://wiki.archlinux.org/index.php/Glusterfs
Source: http://sumglobal.com/rpi-glusterfs-install/
Source: https://nickhowell.co.uk/2016/07/23/raspberry-pi-nas-with-gluster/

First thing to do is add all nodes to DNS. Once that is done install glusterfs on all nodes:
pacman -S glusterfs rpcbind

Enable and start glusterd and rpcbind service on each node:
systemctl enable rpcbind.service
systemctl enable glusterd
systemctl start glusterd

Let's call pi01 the master. From the master, probe each peer:
[root@pi01 ~]# gluster peer probe pi01
peer probe: success. Probe on localhost not needed
[root@pi01 ~]# gluster peer probe pi02
peer probe: success.
[root@pi01 ~]# gluster peer probe pi03
peer probe: success.

Clear any test data from /storage and then create a 'brick' subfolder in each /storage folder:
cd /storage ; mkdir brick

List the gluster pool:
[root@pi01 ~]# gluster pool list

UUID               Hostname    State
426d9109-eb3d-4e87-b116-b1b7327245c2   pi02        Connected
97963c16-0073-491d-ab04-85bf1516294b   pi03        Connected
36867961-309b-49e4-900a-b02093dee76d   localhost   Connected
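
The pool state can be checked from a script by counting Connected peers in that listing; connected_peers is my own helper name:

```shell
# Sketch: count Connected peers in `gluster pool list` output.
# connected_peers is our own helper name; it reads the listing on stdin.
connected_peers() {
    grep -c 'Connected'
}

# On pi01: gluster pool list | connected_peers   (expect 3 for this cluster)
```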

Let's create our Replicated GlusterFS Volume, named gfs:
gluster volume create gfs replica 3 \
            pi01:/storage/brick \
            pi02:/storage/brick \
            pi03:/storage/brick 

volume create: gfs: success: please start the volume to access data

Ensure volume is created correctly:
[root@pi01 ~]# gluster volume info

Volume Name: gfs
Type: Replicate
Volume ID: 944475ff-82ef-4b0c-96f2-cdf946651a95
Status: Created
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: pi01:/storage/brick
Brick2: pi02:/storage/brick
Brick3: pi03:/storage/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Start volume:
[root@pi01 ~]# gluster volume start gfs
volume start: gfs: success

View the status of our volume:
[root@pi01 ~]# gluster volume status gfs

Status of volume: gfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick pi01:/storage/brick                   49152     0          Y       687
Brick pi02:/storage/brick                   49152     0          Y       609
Brick pi03:/storage/brick                   49152     0          Y       1533
Self-heal Daemon on localhost               N/A       N/A        Y       709
Self-heal Daemon on pi02                    N/A       N/A        Y       631
Self-heal Daemon on pi03                    N/A       N/A        Y       1643

Task Status of Volume gfs
------------------------------------------------------------------------------
There are no active volume tasks

By default, GlusterFS allows any client to connect. To authorize only these 3 nodes to connect to the GlusterFS volume:
[root@pi01 ~]# gluster volume set gfs auth.allow 192.168.0.71,192.168.0.72,192.168.0.73
volume set: success

Then if you would like to remove this rule:
gluster volume set gfs auth.allow *

Now mount the volume on each host:
mkdir -p /mnt/glusterClientMount
mount.glusterfs localhost:/gfs /mnt/glusterClientMount

Verify the Mounted Volume:
[root@pi03 storage]# df -h

Filesystem      Size  Used Avail Use% Mounted on
dev             232M     0  232M   0% /dev
run             239M  280K  239M   1% /run
/dev/mmcblk0p2  7.0G  1.7G  5.0G  26% /
tmpfs           239M     0  239M   0% /dev/shm
tmpfs           239M     0  239M   0% /sys/fs/cgroup
tmpfs           239M     0  239M   0% /tmp
/dev/mmcblk0p1  200M   25M  176M  13% /boot
storage          43G  128K   43G   1% /storage
tmpfs            48M     0   48M   0% /run/user/1000
localhost:/gfs   43G  435M   43G   1% /mnt/glusterClientMount

Now add a file at /mnt/glusterClientMount on one of the nodes and check that it exists on all 3 nodes in the same location.
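
One way to script that check is to collect a checksum of the file from each node (e.g. over ssh, not shown) and confirm they all agree. all_match is my own helper; it just verifies that every checksum in a list is identical:

```shell
# Sketch: given "checksum path" lines collected from each node, report whether
# all checksums agree. all_match is our own helper name.
all_match() {
    [ "$(awk '{print $1}' "$1" | sort -u | wc -l)" -eq 1 ]
}

# Example input, one line per node (e.g. gathered with
#   ssh piNN md5sum /mnt/glusterClientMount/somefile):
printf 'd41d8c pi01\nd41d8c pi02\nd41d8c pi03\n' > /tmp/sums
all_match /tmp/sums && echo 'replica consistent'
```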

Make it mount at boot on all nodes!:
echo 'localhost:/gfs /mnt/glusterClientMount glusterfs defaults,_netdev 0 0' >> /etc/fstab

GlusterFS Client Setup

Source: http://docs.gluster.org/en/latest/Administrator%20Guide/Setting%20Up%20Clients/

Gluster Native Client

Add the FUSE loadable kernel module (LKM) to the Linux kernel:
modprobe fuse

Verify that the FUSE module is loaded:
dmesg | grep -i fuse

Install glusterfs tools on the client:
pacman -S glusterfs

Make sure your client is allowed to connect to the cluster:
[root@pi01 ~]# gluster volume set gfs auth.allow 192.168.0.71,192.168.0.72,192.168.0.73,192.168.0.99
volume set: success

Mount on the client:
mkdir -p /mnt/glusterClientMount
mount -t glusterfs pi01:/gfs /mnt/glusterClientMount

Make it mount at boot:
echo 'pi01:/gfs /mnt/glusterClientMount glusterfs defaults,_netdev 0 0' >> /etc/fstab

Followup

2018-03-13 Status

So pi03 with ZFS basically ate itself. It acted like a single physical drive failed due to too many write errors, but I was unable to replace the device. It turns out two of the thumb drives are dead. They no longer come up with their correct drive ids; instead they display as generic. I can partition them, but one does not save and both are tiny.

  • This looked like it would work but didn't help; I could not get the brick going after rebuilding storage:
  • Created /storage/brick2 and replaced the old brick:
    • gluster volume replace-brick gfs pi03:/storage/brick pi03:/storage/brick2/ commit force
  • Watch status of heal
    • gluster volume heal gfs info

pi01 with an mdadm array doesn't look good. One drive is flashing constantly with the CPU at around 3% and load average at 8.4. Running mdadm --detail /dev/md0 hangs. After a reboot it looks like two disks died: one is read-only and acting funny; the other is just gone although it has a status LED.

  • Bottom left and top right are dead. Bottom left was the one flashing before reboot.
  • Added a 3rd drive
    • mdadm /dev/md0 -a /dev/sdb
    • mdadm --detail /dev/md0

So far the most reliable has been pi02 with btrfs. No issues yet.
  • Wear and tear better on the thumb drives?
  • Just dumb luck?
  • ZFS with the same drives on a real Linux box was still OK.

2018-04-09 Status

pi01 is happy. I'm still missing one drive since I didn't have a spare but the array is in good shape.

pi02 is unreachable. Even after a power cycle I can't ping or ssh into the machine.

pi03 with ZFS lost another drive. These USB sticks just suck and can't handle the load from ZFS. The machine was also in bad shape because 512MB is not enough RAM for this application. After a restart I could run zpool scrub, and the bad drive was FAULTED.

gluster volume status shows pi01 and pi03 are present. As expected pi02 is MIA.
Topic revision: r15 - 09 Apr 2018, BobWicksall