
Data Backup to Box

posted Jul 24, 2019, 10:18 AM by Danny Xu   [ updated Jul 24, 2019, 10:25 AM ]


  • We use Google Drive for the large files. It has a 5 TB maximum file size, but upload speed is throttled once you exceed 750 GB in a day. I’ve seen 20+ MB/s on transfers, but beyond 750 GB the daily limit slows this down. This is fine for archive, which is important, but not for analysis.
  • We also have the Box agreement here at OSU. The 15 GB file size limit is a pain, so we recommend using Box as backup and not for active data analysis. We also found out that there is a limit to "unlimited" data; our usage page currently reports *Storage Used* 909.5 TB of Unlimited.
  • At Northwestern we have the same Box agreement. Although the service offers unlimited storage, the 15 GB file size limit definitely makes large transfers very difficult for our users.
  • ISU (Iowa State) offers Google Drive and Box with unlimited storage, but either the file size or the transfer rate is limited, so neither behaves like a local filesystem.

Box is actually fairly fast when large numbers of files are transferred (we have gotten over 500 MB/s with parallel streams from a parallel filesystem), but the files cannot be large (we paid to raise the file size limit to 15 GB; the standard limit is 5 GB).

============== File Split and Backup to Box ==============



FTPS is a bit of a pain, but it works after setting a local, non-federated password for the Box user's account. Box is a great place to archive old project directories or share data out to third parties. We use Rclone for this. Highly recommended.
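
For reference, a minimal Rclone setup for Box looks something like the sketch below; "Box" is just the remote name assumed in the examples further down, and the interactive config walks you through Box's OAuth authorization in a browser.

  # One-time, interactive setup: answer "n" for a new remote, name it "Box",
  # pick the "box" storage type, and accept the defaults to run the OAuth flow.
  rclone config

  # Sanity check that the remote works: list the top-level Box folders.
  rclone lsd Box: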

A few tips on using Rclone with Box... Watch out for the 5 GB file limit. You can get around it with tar and the Linux split utility, though I prefer zip and then zipsplit for end-users. Don't try to upload thousands of individual files; the overhead of statting and comparing each file is stupendous.

  Something like:
  # Zip the directory (verbose, recursive); this produces DIR.zip.
  zip -v -r DIR dirtozip
  # Split DIR.zip into pieces no larger than 4 GiB each, safely under the 5 GB limit.
  zipsplit -n $(( 4 * 1024 * 1024 * 1024 )) DIR.zip
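
  A nice property of the zipsplit route: each piece it writes is itself a complete, valid zip archive (named along the lines of DIR1.zip, DIR2.zip, ...), so nothing has to be reassembled on the way back down. A rough sketch of the round trip, with the Box path just an assumed example:

  # Upload the pieces (move the original, oversized DIR.zip out of the way first).
  rclone copy -v --include 'DIR[0-9]*.zip' . Box:HPCC/DIR-archive
  # Later: pull them back and unzip each piece on its own.
  rclone copy -v Box:HPCC/DIR-archive ./pieces
  for piece in pieces/DIR[0-9]*.zip; do unzip -o "$piece" -d restored/; done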

 
  If you really want to get fancy, submit all the commands through the scheduler, using tar and split:

  # Staging directory for the split archive pieces.
  mkdir /home/bob/project01-archive

  # Job 1: tar the project, compress with xz, and split into ~5 GB pieces.
  qsub -N tar_project01 -o /home/bob/.config/rclone/ -m e -M bug@wharton.upenn.edu -j y -b y "tar -cvf - /home/bob/project01 | xz -1 - | split -b 5000000000 - /home/bob/project01-archive/project01.tar.xz."

  # Job 2: copy the pieces to Box with 8 parallel transfers; -hold_jid makes it
  # wait for the tar job to finish before starting.
  qsub -N box_project01 -hold_jid tar_project01 -o /home/bob/.config/rclone/ -m e -M bug@wharton.upenn.edu -j y -b y "rclone copy -u -v --transfers 8 /home/bob/project01-archive Box:HPCC/project01-archive"

  # Check the listing on Box before deleting anything locally.
  rclone ls Box:HPCC/project01-archive

  rm -rf /home/bob/project01 /home/bob/project01-archive
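
  To restore one of these archives later, pull the pieces back from Box and run the pipeline in reverse. A minimal sketch along the same lines (tar stripped the leading "/", so the contents land under home/bob/project01 relative to wherever you extract):

  rclone copy -v --transfers 8 Box:HPCC/project01-archive /home/bob/project01-archive
  mkdir -p /home/bob/restore
  # The shell glob sorts split's aa, ab, ... suffixes back into order for cat.
  cat /home/bob/project01-archive/project01.tar.xz.* | xz -d | tar -xv -C /home/bob/restore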

=========================================

Globus now supports Box. That might help with some of the issues raised here, like federated auth and getting data on and off HPC systems.

https://www.globus.org/press-releases/globus-announces-integration-box-cloud-content-management

