Distributed File Downloading via AWS

In this post I will describe a method for downloading a large number of files from the internet using Amazon Web Services (AWS). This is useful when you have lots of files to download but don't want to tie up your own machine for days, or don't have the hard drive space for the files. It is one solution to the problem; I'm sure there are many others.

This post is a prelude to a future post on quickly downloading company filings from the SEC's EDGAR server. The method described here is more general, though: it can be applied to bulk distributed downloading of any type of data.

Development Environment

The pieces of AWS infrastructure we’ll be using are EC2 and S3. Amazon provides detailed and up-to-date instructions for setting up an AWS development environment. Use theirs.

All code is in Python 2 on Linux (Ubuntu). You will need to set up the AWS command-line interface (awscli) on your local machine, as well as the Amazon Python SDK, Boto3.

I will create an EC2 instance with Ubuntu, install the dependencies, and clone the instance as an Amazon Machine Image (AMI). Once the instance image is created, we can clone our remote development environment and execute code on multiple separate but identical EC2 instances.
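One way to do the cloning step programmatically (it can also be done from the EC2 console) is boto3's create_image; a minimal sketch, where INSTANCE_ID is a placeholder for the ID of the instance you configured by hand and the image name is arbitrary:

import boto3

## Sketch: register the configured instance as an Amazon Machine Image.
## INSTANCE_ID is a placeholder for the ID of the instance set up above.
ec2 = boto3.resource('ec2')
instance = ec2.Instance(INSTANCE_ID)
image = instance.create_image(Name="downloader-ami")
print image.id  ## this ID becomes the AMI_ID used below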

Programming Model

  1. Get a list of URLs you want to download.
  2. Spin up n EC2 instances.
  3. Split the list of URLs into n equal-sized chunks.
  4. Execute a script on each EC2 instance to download its chunk of URLs.
  5. Execute a script to send the downloaded files from each EC2 instance to S3.

Data Location

Local Machine: a Python driver for the remote scripts; the full list of URLs.
Remote EC2 Instance: its chunk of the URL list; the downloaded files (until they are sent to S3).
S3 Bucket: the downloaded files.

Implementation

To illustrate, let's download all 5,293 zip files containing server log data from the SEC's EDGAR system. You can read more about the data on the SEC's website.

Create an EC2 instance with Ubuntu. Install the necessary software dependencies.
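The snippets below also refer to a few constants that you set once for your own account: AMI_ID, SEC_GROUP_ID, PEM_KEY_LOC, and S3_BUCKET. A sketch of that configuration, with placeholder values:

## Placeholder configuration used by the snippets below (replace with your own values)
AMI_ID       = "ami-xxxxxxxx"        ## ID of your cloned Amazon Machine Image
SEC_GROUP_ID = "sg-xxxxxxxx"         ## ID of the security group for the instances
PEM_KEY_LOC  = "~/.ssh/my-key.pem"   ## path to the .pem key file used for ssh/scp
S3_BUCKET    = "my-sec-logs-bucket"  ## name of the target S3 bucket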

1. Download the list of zip files from EDGAR.

import urllib

sec_logfile_data_url = "https://www.sec.gov/files/EDGAR_LogFileData_thru_Jun2017.html"

## Download the log file list from EDGAR to a local HTML file
urllib.urlretrieve(sec_logfile_data_url, "logfilelist.html")

## Generate the list of log file zip URLs
sec_zipfile_list = open("logfilelist.html", "r").readlines()
sec_zipfile_list = [x.strip() for x in sec_zipfile_list]
## Skip the first few lines (HTML header data)
sec_zipfile_list = sec_zipfile_list[9:]

with open("logfile-urls.txt", "w") as outfile:
    outfile.writelines([x+'\n' for x in sec_zipfile_list])

As the code shows, the list of URLs for the zip files containing the SEC server logs will be stored in a file “logfile-urls.txt”.

2. Initiate Remote EC2 Instances

Next, spin up the EC2 instances and store the associated IP addresses in a list. In the code below AMI_ID is the ID of your cloned Amazon Machine Image and SEC_GROUP_ID is the ID of the AWS security group to attach to each instance. I've chosen to create up to 20 instances (MaxCount = 20) of the t2.medium instance type. Amazon documents the available EC2 instance types; t2.medium will be fine for our purposes.

import boto3

## Spawn EC2 instances from the cloned AMI
ec2 = boto3.resource('ec2')
ec2.create_instances(ImageId = AMI_ID,
    InstanceType = "t2.medium",
    SecurityGroupIds = [SEC_GROUP_ID],
    MinCount = 1,
    MaxCount = 20)
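The instances take a little while to boot. If you would rather wait in code than by hand, boto3 provides built-in waiters; a minimal sketch, assuming the only pending instances in your account are the ones just launched:

## Optional: block until the newly launched (pending) instances report "running"
for instance in ec2.instances.filter(
        Filters=[{'Name': 'instance-state-name', 'Values': ['pending']}]):
    instance.wait_until_running()  ## polls EC2 until the instance state is "running"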

After the instances have finished starting up (you can check via the EC2 console), store their IPs:

ec2 = boto3.resource('ec2')        
instances = ec2.instances.filter(Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
ec2_instance_list = [i.public_ip_address for i in instances]

3. Distribute URLs to Download

Split the list of URLs into equal-sized chunks, one for each running EC2 instance. Locally, each chunk of URLs is written to a file named urls-IP.ADDRESS.txt, where "IP.ADDRESS" is the IP address of the remote EC2 instance the chunk will be sent to. On the instances themselves, this file is named simply "urls.txt". Note the variable PEM_KEY_LOC below: since we are using scp to copy the URL files to the instances, you need to tell scp where your .pem key file is located, and that is what this variable holds.

import os

## Send one chunk of URLs to a single EC2 instance
def send_urls_helper(start_index, stop_index, ec2_IP, urls_location):
    ## Create the file name for the chunk of URLs to be sent to this EC2 instance
    urls_sample_filename = "urls-%s.txt" % ec2_IP

    ## Get the chunk of URLs to send to this EC2 instance
    urls = open(urls_location, 'r').readlines()
    urls = urls[start_index:stop_index]
    with open(urls_sample_filename, 'w') as outfile:
        outfile.writelines(urls)

    ## Send the URLs file to the EC2 instance, where it is stored as urls.txt
    os_command = "scp -o StrictHostKeyChecking=no -i %s %s ubuntu@%s:urls.txt" % (PEM_KEY_LOC, urls_sample_filename, ec2_IP)
    print os.system(os_command), os_command
    return None

## Split the large URLs file into equal-sized chunks and send one to each EC2 instance
def send_urls(ec2_instance_list=[], urls_location=""):
    n_instances = len(ec2_instance_list)
    urls = open(urls_location, 'r').readlines()
    n_files = len(urls)
    sample_size = n_files / n_instances
    print '%d files, %d instances, %d files per instance' % (n_files, n_instances, sample_size)
    i = 0
    ## Every instance except the last gets exactly sample_size URLs
    while i < (n_instances - 1):
        send_urls_helper(i*sample_size, (i+1)*sample_size, ec2_instance_list[i], urls_location)
        i += 1
    ## The last instance gets the remainder
    if i == (n_instances - 1):
        send_urls_helper(i*sample_size, n_files, ec2_instance_list[i], urls_location)
    return None
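Calling it with the instance list from step 2 and the URL file from step 1 looks like this:

## Split logfile-urls.txt across the running instances and ship one chunk to each
send_urls(ec2_instance_list=ec2_instance_list, urls_location="logfile-urls.txt")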

4. Download

Using a combination of wget (for downloading from the internet) and xargs (for running several downloads in parallel), the code below sends a command to each EC2 instance to download its chunk of URLs.

for ec2_IP in ec2_instance_list:
    ## Run 2 parallel wget processes per instance, saving the files into serverlogs/
    ec2_command = "<urls.txt xargs -n 1 -P 2 wget -P serverlogs -q"
    os_command = "ssh -o StrictHostKeyChecking=no -i %s -f ubuntu@%s '%s'" % (PEM_KEY_LOC, ec2_IP, ec2_command)
    os.system(os_command)
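Because of ssh's -f flag, the loop returns almost immediately and the downloads continue on the instances. If you want a rough sense of progress, one option (a quick sanity check, not part of the pipeline) is to count the files each instance has pulled down so far:

## Optional sanity check: count the downloaded files on each instance
for ec2_IP in ec2_instance_list:
    check_command = "ssh -o StrictHostKeyChecking=no -i %s ubuntu@%s 'ls serverlogs | wc -l'" % (PEM_KEY_LOC, ec2_IP)
    os.system(check_command)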

5. Collect Files

The zip files will download onto each EC2 instance. After they finish, the last step is to collect them for extraction (unzipping) and analysis (for "doing science" on). The two most straightforward methods are (1) using scp to copy the files to your local machine, and (2) transferring the files to an Amazon S3 bucket. The code below does the latter.

for ec2_IP in ec2_instance_list:
    ## Copy everything in the instance's serverlogs directory (the downloaded zip files) to S3
    ec2_command = "aws s3 cp serverlogs s3://%s/ --recursive" % S3_BUCKET
    os_command = "ssh -o StrictHostKeyChecking=no -i %s -f ubuntu@%s '%s'" % (PEM_KEY_LOC, ec2_IP, ec2_command)
    os.system(os_command)

S3_BUCKET is the name of your target S3 bucket. The 5,293 compressed archives take up about 100 GB of space in all.
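For completeness, option (1), pulling the files back to your local machine with scp instead of sending them to S3, might look like the sketch below. It assumes a local serverlogs directory exists to receive the files, and keep in mind this is roughly 100 GB of data:

## Alternative to S3: copy the zip files from each instance to the local machine.
## Assumes a local "serverlogs" directory already exists.
for ec2_IP in ec2_instance_list:
    os_command = "scp -o StrictHostKeyChecking=no -i %s ubuntu@%s:serverlogs/*.zip serverlogs/" % (PEM_KEY_LOC, ec2_IP)
    os.system(os_command)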
