Public Datasets on Amazon Web Services
Much to my excitement, Amazon is making some really huge datasets available to its cloud customers. The datasets are published as EBS snapshots, which can be turned into volumes and then attached to instances in EC2. They range in size from 10 gigabytes to over 300 gigabytes. There are a couple of real perks that come along with this. The first is that it doesn’t take a week to download the dataset in the first place: it took me only 5 minutes to get access to 50 gigabytes of data. Here’s how:
1) Locate the snapshotId you’d like to use (snap-1781757e for Wikipedia Extraction (WEX)).
2) Create a volume from your AWS console, entering the snapshotId there. The volume must be at least as large as the snapshot. Also, I’ve found that I need to create the volume in the same availability zone as the instance I’m going to attach it to.
3) Attach the volume to an instance. Select the volume and then click “attach volume”. Suppose, for example, you chose “/dev/sdf” as the device.
4) ssh to your instance, create the mount point if it doesn’t exist yet, and mount the volume: mkdir -p /vol && mount /dev/sdf /vol
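If you’d rather script steps 2 through 4 than click through the console, here is a rough sketch of what that could look like. It only builds and prints the commands; it doesn’t run anything. It assumes the current AWS command-line tool (`aws ec2 ...`), which is not what I used above. The zone `us-east-1a` and the IDs `i-EXAMPLE` and `vol-EXAMPLE` are made-up placeholders; the real volume ID is whatever `create-volume` reports back.

```python
import shlex


def plan_commands(snapshot_id, zone, instance_id, volume_id,
                  device="/dev/sdf", mount_point="/vol"):
    """Return the commands for steps 2-4 as strings, without running them."""
    steps = [
        # Step 2: create a volume from the snapshot, in the instance's zone.
        ["aws", "ec2", "create-volume",
         "--snapshot-id", snapshot_id,
         "--availability-zone", zone],
        # Step 3: attach the new volume to the instance as a block device.
        ["aws", "ec2", "attach-volume",
         "--volume-id", volume_id,
         "--instance-id", instance_id,
         "--device", device],
        # Step 4 (run on the instance itself): create the mount point, then mount.
        ["sudo", "mkdir", "-p", mount_point],
        ["sudo", "mount", device, mount_point],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # Placeholder zone and IDs (assumptions); only the snapshot ID is from the post.
    for command in plan_commands("snap-1781757e", "us-east-1a",
                                 "i-EXAMPLE", "vol-EXAMPLE"):
        print(command)
```

The first two commands would run from wherever your AWS credentials live, the last two on the instance. On newer instances the attached device may show up under a different name (for example /dev/xvdf), so check before mounting.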
There you have it. But we’re not quite done: we still need to import all of this data into MySQL or PostgreSQL. But that’s another post…















