This follows from the previous post where I tried out Hadoop. This one uses Spark on the same Comet cluster.
To run interactively, the best thing to do first is add this to my .bashrc:
Then, in the folder with the Spark code (get it here), run this:
# this didn't work for me
# it only worked after commenting out
# the line that says res=ucla2015
Basically, it sleeps for 4 hours because we need to reserve multiple nodes to use for Spark. This holds the machines for our use, and we then SSH in directly to do our work.
First, source the SLURM environment variables file (auto-created).
Then identify the namenode. You can either
- examine ~/mycluster.conf/hdfs-site.xml and look for the namenode,
- run squeue -u $USER and identify the first machine, which will be your namenode, or
- check the environment variable $SLURMD_NODENAME, which will be set to the namenode.
SSH into the namenode and then
# cd back to the dir with spark files
# and source the SLURM env vars
# run the python spark bindings
This will start an interactive shell that looks like Python's IDLE, and you can type commands interactively.
The first example:
# create 100 numbers
local_data = range(100)

# parallelize() distributes local data into Spark's
# data store. The result is called an RDD. The data
# is sharded over multiple machines, but that's
# invisible to the end user.
data = sc.parallelize(local_data)

# create a filter function
def less_than_ten(d):
    return d < 10

# glom() groups the numbers in each partition into a list;
# collect() gets the pieces from the different machines
# and merges them back into one list.
# take the data, filter it, and collect it back to us
data.filter(less_than_ten).glom().collect()
Most of it is straightforward. The most confusing part is that the data is created and then distributed across the cluster, but that is invisible to us. The data is split across machines with the parallelize() function. Spark calls this an RDD, a resilient distributed dataset. The data cannot be displayed directly (i.e., by typing 'data') because part of it lives on one machine and part on another. So to get it back, you have to tell Spark to fetch the data from the various machines and assemble it so that it can be displayed. That's what collect() does. glom(), as far as I understand, creates one object/entity per machine, which can then be joined.
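A quick way to see the difference (the exact partition boundaries shown are illustrative and depend on how many machines/partitions you have):

data.collect()         # [0, 1, 2, ..., 99], one flat list
data.glom().collect()  # e.g. [[0, ..., 24], [25, ..., 49], ...], one list per partition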
Spark can read from local files, HDFS, and probably other sources. The main advantage of Spark over Hadoop is that all the file manipulations are invisible. The code you would use on your personal computer to analyze one data set is almost the same as the code you would use with Spark; Spark just handles the sharding, managing, and copying of data for you. Talking to a colleague, there's no reason to use Hadoop unless you have legacy code/data.
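As a rough sketch (the paths and namenode host/port here are made up), reading a file looks the same whether it lives on the local filesystem or in HDFS:

# a file on the local filesystem (hypothetical path)
local_lines = sc.textFile("file:///home/me/mydata.txt")

# the same call against HDFS (hypothetical namenode and path)
hdfs_lines = sc.textFile("hdfs://namenode:9000/user/me/mydata.txt")

# downstream code is identical either way
print(local_lines.filter(lambda line: "error" in line).count())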
In Hadoop/Spark parlance, be wary whenever you trigger a 'shuffle'. Shuffles are the operations that require communication between nodes, such as when generating the final output, and that network traffic is exactly the thing you're trying to avoid.
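A small sketch of the difference, reusing the data RDD from above (the operations are just common examples of each kind):

# map() and filter() work within each partition, so there is no shuffle
squared = data.map(lambda d: d * d)

# reduceByKey() has to bring matching keys together on the same node,
# so it triggers a shuffle and therefore network traffic
pairs = data.map(lambda d: (d % 10, d))
sums = pairs.reduceByKey(lambda a, b: a + b)
sums.collect()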
Caching is also recommended immediately after cleaning a data set to your liking. This makes reading it back super fast, instead of hitting all the disks on all the machines and waiting for a response.
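A minimal sketch, reusing the filter function from the earlier example:

# keep the cleaned RDD in memory so later actions reuse it
# instead of recomputing it on every machine
cleaned = data.filter(less_than_ten).cache()

# the first action fills the cache; subsequent ones read from memory
cleaned.count()
cleaned.collect()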
There are lots of machine learning libraries and SQL-like syntax, so this looks like fun stuff, all with the convenience (I guess) of Python.
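For a tiny taste of the SQL-like side (this assumes the Spark 1.x-style SQLContext that shipped with Spark at the time; the table name is made up):

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# turn the RDD of numbers into rows and register it as a table
rows = data.map(lambda d: Row(value=d))
df = sqlContext.createDataFrame(rows)
df.registerTempTable("numbers")

# query it with plain SQL
sqlContext.sql("SELECT COUNT(*) AS small FROM numbers WHERE value < 10").show()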
IPython notebooks are also viewable on Comet, with the program running locally on your computer and communicating back to the cluster.