[gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce in the first instance)

Ed Wahl ewahl at osc.edu
Wed Jul 16 16:29:49 BST 2014


It seems to me that neither IBM nor Intel has done a good job with the marketing and pre-sales of their Hadoop connectors.

  As my site hosts both GPFS and Lustre, I've been paying attention to this. Soon enough I'll need some Hadoop, and I've been rather interested in who tells a convincing story. With IBM it's been like pulling teeth, so far, to get FPO info (other than pricing); Intel has only been slightly better with EE.

It was better with Panache, aka AFM, and there are now quite a few external folks doing all kinds of interesting things with it, from standard caching to trying local-only burst buffers. I'm hopeful that we'll start to see the same with FPO and EE soon.
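
For anyone who wants to experiment with the standard caching case, the AFM setup itself is only a couple of commands. A rough, untested sketch follows; the filesystem, fileset and export names are all invented here, and the exact AFM attribute spellings should be checked against the mmcrfileset man page for your release:

    # Create a read-only AFM cache fileset backed by an NFS export of the
    # home filesystem, then link it into the namespace.
    mmcrfileset cachefs rodata --inode-space new \
        -p afmmode=ro,afmtarget=nfs://homeserver/exports/projects
    mmlinkfileset cachefs rodata -J /gpfs/cachefs/rodata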

I'll be very interested to hear more in this vein.

Ed
OSC

----- Reply message -----
From: "Laurence Alexander Hurst" <L.A.Hurst at bham.ac.uk>
To: "gpfsug-discuss at gpfsug.org" <gpfsug-discuss at gpfsug.org>
Subject: [gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce in the first instance)
Date: Wed, Jul 16, 2014 10:21 AM

Dear GPFSUG,

I've been looking into the possibility of using GPFS with Hadoop, especially as we already have experience with a traditional SAN-based GPFS cluster for our HPC provision (which is on the same network fabric, so integration should be possible and would be desirable).

The proof-of-concept Hadoop cluster I've set up exposes HDFS as well as our current GPFS file system (to allow users to import/export their data between HDFS and the shared filestore). HDFS is a pain to get data into and out of, and it also precludes us from using many deployment tools that mass-update the nodes by reimage and/or reinstall (I know this would also be a problem with GPFS-FPO).
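
To give a flavour of the import/export dance we're trying to avoid, every job currently ends up bracketed by copies along these lines (paths invented for illustration):

    # Stage data from the GPFS mount into HDFS, run the job, pull results back.
    hadoop fs -put /gpfs/hpc/projects/dataset1 /user/alice/dataset1
    hadoop fs -get /user/alice/results /gpfs/hpc/projects/results
    # For bulk transfers, distcp parallelises the copy across the cluster
    # (this assumes the GPFS mount is visible at the same path on every node).
    hadoop distcp file:///gpfs/hpc/projects/dataset1 hdfs:///user/alice/dataset1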

It appears that the GPFS-FPO product is intended to provide HDFS's performance benefits for highly distributed data-intensive workloads with the same ease of use as a traditional GPFS filesystem. One of the things I'm wondering is: can we link this with our existing GPFS cluster sanely? This would avoid having to run additional filesystem gateway servers for our users to import/export their data from outside the system, and would allow, as seamlessly as possible, a clear workflow from generating large datasets on the HPC facility to analysing them (e.g. with a MapReduce job) on the Hadoop facility.
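
As a crude interim step: because GPFS is POSIX, a stock Hadoop install can already read and write the existing SAN-based filesystem through its built-in file:// scheme, albeit losing the HDFS-style locality scheduling that FPO and the connectors are presumably there to provide. An untested sketch, with the jar location and paths being distribution-dependent guesses:

    # Run a stock example job directly against the GPFS mount, no HDFS staging.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        wordcount file:///gpfs/hpc/dataset file:///gpfs/hpc/wc-out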

Looking at FPO, it appears to require being set up as a separate 'shared-nothing' cluster, with additional FPO and (at least three) server licensing costs attached. Presumably we could then use AFM to ingest (copy/sync) data from a Hadoop-specific fileset on our existing GPFS cluster to the FPO cluster, removing the requirement for additional gateways/heads for user (data) access? At least, based on what I've read so far, this would be the way we would have to do it, but it seems convoluted and not ideal.
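
If that is the route, my understanding is the AFM side would look something like the following. This is an untested sketch with all cluster, fileset and path names invented, and it assumes the GPFS-protocol AFM backend is available and the home filesystem is remote-mounted on the FPO cluster:

    # On the FPO cluster: an independent-writer AFM fileset caching a
    # Hadoop-specific fileset from the home (SAN-based) cluster.
    mmcrfileset fpofs hadoopdata --inode-space new \
        -p afmmode=iw,afmtarget=gpfs:///homefs/hadoopfset
    mmlinkfileset fpofs hadoopdata -J /gpfs/fpofs/hadoopdata
    # Optionally pre-populate the cache before a job runs, rather than
    # faulting files in on first access.
    mmafmctl fpofs prefetch -j hadoopdata --list-file /tmp/files-to-fetch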

Or am I completely barking up the wrong tree with FPO?

Has anyone else run Hadoop alongside, or on top of, an existing SAN-based GPFS cluster (and wanted to use data stored on that cluster)? Any tips, if you have? How does it (traditional GPFS or GPFS-FPO) compare to HDFS, especially as regards performance? (I know IBM has produced lots of pretty graphs showing how much more performant GPFS-FPO is than HDFS for particular use cases.)
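
If nobody has numbers to hand, I may just generate my own; something like TestDFSIO run against each filesystem in turn seems the obvious crude comparison. Illustrative only, with the tests jar name and location varying between Hadoop releases:

    # Write then read 16 x 1000 MB files against the default filesystem;
    # repeat with the default filesystem pointed at HDFS and at GPFS in turn.
    hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO \
        -write -nrFiles 16 -fileSize 1000
    hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO \
        -read -nrFiles 16 -fileSize 1000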

Many thanks,

Laurence
--
Laurence Hurst, IT Services, University of Birmingham, Edgbaston, B15 2TT