[gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce in the first instance)

Thu Jul 17 03:14:23 BST 2014

Laurence, Ed,

the GPFS team is very well aware that there is a trend towards running 
analytics on primary data rather than exporting the data and importing it 
somewhere else to run analytics on it, especially if the primary data 
architecture is already scalable (e.g. GPFS based). 

we also understand the need to use/support shared storage for analytics, as 
it is in many areas superior to shared-nothing systems both economically 
and performance-wise, particularly if you have mixed non-sequential 
workloads, significant write content, high utilization, etc. 

i assume you understand that we can't share future plans / capabilities on 
a mailing list, but if you are interested in how/when you can enable an 
existing GPFS filesystem to be used with HDFS Hadoop, please either 
ask your IBM rep to contact me or send me a direct email and we will set 
something up.

------------------------------------------
Sven Oehme 
Scalable Storage Research 
email: oehmes at us.ibm.com 
IBM Almaden Research Lab 
------------------------------------------

From:   Ed Wahl <ewahl at osc.edu>
To:     gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date:   07/16/2014 08:31 AM
Subject:        Re: [gpfsug-discuss] GPFS GPFS-FPO and Hadoop 
(specifically MapReduce in the first instance)
Sent by:        gpfsug-discuss-bounces at gpfsug.org

It seems to me that neither IBM nor Intel has done a good job with the 
marketing and pre-sales of their Hadoop connectors.

  As my site hosts both GPFS and Lustre I've been paying attention to 
this.  Soon enough I'll need some Hadoop, and I've been rather interested 
in who tells a convincing story.  With IBM it's been like pulling teeth, 
so far, to get FPO info (other than pricing).  Intel has only been 
slightly better with EE. 

It was better with Panache, aka AFM, and there are now quite a few 
external folks doing all kinds of interesting things with it, from 
standard caching to trying local-only burst buffers.  I'm hopeful that 
we'll start to see the same with FPO and EE soon.

I'll be very interested to hear more in this vein.

Ed
OSC

----- Reply message -----
From: "Laurence Alexander Hurst" <L.A.Hurst at bham.ac.uk>
To: "gpfsug-discuss at gpfsug.org" <gpfsug-discuss at gpfsug.org>
Subject: [gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce 
in the first instance)
Date: Wed, Jul 16, 2014 10:21 AM

Dear GPFSUG,

I've been looking into the possibility of using GPFS with Hadoop, 
especially as we already have experience with a GPFS (traditional SAN-based) 
cluster for our HPC provision (which is part of the same network fabric, 
so integration should be possible and would be desirable).

The proof-of-concept Hadoop cluster I've set up has HDFS as well as our 
current GPFS file system exposed (to allow users to import/export their 
data between HDFS and the shared filestore).  HDFS is a pain to get data 
in and out of and also precludes us from using many deployment tools that 
mass-update the nodes by reimage and/or reinstall (I know this would also 
be a problem with GPFS-FPO).

It appears that the GPFS-FPO product is intended to provide HDFS's 
performance benefits for highly distributed data-intensive workloads with 
the same ease of use as a traditional GPFS filesystem.  One of the things 
I'm wondering is: can we link this with our existing GPFS cluster sanely? 
This would avoid having to have additional filesystem gateway servers for 
our users to import/export their data from outside the system and allow, 
as seamlessly as possible, a clear workflow from generating large datasets 
on the HPC facility to analysing them (e.g. with a MapReduce function) on 
the Hadoop facility.

Looking at FPO, it appears to require being set up as a separate 
'shared-nothing' cluster, with additional FPO and (at least 3) server 
licensing costs attached.  Presumably we could then use AFM to 
ingest(/copy/sync) data from a Hadoop-specific fileset on our existing 
GPFS cluster to the FPO cluster, removing the requirement for additional 
gateway/heads for user (data) access?  At least, based on what I've read 
so far this would be the way we would have to do it but it seems 
convoluted and not ideal.

Or am I completely barking up the wrong tree with FPO?

Has anyone else run Hadoop alongside, or on top of, an existing SAN-based 
GPFS cluster (and wanted to use data stored on that cluster)?  Any tips, 
if you have?  How does it (traditional GPFS or GPFS-FPO) compare to HDFS, 
especially as regards performance (I know IBM have produced lots of pretty 
graphs showing how much more performant than HDFS GPFS-FPO is for 
particular use cases)?

Many thanks,

Laurence
-- 
Laurence Hurst, IT Services, University of Birmingham, Edgbaston, B15 2TT
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
