[gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce in the first instance)
Sven Oehme
oehmes at us.ibm.com
Thu Jul 17 03:14:23 BST 2014
Laurence, Ed,
the GPFS team is very well aware that there is a trend in moving towards
analytics on primary data vs exporting and importing data somewhere else
to run analytics on it, especially if the primary data architecture is
already scalable (e.g. GPFS based).
we also understand the need to use/support shared storage for analytics as
it is in many areas economically as well as performance-wise superior to
shared-nothing systems, particularly if you have mixed non-sequential
workloads, significant write content, high utilization, etc.
i assume you understand that we can't share future plans / capabilities on
a mailing list, but if you are interested in how/when you can enable an
existing GPFS Filesystem to be used with HDFS Hadoop, please either
ask your IBM rep to contact me or send me a direct email and we will set
something up.
------------------------------------------
Sven Oehme
Scalable Storage Research
email: oehmes at us.ibm.com
IBM Almaden Research Lab
------------------------------------------
From: Ed Wahl <ewahl at osc.edu>
To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date: 07/16/2014 08:31 AM
Subject: Re: [gpfsug-discuss] GPFS GPFS-FPO and Hadoop
(specifically MapReduce in the first instance)
Sent by: gpfsug-discuss-bounces at gpfsug.org
It seems to me that neither IBM nor Intel has done a good job with the
marketing and pre-sales of their Hadoop connectors.
As my site hosts both GPFS and Lustre I've been paying attention to
this. Soon enough I'll need some Hadoop and I've been rather interested
in who tells a convincing story. With IBM it's been like pulling teeth,
so far, to get FPO info (other than pricing). Intel has only been
slightly better with EE.
It was better with Panache, aka AFM, and there are now quite a few
external folks doing all kinds of interesting things with it, from
standard caching to trying local-only burst buffers. I'm hopeful that
we'll start to see the same with FPO and EE soon.
I'll be very interested to hear more in this vein.
Ed
OSC
----- Reply message -----
From: "Laurence Alexander Hurst" <L.A.Hurst at bham.ac.uk>
To: "gpfsug-discuss at gpfsug.org" <gpfsug-discuss at gpfsug.org>
Subject: [gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce
in the first instance)
Date: Wed, Jul 16, 2014 10:21 AM
Dear GPFSUG,
I've been looking into the possibility of using GPFS with Hadoop,
especially as we already have experience with a GPFS (traditional SAN-based)
cluster for our HPC provision (which is part of the same network fabric,
so integration should be possible and would be desirable).
The proof-of-concept Hadoop cluster I've set up has HDFS as well as our
current GPFS file system exposed (to allow users to import/export their
data from HDFS to the shared filestore). HDFS is a pain to get data in
and out of and also precludes us using many deployment tools to
mass-update the nodes by reimage and/or reinstall (I know this would also
be a problem with GPFS-FPO).
It appears that the GPFS-FPO product is intended to provide HDFS's
performance benefits for highly distributed data-intensive workloads with
the same ease of use as a traditional GPFS filesystem. One of the things
I'm wondering is: can we link this with our existing GPFS cluster sanely?
This would avoid having to have additional filesystem gateway servers for
our users to import/export their data from outside the system and allow,
as seamlessly as possible, a clear workflow from generating large datasets
on the HPC facility to analysing them (e.g. with a MapReduce function) on
the Hadoop facility.
Looking at FPO it appears to require being set up as a separate
'shared-nothing' cluster, with additional FPO and (at least 3) server
licensing costs attached. Presumably we could then use AFM to
ingest(/copy/sync) data from a Hadoop-specific fileset on our existing
GPFS cluster to the FPO cluster, removing the requirement for additional
gateway/heads for user (data) access? At least, based on what I've read
so far this would be the way we would have to do it but it seems
convoluted and not ideal.
Or am I completely barking up the wrong tree with FPO?
Has anyone else run Hadoop alongside, or on top of, an existing SAN-based
GPFS cluster (and wanted to use data stored on that cluster)? Any tips,
if you have? How does it (traditional GPFS or GPFS-FPO) compare to HDFS,
especially as regards performance (I know IBM have produced lots of pretty
graphs showing how much more performant GPFS-FPO is than HDFS for
particular use cases)?
Many thanks,
Laurence
--
Laurence Hurst, IT Services, University of Birmingham, Edgbaston, B15 2TT
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss