What Hadoop is.
Simply put, it's an Apache open source project that allows users to perform highly data-intensive analytics on structured and unstructured data across hundreds or thousands of nodes, whether those nodes are local or geographically dispersed. Hadoop was designed to ingest mammoth amounts of unstructured data and, through its distributed file system (the Hadoop Distributed File System, or HDFS), distribute workloads across a vast network of independent compute nodes to rapidly map, sort and categorize data for big data analytical queries.
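As a concrete illustration of the map-and-sort pattern described above, here is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is a sketch, not anything from the original post; the class name WordCount and the command-line input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs close to each HDFS block and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: receives sorted keys and sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mappers run where the data blocks live, and the framework sorts the intermediate keys before handing them to the reducers, which is exactly the distribute-then-categorize flow described above.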
Hadoop was also designed to work natively with the internal disk resources in each of the independent compute nodes within its clustered framework, both for cost efficiency and to ensure that data is available locally to the processing node. Jobs are managed and delegated across the cluster, where data is parsed, classified and stored on local disk. One block of data is written to the local disk, and two replicas are written to other nodes for redundancy. All copies are readable, so any replica can be used for processing tasks.
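To see that block placement and replication in practice, here is a small sketch using the standard HDFS FileSystem Java API to print a file's replication factor and the DataNodes holding each block replica. It assumes the cluster configuration files (core-site.xml, hdfs-site.xml) are on the client's classpath, and the path /data/sample.txt is a placeholder.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args.length > 0 ? args[0] : "/data/sample.txt");

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());

    // Each BlockLocation lists the DataNodes holding one replica of that block;
    // the scheduler uses this information to run map tasks next to the data.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Block at offset " + block.getOffset()
          + " (" + block.getLength() + " bytes) on hosts "
          + Arrays.toString(block.getHosts()));
    }
  }
}

With the default replication factor of three, each block should report three hosts, which is the one-local-plus-two-replica layout described above.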
In determining whether SAN or NAS storage can be used for Hadoop, we have to consider that Hadoop is disk-agnostic; that means SAN and NAS resources can be used as the primary storage layer to service Hadoop workloads. But a natural follow-up question is this: Will SAN or NAS storage be the most cost-effective way to deploy Hadoop? If a Hadoop implementation will be confined to one or two locations, the benefits of managing a centralized storage resource can make it a sensible choice, especially if an existing array has already been depreciated.