pero on anything - Hadoop Indexes

Topic: Technology 9:31 pm EDT, Jun  6, 2009

It is just a simple order log with a unique identifier (id) and a single associated product and customer. Since we want to view our data from different perspectives, we added two additional indexes on product and customer. (In this example we need two indexes because MySQL can only use the leftmost prefix of an index.)
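A minimal sketch of the kind of schema described above, using SQLite for illustration (the post targets MySQL; the exact column and index definitions are assumptions, since the original schema is not shown here):

```python
import sqlite3

# Hypothetical schema matching the description: an order log with an id,
# a product and a customer (SQLite stands in for MySQL here).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,  -- unique identifier
        product  TEXT NOT NULL,
        customer TEXT NOT NULL
    )
""")
# Two additional indexes so both access paths are covered: the composite
# (customer, product) index only helps queries that constrain its leftmost
# column (customer), hence the separate index on product.
conn.execute("CREATE INDEX idx_product_customer ON orders (customer, product)")
conn.execute("CREATE INDEX idx_product ON orders (product)")

conn.executemany(
    "INSERT INTO orders (product, customer) VALUES (?, ?)",
    [("product_A", "customer_1"), ("product_A", "customer_2"),
     ("product_B", "customer_1")])

# "All products a customer ordered" -- answerable via idx_product_customer:
products = [row[0] for row in conn.execute(
    "SELECT DISTINCT product FROM orders WHERE customer = ? ORDER BY product",
    ("customer_1",))]
print(products)  # ['product_A', 'product_B']
```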
Dumping the whole table into your Hadoop cluster as a single CSV file would mean that you always have to do what an RDBMS calls a "full table scan". It would be pretty much the same as removing all indexes from your MySQL table: try searching for all products a customer ordered without the index idx_product_customer. (In fact, Hadoop would perform this full table scan an order of magnitude faster.) It would be ridiculous to remove all indexes from your table, yet that is effectively what you did when you exported the whole table into a flat file!
What you should do, and what we did with great success, is to split up your flat CSV file and arrange the data so that you can decide beforehand which part of it needs to be accessed. So let's split up the data and simulate all of the indexes (besides the primary key; more on that later). A file-system layout could look like this:
orders/
  product_A/
    customer_1.csv
    customer_2.csv
  product_B/
    customer_1.csv
    customer_3.csv

So when searching for all orders customer_1 placed, we just use the file pattern orders/*/customer_1.csv. Remember: HDFS and MapReduce input formats (like FileInputFormat) support globbing.
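This write-then-glob workflow can be sketched on a local filesystem with plain Python (the toy data and file names below are illustrative assumptions; on a real cluster the files would live in HDFS and the glob would be passed to FileInputFormat):

```python
import csv
import glob
import os

# Toy order log: (id, product, customer) rows -- a stand-in for the MySQL dump.
orders = [
    (1, "product_A", "customer_1"),
    (2, "product_A", "customer_2"),
    (3, "product_B", "customer_1"),
    (4, "product_B", "customer_3"),
]

# Write one CSV per (product, customer) pair, mirroring the layout above:
# orders/<product>/<customer>.csv (append, so a pair can hold many orders).
for oid, product, customer in orders:
    path = os.path.join("orders", product, f"{customer}.csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([oid, product, customer])

# "Index lookup": read only the partitions matching customer_1 via globbing,
# instead of scanning every file.
rows = []
for path in sorted(glob.glob("orders/*/customer_1.csv")):
    with open(path, newline="") as f:
        rows.extend(tuple(r) for r in csv.reader(f))

print(rows)  # only the rows for orders 1 and 3 are ever read
```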

Now we have effectively simulated indexes by partitioning the data!

From here on you can go into more detail, depending on your data's structure. For example, you could add the date and id range to the file name like this:
orders/product_A/customer_1.2009-06-04.2009-06-05.1000.2000.csv
orders/product_A/customer_1.2009-06-06.2009-06-07.5000.7000.csv
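One way such range-encoded names could be used (the parsing helper below is an assumption based on the example names, not the author's code): decode each file's date and id range from its name and skip any file whose range cannot contain the query, before reading a single byte of data.

```python
from datetime import date

def parse_ranges(filename):
    """Split a name like 'customer_1.2009-06-04.2009-06-05.1000.2000.csv'
    into its customer, date range and id range (format assumed from the
    example file names above)."""
    stem = filename[: -len(".csv")]
    customer, d_from, d_to, id_from, id_to = stem.split(".")
    return (customer,
            date.fromisoformat(d_from), date.fromisoformat(d_to),
            int(id_from), int(id_to))

def covers_date(filename, day):
    """True if the file's date range can contain the queried day."""
    _, d_from, d_to, _, _ = parse_ranges(filename)
    return d_from <= day <= d_to

names = [
    "customer_1.2009-06-04.2009-06-05.1000.2000.csv",
    "customer_1.2009-06-06.2009-06-07.5000.7000.csv",
]
hits = [n for n in names if covers_date(n, date(2009, 6, 6))]
print(hits)  # only the second file needs to be read
```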
