pero on anything - Hadoop Indexes

Topic: Technology 9:31 pm EDT, Jun  6, 2009

It is just a simple order log with a unique identifier (id) and a single associated product and customer. Since we want to view our data from different perspectives, we added two additional indexes on product and customer. (In this example we need two indexes because MySQL can only use the leftmost prefix of an index.)
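A minimal sketch of the kind of schema described above, using SQLite for illustration (the post targets MySQL; the exact column and index definitions are assumptions, since the original schema is not shown here):

```python
import sqlite3

# Hypothetical schema matching the description: an order log with an id,
# a product and a customer (SQLite stands in for MySQL here).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,  -- unique identifier
        product  TEXT NOT NULL,
        customer TEXT NOT NULL
    )
""")
# Two additional indexes so both access paths are covered: the composite
# (customer, product) index only helps queries that constrain its leftmost
# column (customer), hence the separate index on product.
conn.execute("CREATE INDEX idx_product_customer ON orders (customer, product)")
conn.execute("CREATE INDEX idx_product ON orders (product)")

conn.executemany(
    "INSERT INTO orders (product, customer) VALUES (?, ?)",
    [("product_A", "customer_1"), ("product_A", "customer_2"),
     ("product_B", "customer_1")])

# "All products a customer ordered" -- answerable via idx_product_customer:
products = [row[0] for row in conn.execute(
    "SELECT DISTINCT product FROM orders WHERE customer = ? ORDER BY product",
    ("customer_1",))]
print(products)  # ['product_A', 'product_B']
```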
Dumping the whole table into your Hadoop cluster as a single CSV file would mean that you always have to do what an RDBMS calls a "full table scan". It would be pretty much the same as removing all indexes from your MySQL table: try searching for all products a customer ordered without the index idx_product_customer. (In fact, Hadoop would perform this full table scan an order of magnitude faster.) It would be ridiculous to remove all indexes from your table, yet that is effectively what you did when you exported the whole table into a flat file!
What you should do, and what we did with great success, is to split up your flat CSV file and arrange the data so that you can decide beforehand which part of it needs to be accessed. So let's split up the data and simulate all of the indexes (besides the primary key; more on that later). A file-system layout could look like this:
orders/
  product_A/
    customer_1.csv
    customer_2.csv
  product_B/
    customer_1.csv
    customer_3.csv

So when searching for all orders customer_1 placed, we just use the file pattern orders/*/customer_1.csv. Remember: HDFS and MapReduce input formats (like FileInputFormat) support globbing.
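This write-then-glob workflow can be sketched on a local filesystem with plain Python (the toy data and file names below are illustrative assumptions; on a real cluster the files would live in HDFS and the glob would be passed to FileInputFormat):

```python
import csv
import glob
import os

# Toy order log: (id, product, customer) rows -- a stand-in for the MySQL dump.
orders = [
    (1, "product_A", "customer_1"),
    (2, "product_A", "customer_2"),
    (3, "product_B", "customer_1"),
    (4, "product_B", "customer_3"),
]

# Write one CSV per (product, customer) pair, mirroring the layout above:
# orders/<product>/<customer>.csv (append, so a pair can hold many orders).
for oid, product, customer in orders:
    path = os.path.join("orders", product, f"{customer}.csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([oid, product, customer])

# "Index lookup": read only the partitions matching customer_1 via globbing,
# instead of scanning every file.
rows = []
for path in sorted(glob.glob("orders/*/customer_1.csv")):
    with open(path, newline="") as f:
        rows.extend(tuple(r) for r in csv.reader(f))

print(rows)  # only the rows for orders 1 and 3 are ever read
```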

Now we have effectively simulated indexes by partitioning the data!

From here on you can go into more detail, depending on your data's structure. For example, you could add the date and id range to the file name like this:
orders/product_A/customer_1.2009-06-04.2009-06-05.1000.2000.csv
orders/product_A/customer_1.2009-06-06.2009-06-07.5000.7000.csv
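One way such range-encoded names could be used (the parsing helper below is an assumption based on the example names, not the author's code): decode each file's date and id range from its name and skip any file whose range cannot contain the query, before reading a single byte of data.

```python
from datetime import date

def parse_ranges(filename):
    """Split a name like 'customer_1.2009-06-04.2009-06-05.1000.2000.csv'
    into its customer, date range and id range (format assumed from the
    example file names above)."""
    stem = filename[: -len(".csv")]
    customer, d_from, d_to, id_from, id_to = stem.split(".")
    return (customer,
            date.fromisoformat(d_from), date.fromisoformat(d_to),
            int(id_from), int(id_to))

def covers_date(filename, day):
    """True if the file's date range can contain the queried day."""
    _, d_from, d_to, _, _ = parse_ranges(filename)
    return d_from <= day <= d_to

names = [
    "customer_1.2009-06-04.2009-06-05.1000.2000.csv",
    "customer_1.2009-06-06.2009-06-07.5000.7000.csv",
]
hits = [n for n in names if covers_date(n, date(2009, 6, 6))]
print(hits)  # only the second file needs to be read
```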
