
MemeStreams Discussion



This page contains all of the posts and discussion on MemeStreams referencing the following web page: pero on anything - Hadoop Indexes. You can find discussions on MemeStreams as you surf the web, even if you aren't a MemeStreams member, using the Threads Bookmarklet.

pero on anything - Hadoop Indexes
by Lost at 9:31 pm EDT, Jun 6, 2009

Just a simple order log with a unique identifier (id) and a single associated product and customer. Since we want to view our data from different perspectives, we added two additional indexes on product and customer. (In this example we need two indexes because MySQL can only use the leftmost prefix of an index.)
Dumping the whole table as a single CSV file into your Hadoop cluster would mean that you always have to do what an RDBMS calls a “full table scan”. It would be much the same as removing all indexes from your MySQL table: try searching for all products a customer ordered without the index idx_product_customer. (In fact, Hadoop would perform this full table scan an order of magnitude faster.) It would be ridiculous to remove all indexes from your table, yet that is effectively what you did when you exported the whole table into a flat file!
What you should do, and what we did with great success, is split up your flat CSV file and arrange the data so that you can decide beforehand which part of the data needs to be accessed. So let’s split up the data and simulate all of the indexes (besides the primary key; more on that later). A file-system layout could look like this:
orders/
  product_A/
    customer_1.csv
    customer_2.csv
  product_B/
    customer_1.csv
    customer_3.csv
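As an illustration only (the post itself shows no code), here is a minimal Python sketch that writes a hypothetical order log into exactly this layout; the `orders` sample data and the `partition` helper are invented for the example:

```python
import csv
import os

# Hypothetical sample of the order log: (id, product, customer)
orders = [
    (1, "product_A", "customer_1"),
    (2, "product_A", "customer_2"),
    (3, "product_B", "customer_1"),
    (4, "product_B", "customer_3"),
]

def partition(rows, root="orders"):
    """Append each row to root/<product>/<customer>.csv."""
    for order_id, product, customer in rows:
        directory = os.path.join(root, product)
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, customer + ".csv")
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([order_id, product, customer])

partition(orders)
```

The directory names themselves carry the index information, so a query on product or customer only ever has to open the matching files.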

So when searching for all orders customer_1 placed, we just use the file pattern orders/*/customer_1.csv. Remember: HDFS and MapReduce input formats (like FileInputFormat) support globbing.
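In a real job you would hand such a pattern straight to your input format; outside Hadoop the same idea can be sketched with Python's standard glob module (the layout and file names below are the hypothetical ones from the example above):

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()

# Recreate the hypothetical partitioned layout with empty files.
for rel in ("product_A/customer_1.csv", "product_A/customer_2.csv",
            "product_B/customer_1.csv", "product_B/customer_3.csv"):
    path = os.path.join(root, "orders", rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# All orders customer_1 placed, across every product partition:
hits = sorted(glob.glob(os.path.join(root, "orders", "*", "customer_1.csv")))
```

Only the files matching the pattern are ever touched; everything under product partitions with no customer_1 orders is skipped entirely.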

We have now effectively simulated indexes by partitioning the data!

From here on you can go into more detail depending on your data structure. As an example, you could add the date and id ranges to the file name, like this:
orders/product_A/customer_1.2009-06-04.2009-06-05.1000.2000.csv
orders/product_A/customer_1.2009-06-06.2009-06-07.5000.7000.csv
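A driver- or mapper-side filter can then prune files by name alone, before reading a single byte. A small Python sketch, assuming the hypothetical `customer.<from-date>.<to-date>.<from-id>.<to-id>.csv` naming shown above:

```python
import os

def covers_date(path, day):
    """True if the file's embedded date range contains `day`
    (ISO dates compare correctly as plain strings)."""
    parts = os.path.basename(path).split(".")
    # customer_1 . 2009-06-04 . 2009-06-05 . 1000 . 2000 . csv
    date_from, date_to = parts[1], parts[2]
    return date_from <= day <= date_to

files = [
    "orders/product_A/customer_1.2009-06-04.2009-06-05.1000.2000.csv",
    "orders/product_A/customer_1.2009-06-06.2009-06-07.5000.7000.csv",
]

# Only the second file can contain orders from 2009-06-06.
relevant = [f for f in files if covers_date(f, "2009-06-06")]
```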


 
 