Sunday 25 December 2016

Java UDFs with Distributed Cache in Apache Pig





Sometimes you need to place files in Distributed Cache and read them from a Java UDF. This post explains the steps for doing that in detail.


First, identify the files you need to place in Distributed Cache and declare them in the Pig script as shown below. Here I am placing two files, file1 and file2, in Distributed Cache and giving them the aliases input1 (#input1) and input2 (#input2). This is the same script from which you will call your Java UDF.


SET mapred.cache.files <path to file1>/file1#input1,<path to file2>/file2#input2;

In the same Pig script, also register the JAR file containing the Java UDF that reads these files.


REGISTER /pathtoudflib/pigudf.jar;


With these settings, the files are distributed to the task nodes through Distributed Cache and linked into each task's working directory under their aliases, so the Java UDF can open them by alias.


In your Java UDF, you can open these files by their aliases as below:

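// Requires: import java.io.BufferedReader; import java.io.FileReader;
// "./input1" and "./input2" are the aliases declared in the Pig script.
FileReader frinput1 = new FileReader("./input1");
FileReader frinput2 = new FileReader("./input2");
BufferedReader brin1 = new BufferedReader(frinput1);
BufferedReader brin2 = new BufferedReader(frinput2);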

Now you can read each file, and do further processing.
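
As a minimal sketch of that further processing, the helper below reads one of the cached files into an in-memory set, so the UDF can do quick lookups against it. The class name CacheFileLoader and the method readLines are hypothetical names chosen for illustration; they are not part of Pig or of the original UDF. It also assumes the file holds one value per line.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (illustration only): loads a cached file into memory.
final class CacheFileLoader {

    private CacheFileLoader() {}

    // Reads every non-blank line of the file at `path` into a set,
    // trimming surrounding whitespace. Assumes one value per line.
    static Set<String> readLines(String path) throws IOException {
        Set<String> lines = new HashSet<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed);
                }
            }
        }
        return lines;
    }
}
```

Inside the UDF, you could call CacheFileLoader.readLines("./input1") the first time exec runs, keep the result in a field, and reuse it for later tuples, so that the file is not read again for every input record.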

Thanks for looking !!! Have a good day !!!
