MapReduce reindexing with authentication
Boxuan Li
Hi Marc,

That is an interesting solution. I was not aware of the mapreduce.application.classpath property. It is not well documented, but from what I understand, this option is primarily used to distribute the mapreduce framework itself rather than user files. Glad to know it can be used for user files as well.

I am not 100% sure, but it seems to require that you upload the file to HDFS first (if you are using a yarn cluster). The ToolRunner approach, however, can also add a file from the local filesystem, and we prefer not to store keytab files on HDFS permanently. This difference is subtle, though. Also, we don't use the gremlin console anyway, so not being able to do this via the gremlin console is not a drawback for us.

Agree with you that the documentation can be enhanced. Right now it simply says "The class starts a Hadoop MapReduce job using the Hadoop configuration and jars on the classpath.", which is too brief and assumes users have a good knowledge of Hadoop MapReduce.

> One could even think of putting the mapreduce properties in the graph properties file and pass on properties of this namespace to the mapreduce client.

Not sure if it's possible, but if someone implements it, it would be very helpful for users who want a quick start without worrying about the cumbersome Hadoop configs.

Best regards,
Boxuan
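P.S. If someone did implement that pass-through, a minimal sketch could look like the following (assuming commons-configuration as bundled with JanusGraph; the class and helper names are made up):

    import java.util.Iterator;
    import org.apache.commons.configuration.PropertiesConfiguration;
    import org.apache.hadoop.conf.Configuration;

    public class MapReduceConfPassThrough {
        // Hypothetical helper: copy every key under the "mapreduce" namespace
        // from the graph properties file into the Hadoop Configuration used
        // by the mapreduce client, so users only maintain a single file.
        public static Configuration fromGraphProperties(String propertiesFile) throws Exception {
            PropertiesConfiguration graphConf = new PropertiesConfiguration(propertiesFile);
            Configuration hadoopConf = new Configuration();
            Iterator<String> keys = graphConf.getKeys("mapreduce");
            while (keys.hasNext()) {
                String key = keys.next();
                hadoopConf.set(key, graphConf.getString(key));
            }
            return hadoopConf;
        }
    }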
hadoopmarc@...
Hi Boxuan,
Yes, you are right, I mixed things up by wrongly interpreting GENERIC_OPTIONS as an env variable. I did some additional experiments, though, bringing in new information.

1. It is possible to put a mapred-site.xml file on the JanusGraph classpath that is automatically loaded by the mapreduce client. When using the file below during mapreduce reindexing, I get the following exception (on purpose):

gremlin> mr.updateIndex(i, SchemaAction.REINDEX).get()
java.io.FileNotFoundException: File file:/tera/lib/janusgraph-full-0.5.3/hi.tgz does not exist

The mapreduce config parameters are listed in:
https://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
The description for mapreduce.application.framework.path suggests that you can pass additional files to the mapreduce workers using this option (without any changes to JanusGraph).

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>dummy</value>
  </property>
  <property>
    <name>mapreduce.application.framework.path</name>
    <value>hi.tgz</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
</configuration>

2. When using mapreduce reindexing in the documented way, it already issues the following warning:

08:49:55 WARN org.apache.hadoop.mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

If you were to resolve your keytab issue by modifying the JanusGraph code and calling hadoop's ToolRunner, you would have the additional advantage of getting rid of this warning. This would not work from the gremlin console, though, unless gremlin.sh passed the additional command line options on to the java command line (ugly). So, I think I would prefer the option with mapred-site.xml.

It would not hurt to slightly extend the mapreduce reindexing documentation, anyway. One could even think of putting the mapreduce properties in the graph properties file and pass on properties of this namespace to the mapreduce client.

Best wishes,
Marc
Boxuan Li
Hi Marc,
Thanks for your explanation. Just to avoid confusion: GENERIC_OPTIONS itself is not an env variable, but a set of configuration options (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options). These options have nothing to do with environment variables.

If I understand you correctly, you are saying that the ToolRunner interface may not be required to submit files. I didn't try it, but I think you are right, because what it does under the hood is simply

if (line.hasOption("files")) { ... }  // eventually sets the conf key "tmpfiles"

which will later be picked up by the Hadoop client. So, theoretically, ToolRunner is not needed, and one can set the Hadoop config themselves. This, however, does not seem to be documented officially anywhere, and it is not guaranteed that the string literal "tmpfiles" will not change in future versions.

Note that even if one wants to set "tmpfiles" themselves for MapReduce reindexing, they still need to modify the JanusGraph source code, because currently the hadoopConf object is created within the MapReduceIndexManagement class and users have no control over it.

Best regards,
Boxuan
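P.S. To make the "no ToolRunner" route concrete, a minimal sketch (the keytab path is made up, and "tmpfiles" is the undocumented internal key discussed above):

    import org.apache.hadoop.conf.Configuration;

    public class TmpFilesSketch {
        public static Configuration confWithKeytab() {
            Configuration conf = new Configuration();
            // GenericOptionsParser sets this key when it sees -files;
            // JobResourceUploader later copies everything listed here into
            // the job staging directory and distributes it to the workers.
            conf.set("tmpfiles", "file:///etc/security/keytabs/es-client.keytab");
            return conf;
        }
    }

Since nothing guarantees this key is stable across Hadoop versions, the ToolRunner route remains the safer one.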
hadoopmarc@...
Hi Boxuan,
Yes, I did not finish my argument. What I tried to suggest: if the hadoop CLI command checks the GENERIC_OPTIONS env variable, then maybe the mapreduce java client called by JanusGraph also checks the GENERIC_OPTIONS env variable. The (old) blog below suggests, however, that this behavior is not present by default but requires the janusgraph code to run hadoop's ToolRunner. So, just see if this is any better than what you had in mind to implement.

https://hadoopi.wordpress.com/2013/06/05/hadoop-implementing-the-tool-interface-for-mapreduce-driver/

Best wishes,
Marc
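P.S. The pattern from that blog boils down to roughly the following sketch (the class name and job wiring are made up, not actual JanusGraph code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ReindexDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() already contains whatever generic options were parsed
            // from the command line (-files, -libjars, -D key=value, ...).
            Configuration conf = getConf();
            // ... build and submit the reindex MapReduce job with conf ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner runs GenericOptionsParser before calling run(),
            // which is exactly what the JobResourceUploader warning asks for.
            System.exit(ToolRunner.run(new Configuration(), new ReindexDriver(), args));
        }
    }

Invoked as, say, "hadoop jar reindex.jar ReindexDriver -files /path/to/es-client.keytab" (paths hypothetical), the file listed after -files would end up in the working directory of every worker.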
Boxuan Li
Hi Marc, you are right, we are indeed using this -files option :)
hadoopmarc@...
Hi Boxuan,
Using existing mechanisms for configuring mapreduce would be nicer, indeed. Upon reading the documentation of the hadoop command, I see a GENERIC_OPTIONS env variable read by the mapreduce client, which can have a -files option. Maybe it is possible to include a jaas file that points to the (already installed?) keytab file on the workers?

Best wishes,
Marc
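P.S. For illustration, such a jaas file might look like the snippet below; every name and path here is hypothetical, and the keyTab location must match wherever the keytab actually lives on the workers:

    // hypothetical jaas.conf, shipped with -files or already present on the workers
    ESClient {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/etc/security/keytabs/es-client.keytab"
        principal="es-client@EXAMPLE.COM";
    };

The client JVM would then be pointed at it with something like -Djava.security.auth.login.config=jaas.conf.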
Boxuan Li
We have been using a yarn cluster to run MapReduce reindexing (Cassandra + Elasticsearch) for a long time. Recently, we introduced Kerberos-based authentication to the Elasticsearch cluster, meaning that worker nodes need to authenticate via a keytab file.
We managed to achieve this by using the hadoop command to include the keytab file when submitting the MapReduce job. Hadoop automatically copies this file and distributes it to the working directory of all worker nodes.

This works well for us, except that we have to make changes to the MapReduceIndexManagement class so that it accepts an org.apache.hadoop.conf.Configuration object (which is created by org.apache.hadoop.util.ToolRunner) rather than instantiating one by itself. We are happy to submit a PR for this, but I would like to hear if there is any better way of handling this.

Cheers,
Boxuan
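P.S. Concretely, the change we have in mind is roughly the following skeleton (a simplified sketch of org.janusgraph.hadoop.MapReduceIndexManagement, not the final PR):

    import org.apache.hadoop.conf.Configuration;
    import org.janusgraph.core.JanusGraph;

    public class MapReduceIndexManagement {
        private final Configuration hadoopConf;

        // current behavior: the class instantiates its own Configuration
        public MapReduceIndexManagement(JanusGraph graph) {
            this(graph, new Configuration());
        }

        // proposed overload: callers (e.g. a ToolRunner-based driver) supply a
        // Configuration that already carries the -files entry for the keytab
        public MapReduceIndexManagement(JanusGraph graph, Configuration hadoopConf) {
            this.hadoopConf = hadoopConf;
            // ... existing setup, now using the supplied hadoopConf ...
        }
    }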