HDFS Streaming

DeepGNN on Linux can stream graph data directly from HDFS or ADL into memory. To use this feature, you must have Hadoop installed, set a few environment variables, and pass the streaming options to the graph engine (GE).

Hadoop Download

Pip Install

Follow the Hadoop install guide. Before continuing, verify that the CLI works by running the command the guide provides.

Build from source

If you build DeepGNN from source with bazel, you can use the following target to download HDFS:

bazel test //src/cc/tests:hdfs_tests --config=linux

Environment Variables

export HADOOP_HOME=/path/to/hadoop

If you build from source with bazel, leave this empty and set the value in code instead.

export JAVA_HOME=/path/to/java

Set this only if you build from source or downloaded the Java JDK manually.

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server/"

If CLASSPATH is not already set, it is set automatically, with config_path placed first.
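As a quick sanity check before launching the graph engine, the variables above can be inspected from Python. The following is only a minimal sketch: `check_hdfs_env` is a hypothetical helper, not part of DeepGNN, and it validates only the variables that are actually set, since the notes above allow HADOOP_HOME and JAVA_HOME to stay empty in some setups.

```python
import os


def check_hdfs_env(env=None):
    """Return a list of problems found in the HDFS-related environment variables.

    Hypothetical helper for illustration; it is not part of DeepGNN.
    """
    env = os.environ if env is None else env
    problems = []

    # A variable that is set should point at an existing directory.
    for name in ("HADOOP_HOME", "JAVA_HOME"):
        value = env.get(name, "")
        if value and not os.path.isdir(value):
            problems.append(f"{name} does not point to a directory: {value}")

    # When JAVA_HOME is set, the JVM library directory from the export line
    # above should be listed in LD_LIBRARY_PATH.
    java_home = env.get("JAVA_HOME", "")
    if java_home:
        jvm_dir = os.path.normpath(os.path.join(java_home, "jre/lib/amd64/server"))
        entries = [
            os.path.normpath(p)
            for p in env.get("LD_LIBRARY_PATH", "").split(":")
            if p
        ]
        if jvm_dir not in entries:
            problems.append(f"LD_LIBRARY_PATH does not contain {jvm_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_hdfs_env():
        print("warning:", problem)
```

An empty list means nothing obviously wrong was found; it does not prove that Hadoop itself is working.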

core-site.xml

core-site.xml is the main configuration file for Hadoop. Below are some quick examples that can be copied and pasted.

You can test a core-site.xml file with:

echo 'export HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*' >> etc/hadoop/hadoop-env.sh
sudo bin/hdfs dfs --conf core-site.xml -ls <HDFS_PATH>

ADL Example

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
                <name>fs.adl.impl</name>
                <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
        </property>
        <property>
                <name>fs.adl.oauth2.refresh.url</name>
                <value>https://login.microsoftonline.com/TODO_TENANT_ID/oauth2/token</value>
        </property>
        <property>
                <name>fs.adl.oauth2.access.token.provider.type</name>
                <value>ClientCredential</value>
        </property>
        <property>
                <name>fs.adl.oauth2.client.id</name>
                <value>TODO_CLIENT_ID</value>
        </property>
        <property>
                <name>fs.adl.oauth2.credential</name>
                <value>TODO_PASSWORD</value>
        </property>
        <property>
                <name>io.file.buffer.size</name>
                <value>4194304</value>
        </property>
        <property>
                <name>fs.parallel-copy.use</name>
                <value>true</value>
        </property>
        <property>
                <name>fs.parallel-copy.detect.text</name>
                <value>true</value>
        </property>
        <property>
                <name>fs.parallel-copy.text-file.scope-compatible</name>
                <value>true</value>
        </property>
        <property>
                <name>fs.permissions.umask-mode</name>
                <value>002</value>
        </property>
</configuration>
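The ADL example contains TODO placeholders that must be replaced with real values before the file is usable. A small check like the one below can catch any that were missed. This is a sketch only: `find_unfilled_properties` is a hypothetical helper, and the assumption that every placeholder contains the string TODO comes from the example above.

```python
import xml.etree.ElementTree as ET


def find_unfilled_properties(xml_text, marker="TODO"):
    """Return names of properties whose value still contains the placeholder marker."""
    root = ET.fromstring(xml_text)
    unfilled = []
    for prop in root.iter("property"):
        name = prop.findtext("name", default="")
        value = prop.findtext("value", default="")
        if marker in value:
            unfilled.append(name)
    return unfilled


# Shortened sample with one placeholder left in and one regular property.
sample = """<configuration>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>TODO_CLIENT_ID</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>4194304</value>
  </property>
</configuration>"""

print(find_unfilled_properties(sample))  # prints ['fs.adl.oauth2.client.id']
```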

HDFS Localhost Example

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
        </property>
</configuration>
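If you prefer to generate a file like the localhost example from a script instead of editing it by hand, the standard library is enough. The sketch below is one possible approach, not something DeepGNN provides; `build_core_site` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Build a Hadoop <configuration> document from a name -> value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # ET.indent is available from Python 3.9; older versions emit a single line.
    if hasattr(ET, "indent"):
        ET.indent(root, space="        ")
    return ET.tostring(root, encoding="unicode")


xml_text = build_core_site({"fs.defaultFS": "hdfs://localhost:9000"})
print(xml_text)
```

Write the returned string to core-site.xml and test it with the hdfs command shown earlier.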

Graph Engine Usage

To use this feature, set --data_dir to an hdfs or adl link and add --stream and --config_path path/to/core-site.xml.
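When the graph engine is launched from a script, the three options can be assembled in one place. The helper below is a hypothetical illustration: `streaming_args` is not a DeepGNN function, the sample path is made up, and it assumes --stream is a bare flag that takes no value.

```python
from urllib.parse import urlparse


def streaming_args(data_dir, config_path):
    """Return the graph engine flags for streaming from an hdfs:// or adl:// link."""
    scheme = urlparse(data_dir).scheme
    if scheme not in ("hdfs", "adl"):
        raise ValueError(f"expected an hdfs:// or adl:// link, got: {data_dir}")
    # Assumption: --stream is a switch without an argument.
    return ["--data_dir", data_dir, "--stream", "--config_path", config_path]


args = streaming_args("hdfs://localhost:9000/graph", "path/to/core-site.xml")
print(" ".join(args))
```

Append the resulting list to whatever command you normally use to start the graph engine; the entry point itself is not covered here.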