Rebuilding HDP Hive: patch, test and build


The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity to try to manually rebuild HDP 2.6.0’s Hive version. This article will go through the process of backporting a list of patches provided by Hortonworks on top of the open source Hive project in the most automated way.

Rebuilding HDP 2.6.0’s Hive

Our first step was to see if we were able to build Apache Hive as it is in HDP 2.6.0.

For each HDP release, Hortonworks (now Cloudera) describes the list of patches applied on top of a given component’s release in the release notes. We will retrieve here the list of patches applied on top of Apache Hive 2.1.0 for HDP 2.6.0 and put them in a text file:

cat hdp_2_6_0_0_patches.txt

HIVE-9941
HIVE-12492
HIVE-14214
...
...
HIVE-16319
HIVE-16323
HIVE-16325

We know there are 120 of them:

wc -l hdp_2_6_0_0_patches.txt 
120 hdp_2_6_0_0_patches.txt

Now comes the question: how do we apply those patches?

Applying the patches in the order provided by Hortonworks

The first step is to clone the Apache Hive repository and check out the version embedded in HDP 2.6.0, which is 2.1.0:

git clone https://github.com/apache/hive.git
cd hive
git checkout rel/release-2.1.0

Our first guess was that the patches were listed in HDP’s release notes in the chronological order in which they were merged into Hive’s master branch, so we tried to fetch the patches and apply them in that order.

For each patch in the list, we need to fetch the patch file attached to the associated JIRA and run the git apply command against it. This can be tedious for a large number of patches, so we will see later in this article how to automate it.

Let’s start small by trying to apply the first patch of our list: HIVE-9941. We can see in the associated JIRA issue that there are multiple attachments:

HIVE-9941 attachments

We will pick the most recent one and apply it:

wget https://issues.apache.org/jira/secure/attachment/12833864/HIVE-9941.3.patch

git apply HIVE-9941.3.patch 
HIVE-9941.3.patch:17: new blank line at EOF.
+
HIVE-9941.3.patch:42: new blank line at EOF.
+
HIVE-9941.3.patch:71: new blank line at EOF.
+
HIVE-9941.3.patch:88: new blank line at EOF.
+
warning: 4 lines add whitespace errors.

The git apply command gave us a few warnings about whitespace and blank lines, but the patch applied successfully. This is confirmed with git status:

git status
HEAD detached at rel/release-2.1.0
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	HIVE-9941.3.patch
	ql/src/test/queries/clientnegative/authorization_alter_drop_ptn.q
	ql/src/test/queries/clientnegative/authorization_export_ptn.q
	ql/src/test/queries/clientnegative/authorization_import_ptn.q
	ql/src/test/queries/clientnegative/authorization_truncate_2.q
	ql/src/test/results/clientnegative/authorization_alter_drop_ptn.q.out
	ql/src/test/results/clientnegative/authorization_export_ptn.q.out
	ql/src/test/results/clientnegative/authorization_import_ptn.q.out
	ql/src/test/results/clientnegative/authorization_truncate_2.q.out

nothing added to commit but untracked files present (use "git add" to track)

In this case, the patch only created new files and did not modify any existing ones, but it worked regardless.

Moving on to the second patch of the list, HIVE-12492:

wget https://issues.apache.org/jira/secure/attachment/12854164/HIVE-12492.02.patch

git apply HIVE-12492.02.patch
HIVE-12492.02.patch:362: trailing whitespace.
        Map 1 
[...]
error: src/java/org/apache/hadoop/hive/conf/HiveConf.java: No such file or directory
error: src/test/resources/testconfiguration.properties: No such file or directory
error: src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java: No such file or directory
error: src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java: No such file or directory
error: src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java: No such file or directory

This time, we encountered multiple No such file or directory errors. Tracking down the cause of this error is somewhat tricky. Let’s see why, and why it makes the whole patching process hard to automate.

Once we download and open the “HIVE-12492.02.patch” patch file, we see at the beginning that the “No such file or directory” error applies to the file common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:

diff --git common/src/java/org/apache/hadoop/hive/conf/HiveConf.java common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
index 3777fa9..08422d5 100644
--- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
+++ common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
@@ -1422,6 +1422,11 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "This controls how many partitions can be scanned for each partitioned table.\n" +
[...]

But the error says that src/java/org/apache/hadoop/hive/conf/HiveConf.java does not exist, not common/src/java/org/apache/hadoop/hive/conf/HiveConf.java. Notice how the second path is the first one prefixed with common. This is because git apply has a -p option (described in the git-apply documentation) which removes leading components from file paths. By default it is set to 1 and not 0, removing the first directory, common in our case.

We did not encounter this problem with the previous patch because its diff used the a/ and b/ prefixes in the file paths, for example:

diff --git a/ql/src/test/queries/clientnegative/authorization_alter_drop_ptn.q b/ql/src/test/queries/clientnegative/authorization_alter_drop_ptn.q
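As a side note, a quick heuristic to pick the right -p level is to check whether the diff paths carry the a/ and b/ prefixes that git adds by default; a minimal sketch:

# Apply with the default -p1 when paths are prefixed, with -p0 otherwise
if grep -q '^diff --git a/' HIVE-12492.02.patch; then
  git apply HIVE-12492.02.patch
else
  git apply -p0 HIVE-12492.02.patch
fi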

Let’s try again with -p0:

git apply -p0 HIVE-12492.02.patch

HIVE-12492.02.patch:362: trailing whitespace.
        Map 1 
[...]
error: patch failed: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:1422
error: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java: patch does not apply
error: patch failed: itests/src/test/resources/testconfiguration.properties:501
error: itests/src/test/resources/testconfiguration.properties: patch does not apply
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java:53
error: ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java: patch does not apply
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:51
error: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java: patch does not apply

This is better, but the patch still won’t apply. This time the error patch does not apply clearly indicates that the diff no longer matches the content of the target files.
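Note that git apply --check performs a dry run: it reports whether a patch would apply without touching the working tree, which is convenient when testing many patches in a loop:

# Dry run: report failures without modifying any file
git apply --check -p0 HIVE-12492.02.patch || echo "patch does not apply"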

Looking back at the JIRA list given by Hortonworks, we realized it was actually sorted in alphanumerical order. Take these two issues: HIVE-14405 and HIVE-14432. The first was created before the second, but the patch of the latter was released before that of the former. If these two patches were to modify the same file, applying them in alphanumerical order could lead to the error we just encountered.
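This can be checked against the Hive history itself; for example, the sketch below compares the dates at which the two commits landed on master (git log ORs multiple --grep patterns):

# Compare the dates at which the two patches were merged
git log --date=short --pretty='%ad %s' --grep='HIVE-14405' --grep='HIVE-14432' master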

Applying the patches in the alphanumerical order seems like a bad choice in this case. We will now order the patches in chronological order according to the date when they were committed/applied in the master branch.

Applying the JIRA patches in the commit order

Let’s try to iterate through every patch to see how many patches we can apply by going with this strategy.

Getting and applying every patch file automatically from each JIRA’s attachments can be tedious:

  • There are 120 of them
  • We have just seen that some patches apply at the -p0 level while others apply at -p1
  • Some JIRAs listed by Hortonworks do not have any attachments (ex: HIVE-16238)
  • Some JIRAs have attachments that are not patches (ex: HIVE-16323 has a screenshot attached next to the patches)
  • Some JIRAs provide patches in .txt format instead of .patch (ex: HIVE-15991)

Here is an excerpt (in CoffeeScript) from the script we made to get the latest version of each patch from JIRA:

axios = require 'axios'
fs = require 'fs'
yaml = require 'js-yaml'

generate_patch_manifest = ->
  # Generate a manifest entry (id, description, creation date, url) for each
  # patch in the given list; the callback is async, so we await all promises
  patches_manifest = await Promise.all hdp_2_6_0_0_patches.map (patch) ->
    # Parse the JIRA id
    jira_id = patch.split(/:(.+)/)[0]
    # Parse the commit description
    commit_description = patch.split(/:(.+)/)[1]
    # Get all the attachments
    response = await axios.get "https://issues.apache.org/jira/rest/api/2/search?jql=key=#{jira_id}&fields=attachment"
    # Filter the attachments that are actual patches
    attachments = response.data.issues[0].fields.attachment.filter (o) -> o.filename.includes jira_id
    # Get only the latest patch
    most_recent_created_time = Math.max.apply Math, attachments.map (o) -> Date.parse o.created
    most_recent_created_attachment = attachments.filter (o) -> Date.parse(o.created) is most_recent_created_time
    # Create the patch manifest (implicit return in CoffeeScript)
    id: jira_id
    description: commit_description
    created: Date.parse most_recent_created_attachment[0].created
    url: most_recent_created_attachment[0].content
  # Order the patches by date
  patches_manifest.sort (a, b) -> a.created - b.created
  # Write the object to a YAML file
  fs.writeFile './patch_manifest.yml', yaml.safeDump(patches_manifest), (err) ->
    if err then console.error err else console.info 'done'

This method generates a patch_manifest.yml file that looks like:

- id: $JIRA_NUMBER
  description: $COMMIT_MESSAGE
  created: $TIMESTAMP
  url: $LATEST_PATCH_URL

Creating this manifest will help us apply the patches in the right order.

Once we have built this manifest, we iterate through the list, download and try to apply as many patches as we can. Here is another code excerpt to do that:

# For each patch in the manifest
for patch in patches_manifest
  try
    await nikita.execute """
    # Download the patch file into the patches directory
    wget -q -N -P patches #{patch.url}

    # If the patch has the "a/" leading part in the file paths, apply with -p1
    if grep "git \\\\ba/" patches/#{patch.url.split('/').pop()}; then
      git apply patches/#{patch.url.split('/').pop()}

    # If not, then apply with -p0
    else
      git apply -p0 patches/#{patch.url.split('/').pop()}
    fi

    # Commit changes
    git add -A
    git commit -m "Applied #{patch.url.split('/').pop()}"
    """
    console.info "Patch #{patch.url.split('/').pop()} applied successfully"
  catch err
    console.error "Patch #{patch.url.split('/').pop()} failed to apply"

This did not go as well as expected. Only 19 out of the 120 patches in the list applied successfully. All failures were due to patch does not apply errors, which could be because:

  1. We are still not applying the patches in the right order
  2. We are missing some code prerequisites (i.e. the JIRA list is incomplete)
  3. Some code needs to be modified to be backported
  4. All of the above

In the next part, we will use another strategy by cherry-picking the commits related to the HIVE-XXXXX JIRAs we want to have in our build.

Applying the patches in pushed-to-master chronological order

Every commit message in Apache Hive has the related JIRA issue in its description:

Hive commits

In this attempt, we are going to retrieve the list of all the commits pushed to the Apache Hive master branch in order to generate, in chronological order, the list of commits containing the patches we want to apply.

git checkout master
git log --pretty=format:'%at;%H;%s' > apache_hive_master_commits.csv

The git log command with the right parameters generates a CSV file with the format: timestamp;commit_hash;commit_message.

For example:

head apache_hive_master_commits.csv

1597212601;a430ac31441ad2d6a03dd24e3141f42c79e022f4;HIVE-23995:Don't set location for managed tables in case of replication (Aasha Medhi, reviewed by Pravin Kumar Sinha)
1597206125;ef47b9eb288dec6e6a014dd5e52accdf0ac3771f;HIVE-24030: Upgrade ORC to 1.5.10 (#1393)
1597144690;d4af3840f89408edea886a28ee4ae7d79a6f16f8;HIVE-24014: Need to delete DumpDirectoryCleanerTask (Arko Sharma, reviewed by Aasha Medhi)
1597135126;71c9af7b8ead470520a6c3a4848be9c67eb80f10;HIVE-24001: Don't cache MapWork in tez/ObjectCache during query-based compaction (Karen Coppage, reviewed by Marta Kuczora)
1597108110;af911c9fe0ff3bb49d80f6c0109d30f5e1046849;HIVE-23996: Remove unused line in UDFArgumentException (Guo Philipse reviewed by Peter Vary, Jesus Camacho Rodriguez)
1597107811;17ee33f4f4d8dc9437d1d3a47663a635d2e47b58;HIVE-23997: Some logs in ConstantPropagateProcFactory are not straightforward (Zhihua Deng, reviewed by Jesus Camacho Rodriguez)
1596833809;5d9a5cf5a36c1d704d2671eb57547ea50249f28b;HIVE-24011: Flaky test AsyncResponseHandlerTest ( Mustafa Iman via Ashutosh Chauhan)
1596762366;9fc8da6f32b68006c222ccfd038719fc89ff8550;HIVE-24004: Improve performance for filter hook for superuser path(Sam An, reviewed by Naveen Gangam)
1595466890;7293956fc2f13ccbe72825b67ce1b53dce536359;HIVE-23901: Overhead of Logger in ColumnStatsMerger damage the performance
1593631609;4457c3ec9360650be021ea84ed1d5d0f007d8308;HIVE-22934 Hive server interactive log counters to error stream ( Ramesh Kumar via Ashutosh Chauhan)

These next commands match only the commits we want to apply, using the JIRA ids provided by Hortonworks, and then order the CSV by date from oldest to newest:

cat hdp_2_6_0_0_patches.txt | while read in; do grep -w "$in" apache_hive_master_commits.csv; done > commits_to_apply.csv
sort -n commits_to_apply.csv > sorted_commits_to_apply.csv

Now that we have an ordered list of commits we want to apply we can go back to the 2.1.0 release tag and cherry-pick these commits.

Cherry-picking seems like a better solution than patch application in our situation. Indeed, we have just seen that we could not apply some patches because of missing code dependencies.

Let’s try with the first one: the JIRA is HIVE-14214 and the commit hash is b28ec7fdd8317b47973c6c8f7cdfe805dc20a806:

head -1 sorted_commits_to_apply.csv
1469348583;b28ec7fdd8317b47973c6c8f7cdfe805dc20a806;HIVE-14214: ORC Schema Evolution and Predicate Push Down do not work together (no rows returned) (Matt McCline, reviewed by Prasanth Jayachandran/Owen O'Malley)

Not every commit applies cleanly, though. Here is what happened when cherry-picking HIVE-15084 (commit 15dd7f19bc), another patch from our list:

git cherry-pick -x 15dd7f19bc49ee7017fa5bdd65f9b4ec4dd019d2 --strategy-option theirs
CONFLICT (modify/delete): ql/src/test/results/clientpositive/tez/explainanalyze_5.q.out deleted in HEAD and modified in 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong). Version 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong) of ql/src/test/results/clientpositive/tez/explainanalyze_5.q.out left in tree.
[...]
CONFLICT (modify/delete): ql/src/test/queries/clientpositive/explainanalyze_1.q deleted in HEAD and modified in 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong). Version 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong) of ql/src/test/queries/clientpositive/explainanalyze_1.q left in tree.
error: could not apply 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong)

We removed some of the output, but the takeaway is that we have conflicts. git cherry-pick is trying to git merge a commit from the official Hive repository into our current local branch. The benefit of cherry-picking is that we are comparing two versions of each file instead of just applying a diff, leaving us the choice of how to merge the content.

This is why we pass the --strategy-option theirs parameter to the cherry-pick command: it automatically resolves most conflicts in favor of the incoming changes. Still, we can encounter conflicts like:

Unmerged paths:
  (use "git add/rm ..." as appropriate to mark resolution)
	deleted by us:   llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java
	added by them:   llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/VectorDeserializeOrcWriter.java

The conflict resolution must be done manually with git add or git rm depending on the situation.
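For instance, a minimal resolution sketch for the conflict above, assuming we want to keep the incoming version of both files (consistent with --strategy-option theirs):

# Stage the file added by the incoming commit and the one it left in tree
git add llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/VectorDeserializeOrcWriter.java
git add llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java
# Resume the cherry-pick without opening an editor
git -c core.editor=true cherry-pick --continue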

Using the cherry-picking method and manual resolutions for some patches, we were able to import all 120 JIRA patches into our code branch.

Important note: we tried to automate this as much as possible, but keep in mind that running the unit tests after patching is strongly recommended before building and deploying the result to any production cluster.
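For reference, a single Hive test can be run per module with Maven; a minimal sketch, assuming the qtest conventions from Hive’s developer documentation (test class and qfile names are illustrative):

# Run one negative CLI driver qtest from the itests tree
cd itests/qtest
mvn test -Dtest=TestNegativeCliDriver -Dqfile=authorization_export_ptn.q

That said, here is the cherry-picking loop we ended up with: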

git checkout rel/release-2.1.0

while read line; do
  commit=$(echo $line | awk -F "\"*;\"*" '{print $2}')
  if git cherry-pick -x $commit --strategy-option theirs > patches/logs/cherry_pick_$commit.out 2>&1; then
    echo "Cherry picking $commit SUCCEEDED"
  else
    # Log the conflict, stage the paths listed as "deleted by us"/"added by
    # them", then resume the cherry-pick without opening an editor
    git status >> patches/logs/cherry_pick_$commit.out 2>&1
    git status | sed -n 's/.* by .*://p' | xargs git add
    git -c core.editor=true cherry-pick --continue
    echo "Cherry picking $commit SUCCEEDED with complications"
  fi
done < sorted_commits_to_apply.csv

With the script above, we are able to apply 100% of the patches. In the next part, we will go over the steps to build and test the Hive distribution.

Compilation

Before running the unit tests, let’s try to see if we can actually build the Hive release:

mvn clean package -Pdist -DskipTests

After a few minutes, the build stops with the following error:

[ERROR] /home/leo/Apache/hive/storage-api/src/test/org/apache/hadoop/hive/ql/exec/vector/TestStructColumnVector.java:[106,5] cannot find symbol
  symbol:   class Timestamp
  location: class org.apache.hadoop.hive.ql.exec.vector.TestStructColumnVector

Looking at the file ./storage-api/src/test/org/apache/hadoop/hive/ql/exec/vector/TestStructColumnVector.java, the class java.sql.Timestamp is indeed not imported: the test uses Timestamp, but the file lacks the corresponding import java.sql.Timestamp; statement.

This file was modified when cherry-picking HIVE-16245. This patch changes 3 files:

  • ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
  • storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch.java
  • storage-api/src/test/org/apache/hadoop/hive/ql/exec/vector/TestStructColumnVector.java

How did Hortonworks apply this specific patch without breaking the build? In the next part, we will dig into the Hortonworks’ public Hive repository to find clues on how they applied the patches listed for HDP 2.6.0.0.

Exploring Hortonworks’ hive2-release project

On GitHub, Hortonworks keeps a public repository for every component in the HDP stack. The one we want to check is hive2-release (https://github.com/hortonworks/hive2-release):

git clone https://github.com/hortonworks/hive2-release.git
cd hive2-release && ls

README.md

The repository seems empty at first but the files can actually be found by checking out the tags. There is one tag per HDP release.
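The available tags can be listed with git tag; for example, to see the ones matching our release line (the pattern is illustrative):

git tag -l 'HDP-2.6.0*'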

git checkout HDP-2.6.0.3-8-tag

Let’s try to find out how Hortonworks applied the patch that caused the build error we saw earlier.

git log --pretty=format:'%at;%H;%s' | grep HIVE-16245
1489790314;2a9150604edfbc2f44eb57104b4be6ac3ed084c7;BUG-77383 backport HIVE-16245: Vectorization: Does not handle non-column key expressions in MERGEPARTIAL mode

Now, let’s see which files were modified by this commit:

git diff 2a9150604edfbc2f44eb57104b4be6ac3ed084c7^ | grep diff

diff --git a/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java b/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
diff --git a/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch.java b/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch.java

As opposed to what’s in the patch (and was pushed to Apache Hive’s master branch), the backport operated by Hortonworks only changes 2 of the 3 files: the file that caused the build error we encountered earlier is not touched here. Either some black magic is involved, or the process is manual and not fully documented.
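If we wanted to reproduce such a selective backport ourselves, git apply can exclude files matching a path pattern; a sketch, with a hypothetical patch file name:

# Apply HIVE-16245 while skipping the test file that breaks the build
git apply --exclude='storage-api/src/test/*' HIVE-16245.3.patch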

Conclusion

Backporting a list of patches to an open source project like Apache Hive is hard to automate. With the HDP distribution, and depending on the patches, Hortonworks either applied manual changes to the source code or its engineers were working on a forked version of Hive. A good understanding of the overall project structure (modules, unit tests, etc.) and its ecosystem (dependencies) is required to build a functional and tested release.
