Reading and Deserializing Hadoop Binary Files


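I'm 年轻微笑, a blogger on 靠谱客. This article covers reading and deserializing Hadoop binary files; I'm sharing it here in the hope that it serves as a useful reference.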

Problem description
Deserialization code

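Problem description

When Hadoop runs MapReduce (MR) jobs, it often has to store intermediate results on local disk. To save storage space, Hadoop uses its own serialization mechanism (which differs from Java's) and stores the data as binary files. If you need to inspect those intermediate result files while debugging, you have to deserialize the binary files back into readable text. This article only shows the deserialization code and does not analyze the underlying principles.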
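To give a rough feel for why a dedicated serialization format saves space, here is a minimal sketch that uses only the JDK, not Hadoop itself. It compares Java's built-in object serialization of one integer with writing just the raw value through a DataOutputStream, which is what Hadoop's IntWritable does in its write method. The class name SerializedSizeDemo and its helper methods are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: compares two ways of serializing a single int.
class SerializedSizeDemo {

    // Number of bytes produced by Java's built-in object serialization,
    // which also writes a stream header and class metadata.
    static int javaSerializedSize(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Number of bytes produced when only the raw value is written,
    // in the style of a Writable that calls out.writeInt(value).
    static int rawSize(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        int value = 2487348;
        System.out.println("Java serialization: " + javaSerializedSize(value) + " bytes");
        System.out.println("Raw writeInt:       " + rawSize(value) + " bytes");
    }
}
```

The raw form is always 4 bytes for an int. The trade-off is that the raw bytes carry no type information, so the reader has to know what to expect.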

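Deserialization code

Because Hadoop uses its own serialization mechanism, you need to add the jars under hadoop/share/hadoop/common to the classpath before writing the deserialization code. The examples below also import Mahout classes (org.apache.mahout.*), so the Mahout jars must be on the classpath as well.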

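1. Deserializing a single-value data file

When the serialized file contains data of only one type, use the following code.

Note: you need to know which data type was serialized into the file before you can deserialize it.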


package readHadoopFile;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.Vectors;

public class transformFile {
    public static void main(String[] args) throws IllegalArgumentException, IOException {
        Configuration conf = new Configuration();

        // Paths use forward slashes: sequences such as "\t" and "\n" are escapes
        // in a Java string literal, and "\p" or "\U" do not compile at all.
        // "~" is kept from the original; Hadoop's Path does not expand it,
        // so substitute a real directory.

        // numUsers.bin contains a single int value
        String path = "~/temp/preparePreferenceMatrix/numUsers.bin";
        int num = HadoopUtil.readInt(new Path(path), conf);
        System.out.println(num); // 2487348

        // maxValues.bin contains a vector
        path = "C:/Users/User/Desktop/推荐算法/分布式推荐/temp/maxValues.bin";
        Vector maxValues = Vectors.read(new Path(path), conf);
        System.out.println(maxValues);
    }
}
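The note about knowing the data type can be illustrated without Hadoop. The sketch below assumes that a file like numUsers.bin holds nothing but a 4-byte big-endian int, as DataOutput.writeInt would produce; that is an assumption about how the file was written, not something this article verifies. It writes such a file with plain JDK streams and reads it back. SingleIntFileDemo and its methods are hypothetical names used only for illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: a file whose whole content is one raw int.
class SingleIntFileDemo {

    // Writes the int as 4 big-endian bytes, with no type tag or header.
    static void writeInt(Path file, int value) {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the 4 bytes back as an int; this only works because the
    // caller already knows that the file contains an int.
    static int readInt(Path file) {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the value to a temporary file, reads it back, and cleans up.
    static int roundTrip(int value) {
        try {
            Path file = Files.createTempFile("numUsers", ".bin");
            try {
                writeInt(file, value);
                return readInt(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(2487348)); // prints 2487348
    }
}
```

Nothing in such a file says "this is an int", which is why the matching read call has to be chosen by the person writing the deserialization code.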


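2. Deserializing a Key-Value data file

When the data in the serialized file consists of Key-Value pairs, you do not need to know the serialized data types in advance, because a SequenceFile records the key and value class names in its header and the reader can instantiate them by reflection. The code is as follows.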


package readHadoopFile;

import java.io.FileWriter;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class transformFile {
    public static void main(String[] args) throws IllegalArgumentException, IOException {
        Configuration conf = new Configuration();

        // Serialized file produced by Hadoop. "~" is kept from the original;
        // neither Hadoop's Path nor java.io expands it, so substitute a real directory.
        String path = "~/temp/partialMultiply2";

        // try-with-resources closes the reader and the writer even if an error occurs
        try (Reader reader = new SequenceFile.Reader(conf, Reader.file(new Path(path)));
             FileWriter fw = new FileWriter(path + ".trans")) {
            // The key and value classes come from the file header
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // Print the types found in the header (the key itself is still empty at this point)
            System.out.println(reader.getKeyClassName() + " -> " + reader.getValueClassName());

            // Write each deserialized pair to another file as key=value, one per line
            while (reader.next(key, value)) {
                fw.write(key.toString() + "=" + value.toString() + "\n");
            }
        }
    }
}
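The reason the Key-Value case needs no prior knowledge of the types is that the file describes itself. The following toy sketch shows that idea with JDK classes only. It is not the real SequenceFile format: it simply writes the key and value class names first, then one record, and the reader uses those names to decide how to decode the bytes. In Hadoop each Writable class decodes itself through readFields; the small type switch here stands in for that. SelfDescribingFileDemo and all of its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class: a toy "header plus record" layout, NOT the SequenceFile format.
class SelfDescribingFileDemo {

    // Encodes one field; only Integer and String are supported in this toy.
    static void writeField(Object field, DataOutputStream out) throws IOException {
        if (field instanceof Integer) {
            out.writeInt((Integer) field);
        } else if (field instanceof String) {
            out.writeUTF((String) field);
        } else {
            throw new IOException("unsupported type " + field.getClass().getName());
        }
    }

    // Decodes one field according to the class named in the header.
    static Object readField(Class<?> type, DataInputStream in) throws IOException {
        if (type == Integer.class) {
            return in.readInt();
        }
        if (type == String.class) {
            return in.readUTF();
        }
        throw new IOException("unsupported type " + type.getName());
    }

    // Writes the header (key and value class names) followed by one record.
    static byte[] write(Object key, Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(key.getClass().getName());
            out.writeUTF(value.getClass().getName());
            writeField(key, out);
            writeField(value, out);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the header first, then decodes the record without being told its types.
    static String readRecord(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Class<?> keyClass = Class.forName(in.readUTF());
            Class<?> valueClass = Class.forName(in.readUTF());
            Object key = readField(keyClass, in);
            Object value = readField(valueClass, in);
            return key + "=" + value;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readRecord(write(7, "seven"))); // prints 7=seven
    }
}
```

As with the real reader, this only works when the classes named in the header are available on the classpath of the program doing the reading.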
