1 Introduction
Great success has been achieved in obtaining powerful discriminative classifiers via supervised training, such as decision trees C4.5 , support vector machines vapnik1995nature , neural networks CNN , boosting AdaBoost , and random forests breiman2001random . However, recent studies reveal that even modern classifiers like deep convolutional neural networks krizhevsky2012imagenet still make mistakes that look absurd to humans goodfellow2014explaining . A common way to improve classification performance is to train the classifier with more data, in particular "hard examples". Different types of approaches have been proposed in the past, including bootstrapping mooney1993bootstrapping , active learning settles2010active , semi-supervised learning zhu2005semi , and data augmentation krizhevsky2012imagenet . However, the approaches above utilize data samples that are either already present in the given training set or additionally created by humans or separate algorithms.

In this paper, we focus on improving convolutional neural networks by endowing them with synthesis capabilities that make them internally generative. In the past, attempts have been made to build connections between generative models and discriminative classifiers friedman2001elements ; liang2008asymptotic ; tu2008brain ; jebara2012machine . In welling2002self , a self-supervised boosting algorithm was proposed that sequentially learns weak classifiers using the given data and self-generated negative samples; the generative-via-discriminative learning work in tu2007learning generalizes this concept, showing that unsupervised generative modeling can be accomplished by learning a sequence of discriminative classifiers via self-generated pseudo-negatives. Inspired by welling2002self ; tu2007learning , in which self-generated samples are utilized, as well as by recent successes in deep learning krizhevsky2012imagenet ; gatys2015neural , we propose an introspective convolutional network (ICN) classifier and study how its internal generative aspect can benefit the CNN's discriminative classification task. A recent line of work uses a discriminator to help an external generator, namely generative adversarial networks (GAN) goodfellow2014generative , which is different from our objective here: we aim at building a single CNN model that is simultaneously discriminative and generative.

The introspective convolutional networks (ICN) introduced here have a number of properties. (1) We introduce introspection to convolutional neural networks and show its significance in supervised classification. (2) A reclassification-by-synthesis algorithm is devised to train ICN by iteratively augmenting the negative samples and updating the classifier. (3) A stochastic gradient descent sampling process is adopted to perform efficient synthesis for ICN. (4) We propose a supervised formulation to directly train a multi-class ICN classifier. We show consistent improvement over state-of-the-art CNN classifiers (ResNet he2016deep ) on benchmark datasets in the experiments.

2 Related work
Our ICN method is directly related to the generative-via-discriminative learning framework tu2007learning . It is also connected to the self-supervised boosting method welling2002self , which focuses on density estimation by combining weak classifiers. Previous algorithms connecting generative modeling with discriminative classification friedman2001elements ; liang2008asymptotic ; tu2008brain ; jebara2012machine fall into the category of hybrid models that are direct combinations of the two. Some existing works on introspective learning leake2012introspective ; brock2016neural ; sinha2017introspection have a different scope from the problem tackled here. Other generative modeling schemes, such as MiniMax entropy zhu1997minimax , inducing features della1997inducing , auto-encoders baldi2012autoencoders , and recent CNN-based generative modeling approaches xie2016theory ; xie2016cooperative , are not intended for discriminative classification, and they do not provide a single model that is both generative and discriminative. Below we discuss the two methods most related to ICN, namely generative-via-discriminative learning (GDL) tu2007learning and generative adversarial networks (GAN) goodfellow2014generative .

Relationship with generative via discriminative learning (GDL) tu2007learning
ICN is largely inspired by GDL and follows a similar pipeline to that developed in tu2007learning . However, ICN also improves substantially over GDL, as summarized below.

CNN vs. Boosting. ICN builds on top of convolutional neural networks (CNN), explicitly revealing the introspectiveness of CNN, whereas GDL adopts the boosting algorithm AdaBoost .

Supervised classification vs. unsupervised modeling. ICN focuses on the supervised classification task with competitive results on benchmark datasets whereas GDL was originally applied to generative modeling and its power for the classification task itself was not addressed.

SGD sampling vs. Gibbs sampling. ICN performs efficient SGD sampling for synthesis through backpropagation, which is much more efficient than the Gibbs sampling strategy used in GDL.

Single CNN vs. Cascade of classifiers. ICN maintains a single CNN classifier whereas GDL consists of a sequence of boosting classifiers.

Automatic feature learning vs. manually specified features. ICN has greater representational power due to the end-to-end training of CNN, whereas GDL relies on manually designed features.
Comparison with Generative Adversarial Networks (GANs) goodfellow2014generative
Recent efforts in adversarial learning goodfellow2014generative are also very interesting and worth comparing with our approach.

Introspective vs. adversarial. ICN emphasizes being introspective by synthesizing samples from its own classifier, while GAN is adversarial, using a distinct discriminator to guide the generator.

Supervised classification vs. unsupervised modeling. The main focus of ICN is to develop a classifier with introspection to improve the supervised classification task, whereas GAN mostly builds high-quality generative models under unsupervised learning.

Single model vs. two separate models. ICN retains a CNN discriminator that is itself a generator whereas GAN maintains two models, a generator and a discriminator, with the discriminator in GAN trained to classify between “real” (given) and “fake” (generated by the generator) samples.

Reclassification-by-synthesis vs. minimax. ICN engages an iterative procedure, reclassification-by-synthesis, stemming from Bayes' theorem, whereas GAN optimizes a minimax objective function. Training an ICN classifier is the same as training a standard CNN.

Multi-class formulation. In a GAN-family work salimans2016improved , a semi-supervised learning task is devised by adding an additional "not-real" class to the standard k classes in multi-class classification; this results in a setting different from standard multi-class classification, with additional model parameters. ICN, instead, aims directly at the supervised multi-class classification task by maintaining the same parameter setting within the softmax function, without additional model parameters.
Later developments alongside GAN radford2015unsupervised ; salimans2016improved ; zhao2016energy ; brock2016neural share some aspects with GAN but also do not achieve the same goal as ICN. Since the discriminator in GAN is not meant to perform the generic two-class/multi-class classification task, special settings for semi-supervised learning goodfellow2014generative ; radford2015unsupervised ; zhao2016energy ; brock2016neural ; salimans2016improved were created. ICN instead has a single model that is both generative and discriminative; thus, an improvement to ICN's generator leads to a direct means of ameliorating its discriminator. Other work like goodfellow2014explaining was motivated by the observation that adding small perturbations to an image leads to classification errors that are absurd to humans; their approach, however, augments positive samples from existing input, whereas ICN is able to synthesize new samples from scratch. A recent work Lazarow2015intro is in the same family as ICN, but it focuses on unsupervised image modeling using a cascade of CNNs.
3 Method
The pipeline of ICN is shown in Figure 1. ICN offers an immediate improvement over GDL tu2007learning in the several aspects described in the previous section. One particular gain of ICN is its representational power and its efficient sampling process through backpropagation as a variational sampling strategy.
3.1 Formulation
We start the discussion by introducing the basic formulation, borrowing notation from tu2007learning . Let $x$ be a data sample (vector) and $y \in \{-1, +1\}$ its label, indicating either a negative or a positive sample (in multi-class classification, $y \in \{1, \dots, K\}$). We study binary classification first. A discriminative classifier computes $p(y|x)$, the probability of $x$ being positive or negative. A generative model instead models $p(x|y)$, which captures the underlying generation process of $x$ for class $y$. In binary classification, the positive samples are of primary interest. Under the Bayes rule:

(1) $p(x|y=+1) = \frac{p(y=+1|x)\, p(y=-1)}{p(y=-1|x)\, p(y=+1)}\, p(x|y=-1)$,

which can be further simplified when assuming equal priors $p(y=+1) = p(y=-1)$:

(2) $p(x|y=+1) = \frac{p(y=+1|x)}{p(y=-1|x)}\, p(x|y=-1)$.

We make two interesting and important observations from Eqn. (2): 1) the generative model for the positives $p(x|y=+1)$ depends on the faithfulness of $p(x|y=-1)$, and 2) a classifier reporting $p(y|x)$ can be made simultaneously generative and discriminative. However, there is a requirement: the negative distribution $p(x|y=-1)$ must be informative, such that samples drawn from it have good coverage of the entire space of $x$, especially for samples close to the positives, to allow the classifier to faithfully learn $p(y|x)$. There seems to be a dilemma: in supervised learning we are given only a limited amount of training data, and a classifier focused on the decision boundary separating the given samples may not classify unseen data accurately. This can be seen in the top-left plot of Figure 1. This motivates us to implement the synthesis part within learning: make the learned discriminative classifier generate samples that pass its own classification, and see how different these generated samples are from the given positive samples. This allows us to attain a single model with two aspects at the same time: a generative model for the positive samples and an improved classifier for the classification.
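As a concrete sanity check of Eqn. (2), the sketch below (our own toy construction; the 1-D grid, the Gaussian densities, and the Bayes-optimal classifier are illustrative assumptions, not the paper's experiments) recovers the positive density from a discriminative classifier and a reference negative density:

```python
import numpy as np

# Recover p(x|y=+1) from a discriminative probability q(y=+1|x) and a known
# negative density p(x|y=-1) on a discrete 1-D grid, via Eqn. (2):
# p(x|y=+1) ∝ [q(y=+1|x) / q(y=-1|x)] * p(x|y=-1).

grid = np.linspace(-4.0, 4.0, 401)
dx = grid[1] - grid[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_pos = gauss(grid, 1.0, 0.7)   # true positive density (hidden from the classifier)
p_neg = gauss(grid, -1.0, 1.0)  # reference negative density p(x|y=-1)

# The Bayes-optimal classifier under equal priors:
q_pos = p_pos / (p_pos + p_neg)  # q(y=+1|x)
q_neg = 1.0 - q_pos

# Reconstruct the positive density from the classifier and the negatives.
recon = (q_pos / q_neg) * p_neg
recon /= recon.sum() * dx        # renormalize on the grid

err = np.abs(recon - p_pos).max()
print(err)
```

With a Bayes-optimal classifier the reconstruction is exact up to normalization; with a learned, imperfect classifier the recovered density is only as faithful as the classifier itself, which is exactly what motivates the iterative scheme that follows.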
Suppose we are given a training set $S = S_+ \cup S_-$, with $S_+ = \{(x_i, +1)\}$ and $S_- = \{(x_i, -1)\}$. One can directly train a discriminative classifier, e.g. a convolutional neural network CNN , to learn $p(y|x)$, which is always an approximation due to various reasons, including insufficient training samples, generalization error, and classifier limitations. Previous attempts to improve classification by data augmentation mostly added more positive samples krizhevsky2012imagenet ; goodfellow2014explaining ; we instead argue for the importance of adding more negative samples to improve the classification performance. The dilemma is that $S_-$ is limited to the given data. For clarity, we now use $p^-(x)$ to represent $p(x|y=-1)$. Our goal is to augment the negative training set by generating confusing pseudo-negatives to improve the classification. (Note that in the end the pseudo-negative samples drawn will become hard to distinguish from the given positive samples; cross-validation can be used to determine when using more pseudo-negatives no longer reduces the validation error.) We call the samples drawn from $p^-(x)$ pseudo-negatives (defined in tu2007learning ). We expand the negative set at time $t$ to $S_-^t = S_- \cup \mathcal{T}_t$, where $\mathcal{T}_t$ includes all the pseudo-negative samples self-generated from our model up to time $t$, and $l$ indicates the number of pseudo-negatives generated at each round. We define a reference distribution $p_0^-(x)$, where $p_0^-$ is a Gaussian distribution (e.g. each dimension drawn independently). We carry out learning for $t = 0, \dots, T$ to iteratively obtain classifiers $q_t(y|x)$ by updating the classifier on $S_+ \cup S_-^t$. The initial classifier, trained on $S_+ \cup S_-$, reports a discriminative probability $q_0(y|x)$. The reason for starting from $p_0^-$ is that it serves as an approximation to the true $p(x|y=-1)$, given the limited samples in $S_-$. At each time $t$, we then compute

(3) $p_t^-(x) = \frac{1}{Z_t} \frac{q_t(y=+1|x)}{q_t(y=-1|x)}\, p_{t-1}^-(x)$,

where $Z_t = \int \frac{q_t(y=+1|x)}{q_t(y=-1|x)}\, p_{t-1}^-(x)\, dx$. We draw new samples to expand the pseudo-negative set:

(4) $\mathcal{T}_{t+1} = \mathcal{T}_t \cup \{(x_i, -1),\, i = 1, \dots, l\}$, where $x_i \sim p_t^-(x)$.
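The iterative reweighting of Eqn. (3) can be simulated on a discrete grid. In the toy below (our construction; the tempering exponent models an imperfect classifier and is an assumption, not part of the paper), the pseudo-negative density drifts toward the positives over rounds:

```python
import numpy as np

# Toy sketch of the Eqn. (3) loop on a discrete 1-D grid: the pseudo-negative
# density is reweighted each round by the classifier's likelihood ratio and
# renormalized. A weak classifier is modeled by tempering the ratio
# (exponent 0.5); a Bayes-optimal classifier would land on the positives in
# a single round.

grid = np.linspace(-4.0, 4.0, 401)
dx = grid[1] - grid[0]
gauss = lambda x, mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p_pos = gauss(grid, 1.0, 0.7)  # positives p(x|y=+1)
p_t = gauss(grid, 0.0, 2.0)    # reference distribution p_0^-(x)

def tv(p, q):                  # total-variation distance on the grid
    return 0.5 * (np.abs(p - q).sum() * dx)

d0 = tv(p_t, p_pos)
for t in range(5):
    ratio = (p_pos / p_t) ** 0.5  # tempered q_t(+|x)/q_t(-|x): a weak classifier
    p_t = ratio * p_t
    p_t /= p_t.sum() * dx          # the partition function Z_t
d5 = tv(p_t, p_pos)
print(d0, d5)  # the pseudo-negative density approaches the positives
```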
We name the specific training algorithm for our introspective convolutional network (ICN) classifier reclassification-by-synthesis; it is described in Algorithm 1. We adopt a convolutional neural network (CNN) classifier to build an end-to-end learning framework with an efficient sampling process (to be discussed in the next section).
3.2 Reclassification-by-synthesis
We present our reclassification-by-synthesis algorithm for ICN in this section. A schematic illustration is shown in Figure 1. A single CNN classifier is trained progressively, serving simultaneously as a discriminator and a generator. As pseudo-negatives are gradually generated, the classification boundary is tightened, yielding an improvement in the classifier's performance. The reclassification-by-synthesis method is described in Algorithm 1. The algorithm has two key steps, (1) a reclassification-step and (2) a synthesis-step, which are discussed in detail below.
3.2.1 Reclassification-step
The reclassification-step can be viewed as training a normal classifier on the training set $S_+ \cup S_-^t$, where $S_-^t = S_- \cup \mathcal{T}_t$ collects the given negatives and the pseudo-negatives generated up to time $t$. We use a CNN as our base classifier. When training a classifier on $S_+ \cup S_-^t$, we denote the parameters to be learned by a high-dimensional vector $\mathcal{W}_t = (w_t^{(0)}, w_t^{(1)})$, which might consist of millions of parameters; $w_t^{(1)}$ denotes the weights of the top layer combining the features, and $w_t^{(0)}$ carries all the internal representations. Without loss of generality, we assume a sigmoid function for the discriminative probability

$q_t(y=+1|x; \mathcal{W}_t) = \frac{1}{1 + \exp\{-w_t^{(1)} \cdot \phi(x; w_t^{(0)})\}}$,

where $\phi(x; w_t^{(0)})$ defines the feature extraction function for $x$. Both $w_t^{(0)}$ and $w_t^{(1)}$ can be learned by the standard stochastic gradient descent algorithm via backpropagation to minimize a cross-entropy loss with an additional term on the pseudo-negatives:

(5) $\mathcal{L}(\mathcal{W}_t) = -\sum_{(x_i, y_i) \in S_+ \cup S_-} \ln q_t(y_i|x_i; \mathcal{W}_t) - \sum_{(x_i, -1) \in \mathcal{T}_t} \ln q_t(y=-1|x_i; \mathcal{W}_t)$.
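A minimal sketch of the reclassification-step objective in Eqn. (5), with a linear model standing in for the CNN feature extractor $\phi(\cdot; w^{(0)})$; the data, shapes, and training loop are our own illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def icn_loss_grad(w, X_pos, X_neg, X_pn):
    """Cross-entropy on the given positives/negatives plus the extra term
    that labels self-generated pseudo-negatives X_pn as negative.
    Returns (loss, gradient) for plain gradient descent."""
    def part(X, y):  # y in {0, 1}
        p = sigmoid(X @ w)
        loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean()
        grad = X.T @ (p - y) / len(X)
        return loss, grad
    lp, gp = part(X_pos, 1.0)
    ln_, gn = part(X_neg, 0.0)
    lq, gq = part(X_pn, 0.0)  # pseudo-negatives are labeled negative
    return lp + ln_ + lq, gp + gn + gq

def with_bias(X):  # append a bias feature
    return np.hstack([X, np.ones((len(X), 1))])

X_pos = with_bias(rng.normal(+1.0, 0.5, size=(64, 2)))
X_neg = with_bias(rng.normal(-1.0, 0.5, size=(64, 2)))
X_pn = with_bias(rng.normal(+0.3, 0.5, size=(64, 2)))  # confusing fakes near the positives

w = np.zeros(3)
for _ in range(500):
    loss, g = icn_loss_grad(w, X_pos, X_neg, X_pn)
    w -= 0.5 * g

acc_pos = (sigmoid(X_pos @ w) > 0.5).mean()
acc_pn = (sigmoid(X_pn @ w) < 0.5).mean()
print(acc_pos, acc_pn)
```

The pseudo-negative term pulls the decision boundary between the positives and the confusing fakes, rather than leaving it at the (looser) boundary induced by the far-away given negatives alone.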
3.2.2 Synthesis-step
In the reclassification-step, we obtain $q_t(y|x; \mathcal{W}_t)$, which is then used to update $p_t^-(x)$ according to Eqn. (3):

(6) $p_t^-(x) = \frac{1}{Z_t} \frac{q_t(y=+1|x; \mathcal{W}_t)}{q_t(y=-1|x; \mathcal{W}_t)}\, p_{t-1}^-(x)$.
In the synthesis-step, our goal is to draw fair samples from $p_t^-(x)$ (fair samples refer to typical samples produced by a sampling process after convergence w.r.t. the target distribution). In tu2007learning , various Markov chain Monte Carlo techniques Jliu , including Gibbs sampling and Iterated Conditional Modes (ICM), have been adopted, which are often slow. Motivated by the DeepDream code mordvintsev2016deepdream and the Neural Artistic Style work gatys2015neural , we update a random sample drawn from $p_0^-(x)$ by increasing its probability under $p_t^-(x)$ using backpropagation. Note that the partition function (normalization) $Z_t$ is a constant that does not depend on the sample $x$. Let

(7) $g_t(x) = \frac{q_t(y=+1|x; \mathcal{W}_t)}{q_t(y=-1|x; \mathcal{W}_t)} = \exp\{w_t^{(1)} \cdot \phi(x; w_t^{(0)})\}$,

and take its natural logarithm, which is nicely turned into the logit of $q_t(y=+1|x; \mathcal{W}_t)$:

(8) $\ln g_t(x) = w_t^{(1)} \cdot \phi(x; w_t^{(0)})$.

Starting from $x$ drawn from $p_0^-(x)$, we directly increase $\ln g_t(x)$ using stochastic gradient ascent on $x$ via backpropagation, which allows us to obtain fair samples subject to Eqn. (6). Gaussian noise can be added to Eqn. (8) along the lines of stochastic gradient Langevin dynamics welling2011bayesian as

$\Delta x = \frac{\epsilon}{2} \nabla_x \big( w_t^{(1)} \cdot \phi(x; w_t^{(0)}) \big) + \eta$,

where $\eta \sim N(0, \epsilon)$ is Gaussian noise and $\epsilon$ is the step size, which is annealed during the sampling process.
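The synthesis step can be sketched as gradient ascent on the logit with annealed Langevin noise. Below, a fixed concave function stands in for the learned logit $w^{(1)} \cdot \phi(x; w^{(0)})$; all names and constants are toy assumptions, not the paper's setup:

```python
import numpy as np

# Starting from a reference (noise) sample, run gradient ascent on the
# positive logit with annealed Langevin noise (Eqn. (8) plus
# welling2011bayesian). Here ln g(x) = -(x - mu)^2, a hand-written stand-in
# for a learned CNN logit, which peaks at x = mu.

rng = np.random.default_rng(1)
mu = np.array([1.0, -0.5])  # toy location of the logit's mode

def grad_logit(x):
    return -2.0 * (x - mu)  # gradient of ln g(x) = -(x - mu)^2

x = rng.normal(0.0, 3.0, size=2)  # draw from the reference distribution p_0
step, temp = 0.1, 1.0
for t in range(300):
    noise = rng.normal(0.0, np.sqrt(step * temp), size=2)
    x = x + 0.5 * step * grad_logit(x) + noise
    temp *= 0.98  # anneal the noise temperature

print(x)  # settles near the mode of the synthesized density
```

With the noise held at full temperature this simulates Langevin sampling from $\exp\{\ln g(x)\}$; annealing the temperature moves the sample toward a mode, mirroring the early-stopping options discussed next.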
Sampling strategies. In our experiments, we carry out several strategies using the stochastic gradient descent algorithm (SGD) and SGD Langevin, including: i) early-stopping the sampling process once the sample becomes classified as positive (aligned with contrastive divergence carreira2005contrastive , where a short Markov chain is simulated); ii) stopping at a large confidence of being positive; and iii) sampling for a fixed, large number of steps. Table 2 shows the results of these different options; no major differences in classification performance are observed.

Building connections between SGD and MCMC is an active area in machine learning welling2011bayesian ; chen2014stochastic ; mandt2017stochastic . In welling2011bayesian , combining SGD with additional Gaussian noise under an annealed step size results in a simulation of Langevin dynamics MCMC. A recent work mandt2017stochastic further shows the similarity between constant SGD and MCMC, along with an analysis of SGD using momentum updates. Our progressively learned discriminative classifier can be viewed as carving out the feature space of $\phi(x)$, which essentially becomes an equivalence class for the positives; the volume of the equivalence class satisfying the condition is exponentially large, as analyzed in wu2000equivalence . This probability landscape of the positives (equivalence class) means our SGD sampling process is not particularly biased toward a small set of modes. The results in Figure 2 illustrate the large variation of the sampled/synthesized examples.

3.3 Analysis
The convergence of $p_t^-(x)$ toward $p(x|y=+1)$ can be derived (see the supplementary material), inspired by the proof in tu2007learning :

$KL\big[p(x|y=+1)\,\|\,p_{t+1}^-(x)\big] \le KL\big[p(x|y=+1)\,\|\,p_t^-(x)\big]$,

where $KL$ denotes the Kullback-Leibler divergence, under the assumption that the classifier at time $t+1$ improves over that at time $t$.

Remark. Here we pay particular attention to the negative samples, which live in a space that is often much larger than the positive sample space. The given negative training samples follow a distribution over the negative examples in the original training set. Our reclassification-by-synthesis algorithm (Algorithm 1) essentially constructs a mixture model by sequentially generating pseudo-negative samples to augment our training set: the distribution of the augmented negative set becomes a mixture of the distribution over the given negatives and the pseudo-negative distributions, which encode samples that are confusing and similar to (but are not) the positives. In the end, adding pseudo-negatives might degrade the classification result, since they become more and more similar to the positives; cross-validation can be used to decide when adding more pseudo-negatives no longer helps the classification task. How to better use pseudo-negative samples that are increasingly faithful to the positives is an interesting topic worth further exploration. Our overall algorithm is thus capable of enhancing classification by self-generating confusing samples to improve the CNN's robustness.
3.4 Multi-class classification
One-vs-all. The above section discussed the binary classification case. When dealing with multi-class classification problems, such as MNIST and CIFAR-10, we need to adapt the proposed reclassification-by-synthesis scheme to the multi-class case. This can be done directly using a one-vs-all strategy: train a binary classifier using the $k$th class as the positive class and combine the remaining classes into the negative class, resulting in a total of $K$ binary classifiers. The training procedure then becomes identical to the binary classification case. If we have $K$ classes, the algorithm trains $K$ individual binary classifiers $q_k(y=+1|x)$, $k = 1, \dots, K$. The prediction function is simply $f(x) = \arg\max_k q_k(y=+1|x)$. The advantage of the one-vs-all strategy is that the algorithm can be kept nearly identical to the binary case, at the price of training $K$ different neural networks.
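The one-vs-all prediction rule can be sketched as follows; the linear scorers stand in for the $K$ separately trained binary ICN classifiers, and all names and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
K, d = 3, 4
W = rng.normal(size=(K, d))  # one weight vector per binary classifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    # f(x) = argmax_k q_k(y=+1 | x): pick the binary classifier that is
    # most confident its own class is the positive one.
    return int(np.argmax(sigmoid(W @ x)))

x = rng.normal(size=d)
k = predict(x)
print(k)
```

Since the sigmoid is monotone, the argmax over probabilities equals the argmax over raw scores, so calibrated probabilities are not strictly required for prediction.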
Softmax function. It is also desirable to build a single CNN classifier that performs multi-class classification directly. Here we propose a formulation to train such an end-to-end multi-class classifier. Since we directly deal with $K$ classes, the pseudo-negative data set is slightly different: we introduce pseudo-negatives for each individual class, $\mathcal{T} = \bigcup_{k=1}^{K} \mathcal{T}^k$, where $\mathcal{T}^k$ collects the pseudo-negatives synthesized for the $k$th class. Suppose we are given a training set $S = \{(x_i, y_i)\}$ with $y_i \in \{1, \dots, K\}$. We want to train a single CNN classifier with

$q(y=k|x; \mathcal{W}) = \frac{\exp\{w_k^{(1)} \cdot \phi(x; w^{(0)})\}}{\sum_{j=1}^{K} \exp\{w_j^{(1)} \cdot \phi(x; w^{(0)})\}}$,

where $w^{(0)}$ denotes the internal features and parameters of the single CNN, and $w_k^{(1)}$ denotes the top-layer weights for the $k$th class. We therefore minimize an integrated objective function

(9) $\mathcal{L}(\mathcal{W}) = -\sum_{(x_i, y_i) \in S} \ln q(y_i|x_i; \mathcal{W}) + \lambda \sum_{k=1}^{K} \sum_{x \in \mathcal{T}^k} \ln\big(1 + \exp\{w_k^{(1)} \cdot \phi(x; w^{(0)})\}\big)$.

The first term in Eqn. (9) encourages a softmax loss on the original training set $S$. The second term encourages a good prediction on the individual pseudo-negative set generated for the $k$th class: pseudo-negatives synthesized for class $k$ are pushed to the negative side of the $k$th logit. $\lambda$ is a hyperparameter balancing the two terms. Note that we only need to build a single CNN sharing $\phi(\cdot; w^{(0)})$ for all $K$ classes. In particular, we are not introducing additional model parameters, and we perform a direct $K$-class classification where the parameter setting is identical to a standard CNN multi-class classification task; in comparison, an additional "not-real" class is created in salimans2016improved , and the classification task there thus becomes a $(K+1)$-class classification.

4 Experiments
We conduct experiments on three standard benchmark datasets: MNIST, CIFAR-10 and SVHN. We use MNIST as a running example to illustrate our proposed framework using a shallow CNN; we then show competitive results using a state-of-the-art CNN classifier, ResNet he2016deep , on MNIST, CIFAR-10 and SVHN. In our experiments, for the reclassification step, we use the SGD optimizer with mini-batch size 64 (MNIST) or 128 (CIFAR-10 and SVHN) and momentum 0.9; for the synthesis step, we use the Adam optimizer kingma2014adam with momentum term 0.5. All results are obtained by averaging multiple rounds.
Training and test time. In general, the training time for ICN is around double that of the baseline CNNs in our experiments: 1.8 times for the MNIST dataset, 2.1 times for the CIFAR-10 dataset and 1.7 times for the SVHN dataset. The added training overhead is mostly determined by the number of generated pseudo-negative samples. At test time, ICN introduces no additional overhead over the baseline CNNs.
4.1 MNIST
Method  One-vs-all (%)  Softmax (%)
DBN    
CNN (baseline)  
CNN w/ LS    
CNN + GDL    
CNN + DCGAN    
ICN-noise (ours)  
ICN (ours) 
Test errors on the MNIST dataset. We compare our ICN method with the baseline CNN, a Deep Belief Network (DBN) hinton2006fast , and CNN w/ Label Smoothing (LS) Christian2016ls . Moreover, two-step experiments combining CNN + GDL tu2007learning and combining CNN + DCGAN radford2015unsupervised are also reported; see the descriptions in the text for more details.

We use the standard MNIST lecun1998mnist dataset, which consists of training, validation and test samples. We adopt a simple network containing 4 convolutional layers, each having a filter size with , , and channels, respectively. These convolutional layers have stride 2, and no pooling layers are used. LeakyReLU activations maas2013rectifier are used after each convolutional layer. The last convolutional layer is flattened and fed into a sigmoid output (in the one-vs-all case).

In the reclassification step, we run SGD (for epochs) on the current training data, including previously generated pseudo-negatives. The initial learning rate is and is decreased by a factor of at . In the synthesis step, we use the backpropagation sampling process discussed in Section 3.2.2. In Table 2, we compare different sampling strategies. Each time we synthesize a fixed number ( in our experiments) of pseudo-negative samples.
We show some synthesized pseudo-negatives from the MNIST dataset in Figure 2. The samples in the top row are from the original training dataset. ICN gradually synthesizes pseudo-negatives that are increasingly faithful to the original data. Pseudo-negative samples are used continuously while improving the classification result.
Sampling Strategy  One-vs-all (%)  Softmax (%)

SGD (option i)  
SGD Langevin (option i)  
SGD (option ii)  
SGD Langevin (option ii)  
SGD (option iii)  
SGD Langevin (option iii) 
Comparison of different sampling strategies. We perform SGD and SGD Langevin (with injected Gaussian noise) and try several options via backpropagation: option i) early-stopping once the generated samples are classified as positive; option ii) stopping at a high confidence of samples being positive; option iii) stopping after a large number of steps. Table 2 shows the results; we do not observe significant differences among these choices.
Ablation study. We experiment using random noise as synthesized pseudo-negatives in an ablation study. From Table 1, we observe that our ICN outperforms the CNN baseline and the ICN-noise method in both one-vs-all and softmax cases.
Effects of varying training sizes. To better understand the effectiveness of our ICN method, we carry out an experiment varying the number of training examples. We use training sets of different sizes, including , , , and examples. The results are reported in Figure 3. ICN is shown to be particularly effective when the training set is relatively small, since it can synthesize pseudo-negatives by itself to aid training.
Comparison with GDL and GAN. GDL tu2007learning focuses on unsupervised learning; GAN goodfellow2014generative and DCGAN radford2015unsupervised show results for unsupervised learning and semi-supervised classification. To apply GDL and GAN to the supervised classification setting, we design a two-step experiment. For GDL, we ran the GDL code tu2007learning and obtained pseudo-negative samples for each individual digit; the pseudo-negatives are then used as augmented negative samples to train individual one-vs-all CNN classifiers (using a CNN architecture identical to ICN's for a fair comparison), which are combined to form a multi-class classifier. To compare with DCGAN radford2015unsupervised , we follow the same procedure: each generator trained by DCGAN, using the TensorFlow implementation dcgantensorflow , is used to generate positive samples, which are then added to the negative set to train the individual one-vs-all CNN classifiers (also using a CNN architecture identical to ICN's), which are combined to create the overall multi-class classifier. CNN+GDL achieves a test error of and CNN+DCGAN achieves a test error of on the MNIST dataset, whereas ICN reports an error of using the same CNN architecture. As the supervised learning task was not directly specified in DCGAN radford2015unsupervised , some care is needed to design the optimal setting for utilizing the generated samples in the two-step approach (we made attempts to optimize the results). GDL tu2007learning can be made into a discriminative classifier by utilizing the given negative samples first, but it adopts boosting AdaBoost with manually designed features, which may not produce results as competitive as a CNN classifier's. Nevertheless, the advantage of ICN as an integrated, end-to-end, single-model supervised learning framework can be observed.

To compare with a generative-model-based deep learning approach, we report the classification result of DBN hinton2006fast in Table 1. DBN achieves a test error of using the softmax function. We also compare with Label Smoothing (LS), which has been used in Christian2016ls as a regularization technique that encourages the model to be less confident: for a training example with a ground-truth label, the label distribution is replaced with a mixture of the original ground-truth distribution and a fixed distribution. LS achieves a test error of in the softmax case.
In addition, we also adopt ResNet-32 he2016identity (using the softmax function) as another baseline CNN model, which achieves a test error of on the MNIST dataset. Our ResNet-32-based ICN achieves an improved result of .
Robustness to external adversarial examples. To show ICN's improved robustness in dealing with confusing and challenging examples, we compare the baseline CNN with our ICN classifier on adversarial examples generated using the "fast gradient sign" method from goodfellow2014explaining . This method (with ) can cause a maxout network to misclassify of adversarial examples generated from the MNIST test set goodfellow2014explaining . In our experiment, we set . Starting with the MNIST test examples, we first determine those which are correctly classified by the baseline CNN, in order to generate adversarial examples from them. We find that the generated adversarial examples successfully fool the baseline CNN; however, only of these examples can fool our ICN classifier, which is a reduction in error against adversarial examples. Note that this improvement is achieved without using any additional training data and without knowing a priori how these adversarial examples are generated by the specific fast-gradient-sign method goodfellow2014explaining . Conversely, of the adversarial examples generated from the ICN classifier side that fool ICN using the same method, of them can still fool the baseline CNN classifier. This two-way experiment shows the improved robustness of ICN over the baseline CNN.
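For reference, the fast gradient sign method of goodfellow2014explaining reduces to a one-line perturbation of the input. The sketch below applies it to a toy linear logistic model; the weights, input, and $\epsilon$ are our own illustrative choices, not the paper's setup:

```python
import numpy as np

# Fast gradient sign method: perturb x by eps * sign of the gradient of the
# loss w.r.t. x. For a logistic model with cross-entropy loss, that gradient
# is (p - y) * w.

w = np.array([2.0, -1.0])
b = 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, eps):
    """y in {0, 1}; returns the adversarially perturbed input."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

x = np.array([0.3, 0.1])          # correctly classified as y = 1
p_before = sigmoid(w @ x + b)
x_adv = fgsm(x, 1, eps=0.5)
p_after = sigmoid(w @ x_adv + b)
print(p_before, p_after)  # confidence in the true class drops below 0.5
```

The perturbation is small per coordinate yet aligned with the loss gradient everywhere, which is why it flips predictions so cheaply; ICN's pseudo-negatives populate exactly this kind of near-positive region during training.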
4.2 CIFAR-10
Method  One-vs-all (%)  Softmax (%)
w/o Data Augmentation  
Convolutional DBN    
ResNet-32 (baseline)  
ResNet-32 w/ LS    
ResNet-32 + DCGAN    
ICN-noise (ours)  
ICN (ours)  
w/ Data Augmentation  
ResNet-32 (baseline)  
ResNet-32 w/ LS    
ResNet-32 + DCGAN    
ICN-noise (ours)  
ICN (ours) 
The CIFAR-10 dataset krizhevsky2009learning consists of color images of size . This set of images is split into two parts: images for training and images for testing. We adopt ResNet he2016identity as our baseline model tensorpack . For data augmentation, we follow the standard procedure in DSN ; lee2016generalizing ; he2016identity by zero-padding 4 pixels on each side; we also perform cropping and random flipping. The results are reported in Table 3. In both one-vs-all and softmax cases, ICN outperforms the baseline ResNet classifiers. Our proposed ICN method is orthogonal to many existing approaches that improve network structures to enhance CNN performance. We also compare ICN with a Convolutional DBN krizhevsky2010convolutional , ResNet-32 w/ Label Smoothing (LS) Christian2016ls and ResNet-32 + DCGAN radford2015unsupervised , as described in the MNIST experiments. LS improves the baseline but is worse than our ICN method in most cases, except on the MNIST dataset.

4.3 SVHN
Method  Softmax (%)

ResNet-32 (baseline)  
ResNet-32 w/ LS  
ResNet-32 + DCGAN  
ICN-noise (ours)  
ICN (ours) 
We use the standard SVHN netzer2011reading dataset. We combine the training data with the extra data to form our training set and use the test data as the test set. No data augmentation is applied. The results are reported in Table 4. ICN achieves competitive results.
5 Conclusion
In this paper, we have proposed an introspective convolutional networks (ICN) algorithm that performs internal introspection. We observe performance gains within supervised learning using state-of-the-art CNN architectures on standard machine learning benchmarks.
Acknowledgement This work is supported by NSF IIS1618477, NSF IIS1717431, and a Northrop Grumman Contextual Robotics grant. We thank Saining Xie, Weijian Xu, Fan Fan, Kwonjoon Lee, Shuai Tang, and Sanjoy Dasgupta for helpful discussions.
References

(1)
P. Baldi.
Autoencoders, unsupervised learning, and deep architectures.
In
ICML Workshop on Unsupervised and Transfer Learning
, pages 37–49, 2012.  (2) L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 (3) A. Brock, T. Lim, J. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017.
 (4) M. A. CarreiraPerpinan and G. Hinton. On contrastive divergence learning. In AISTATS, volume 10, pages 33–40, 2005.
 (5) T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient hamiltonian monte carlo. In ICML, 2014.
 (6) S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE transactions on pattern analysis and machine intelligence, 19(4):380–393, 1997.
 (7) Y. Freund and R. E. Schapire. A Decisiontheoretic Generalization of Online Learning And An Application to Boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
 (8) J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.
 (9) L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
 (10) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 (11) I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
 (12) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

 (13) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
 (14) G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
 (15) T. Jebara. Machine learning: discriminative and generative, volume 755. Springer Science & Business Media, 2012.
 (16) T. Kim. DCGAN-tensorflow. https://github.com/carpedm20/DCGAN-tensorflow.
 (17) D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 (18) A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. CS Dept., U Toronto, Tech. Rep., 2009.
 (19) A. Krizhevsky and G. Hinton. Convolutional deep belief networks on cifar10. Unpublished manuscript, 40, 2010.
 (20) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
 (21) J. Lazarow, L. Jin, and Z. Tu. Introspective neural networks for generative modeling. In ICCV, 2017.
 (22) D. B. Leake. Introspective learning and reasoning. In Encyclopedia of the Sciences of Learning, pages 1638–1640. Springer, 2012.
 (23) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.

 (24) Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
 (25) C.-Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, 2016.
 (26) C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.

 (27) P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In ICML, 2008.
 (28) J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.
 (29) A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
 (30) S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
 (31) C. Z. Mooney, R. D. Duval, and R. Duvall. Bootstrapping: A nonparametric approach to statistical inference. Number 94–95. Sage, 1993.
 (32) A. Mordvintsev, C. Olah, and M. Tyka. DeepDream - a code example for visualizing neural networks. Google Research, 2015.
 (33) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

 (34) J. R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77–90, 1996.
 (35) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 (36) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
 (37) B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55–66):11, 2010.
 (38) A. Sinha, M. Sarkar, A. Mukherjee, and B. Krishnamurthy. Introspection: Accelerating neural network training by learning weight evolution. arXiv preprint arXiv:1704.04959, 2017.
 (39) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 (40) Z. Tu. Learning generative models via discriminative approaches. In CVPR, 2007.
 (41) Z. Tu, K. L. Narr, P. Dollár, I. Dinov, P. M. Thompson, and A. W. Toga. Brain anatomical structure segmentation by hybrid discriminative/generative models. Medical Imaging, IEEE Transactions on, 27(4):495–508, 2008.

 (42) V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.
 (43) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
 (44) M. Welling, R. S. Zemel, and G. E. Hinton. Self supervised boosting. In NIPS, 2002.
 (45) Y. Wu. Tensorpack toolbox. https://github.com/ppwwyyxx/tensorpack/tree/master/examples/ResNet.
 (46) Y. N. Wu, S. C. Zhu, and X. Liu. Equivalence of julesz ensembles and frame models. International Journal of Computer Vision, 38(3), 2000.
 (47) J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408, 2016.
 (48) J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. A theory of generative convnet. In ICML, 2016.
 (49) J. Zhao, M. Mathieu, and Y. LeCun. Energybased generative adversarial network. In ICLR, 2017.
 (50) S. C. Zhu, Y. N. Wu, and D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627–1660, 1997.
 (51) X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, Technical Report 1530, 2005.
6 Supplementary material
6.1 Proof of the convergence of $p_t^-(x)$
The convergence of $p_t^-(x)$ can be derived, inspired by the proof from tu2007learning:
$$KL\big[p^+(x)\,\|\,p_{t+1}^-(x)\big] \le KL\big[p^+(x)\,\|\,p_t^-(x)\big],$$
where $KL[\cdot\,\|\,\cdot]$ denotes the Kullback–Leibler divergence and $p_{t+1}^-(x)=\frac{1}{Z_{t+1}}\frac{q_{t+1}(y=+1|x)}{q_{t+1}(y=-1|x)}\,p_t^-(x)$, under the assumption that the classifier at $t+1$ improves over $t$.
Proof:
$$KL\big[p^+(x)\,\|\,p_{t+1}^-(x)\big]=\int p^+(x)\ln\frac{p^+(x)}{p_{t+1}^-(x)}\,dx = KL\big[p^+(x)\,\|\,p_t^-(x)\big] - E_{p^+(x)}\Big[\ln\frac{q_{t+1}(y=+1|x)}{q_{t+1}(y=-1|x)}\Big] + \ln Z_{t+1},$$
and
$$\ln Z_{t+1} = \ln\int \frac{q_{t+1}(y=+1|x)}{q_{t+1}(y=-1|x)}\,p_t^-(x)\,dx,$$
where $Z_{t+1}$ is the normalizing constant of $p_{t+1}^-(x)$.
Since the improved classifier favors $y=+1$ on samples from $p^+(x)$ and $y=-1$ on the pseudo-negatives drawn from $p_t^-(x)$, we have $E_{p^+(x)}\big[\ln\frac{q_{t+1}(y=+1|x)}{q_{t+1}(y=-1|x)}\big]\ge 0$ and $\ln Z_{t+1}\le 0$. Given that the training data and the previously generated pseudo-negative samples are all retained at each step, we assume that the classifier at $t+1$ improves over that at $t$. This shows that $p_t^-(x)$ converges to $p^+(x)$, and the convergence rate depends on the classification error at each step.
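The Kullback–Leibler contraction underlying this convergence argument can be illustrated numerically. The following is a toy sketch, not the paper's algorithm: a discrete grid stands in for the sample space, and an imperfect classifier is modeled by the hypothetical ratio $(p^+(x)/p_t^-(x))^\alpha$ with $\alpha<1$; the grid size, `alpha`, and the distributions are all illustrative assumptions.

```python
import numpy as np

# Toy sketch (not the paper's full algorithm): on a discrete grid, iterate
# the reweighting p_{t+1}(x) = r(x) * p_t(x) / Z.  An imperfect classifier
# is modeled by the hypothetical ratio r(x) = (p_plus(x) / p_t(x)) ** alpha
# with alpha < 1; alpha = 1 would recover p_plus in a single step.
rng = np.random.default_rng(0)
n = 50
p_plus = rng.random(n) + 0.1   # target distribution p^+, bounded away from 0
p_plus /= p_plus.sum()
p_t = np.ones(n) / n           # initial pseudo-negative distribution
alpha = 0.5                    # classifier "strength" (hypothetical)

def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p || q)."""
    return float(np.sum(p * np.log(p / q)))

kls = []
for t in range(10):
    kls.append(kl(p_plus, p_t))
    ratio = (p_plus / p_t) ** alpha   # stand-in for q(y=+1|x)/q(y=-1|x)
    p_t = ratio * p_t
    p_t /= p_t.sum()                  # divide by the normalizing constant Z

print([round(v, 5) for v in kls])     # KL(p^+ || p_t^-) shrinks every step
```

Under this geometric-mixture model one can check analytically that $KL[p^+\|p_{t+1}^-] \le (1-\alpha)\,KL[p^+\|p_t^-]$, so the recorded divergences decrease monotonically toward zero, mirroring the proof's inequality.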